Text Corpora and Multilingual Lexicography

Text Corpora and Multilingual Lexicography Benjamins Current Topics Special issues of established journals tend to ci...

Author: Wolfgang Teubert

78 downloads 1572 Views 2MB Size Report

This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!

Report copyright / DMCA form

DOWNLOAD PDF

Text Corpora and Multilingual Lexicography

Benjamins Current Topics Special issues of established journals tend to circulate within the orbit of the subscribers of those journals. For the Benjamins Current Topics series a number of special issues have been selected containing salient topics of research with the aim to widen the readership and to give this interesting material an additional lease of life in book format.

Volume 8 Text Corpora and Multilingual Lexicography Edited by Wolfgang Teubert These materials were previously published in International Journal of Corpus Linguistics 6, Special Issue (2001)

Text Corpora and Multilingual Lexicography

Edited by

Wolfgang Teubert University of Birmingham

John Benjamins Publishing Company Amsterdam / Philadelphia

8

TM

The paper used in this publication meets the minimum requirements of American National Standard for Information Sciences – Permanence of Paper for Printed Library Materials, ansi z39.48-1984.

Library of Congress Cataloging-in-Publication Data Text corpora and multilingual lexicography / edited by Wolfgang Teubert. p. cm. -- (Benjamins current topics, ISSN 1874-0081 ; v. 8) Includes bibliographical references and index. 1. Lexicography--Data processing. 2. Corpora (Linguistics) I. Teubert, Wolfgang. P327.5.D37T49 2007 413'.028--dc22 2007009684 isbn 978 90 272 2238 1 (Hb; alk. paper)

© 2007 – John Benjamins B.V. No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any other means, without written permission from the publisher. John Benjamins Publishing Co. · P.O. Box 36224 · 1020 me Amsterdam · The Netherlands John Benjamins North America · P.O. Box 27519 · Philadelphia pa 19118-0519 · usa

JB[v.20020404] Prn:31/05/2007; 9:26

F: BCT8CO.tex / p.1 (48-128)

Table of contents

Preface Automatic extraction of terminological translation lexicon from Czech-English parallel texts ˇ Martin Cmejrek and Jan Cuˇrín Words from Bononia Legal Corpus R. Rossini Favretti, F. Tamburini, and E. Martelli

vii

1

11

Hybrid approaches for automatic segmentation and annotation of a Chinese text corpus Zhiwei Feng

31

Distance between languages as measured by the minimal-entropy model: Plato’s Republic – Slovenian versus 15 other translations Primož Jakopin

39

The importance of the syntagmatic dimension in the multilingual lexical database R¯uta Marcinkeviˇcien˙e

49

Compiling parallel text corpora: Towards automation of routine procedures Mihail Mihailov and Hannu Tommola

59

Data-derived multilingual lexicons John McH. Sinclair

69

Bridge dictionaries as bridges between languages Hana Skoumalová

83

Procedures in building the Croatian-English parallel corpus Marko Tadi´c

93

JB[v.20020404] Prn:31/05/2007; 9:26



F: BCT8CO.tex / p.2 (128-144)

Table of contents

Corpus linguistics and lexicography Wolfgang Teubert

109

Analysing the fluency of translators Rafał Uzar and Jacek Wali´nski

135

Equivalence and non-equivalence in parallel corpora Tamás Váradi and Gábor Kiss

147

Index

157

JB[v.20020404] Prn:14/05/2007; 10:00

F: BCT8PR.tex / p.1 (45-136)

Preface

I am very glad to see this special issue of the International Journal of Corpus Linguistics on multilingual applications of corpus research republished. All the contributions in this slim volume, reprinted here without any changes, evolved from the EU-funded project Trans-European Language Resources Infrastructure (TELRI), which was kicked off in 1995 and ended in 2002. The aspiration behind this venture was to bring together researchers, linguists and computer scientists from all over Europe who share an interest in multilingual language processing. There had been little communication between East and West as long as there was an Iron Curtain. Even when it finally fell, few researchers in Western Europe had a concrete idea or even took an interest in what had been and was happening the Central and Eastern European countries, unaware of the truly enormous potential hidden there. TELRI was truly multilateral. It comprised teams in over twenty countries, including Russia and even China. All teams were striving towards the compilation of corpora of real language data and were searching for ways of using these monolingual, bilingual and multilingual resources for the development of human language technology. TELRI, I am very glad to say, was successful in organising a durable network of academic researchers working on these issues. This cooperation is desperately needed, if we do not want to let the unresolved issue of multilinguality stand in the way of the idea of Europe. Europe stands for unity in diversity, ethnic, cultural and linguistic diversity. The more integrated Europe becomes the more it will be a fact of life for many of us to communicate with people whose native language is not ours. Teaching more foreign languages, necessary as it is, cannot be the only solution. The other option, namely that we all switch to English as the sole lingua franca, is met only with muted enthusiasm on the continent. To make multilinguality work we need tools facilitating the task of translating and of writing texts in non-native languages. It is true that we have been told for the last fifty years that machine translation able to deal with the kind of everyday language we find in newspapers is just around the corner. It is still there, it seems. It could hardly be else. The traditional approach to machine translation just does not work. The grandiose failure of the well-funded EUROTRA project, designed to translate between all the then eleven official languages of the European Community, is emblematic.

JB[v.20020404] Prn:14/05/2007; 10:00

F: BCT8PR.tex / p.2 (136-189)

 Preface

The strange thing is that this failure did not lead to experts to abandon this approach. We find it again, a decade later, in the EU-funded EuroNet project, aimed at setting up a multilingual conceptual ontology. Such an ontology, the idea is, would facilitate translating between the languages involved. It is based on the vision of an interlingua, a formal representation of concepts, claiming to use, in its definitions, expressions of a formal language independent of all natural languages. It is the idea of a conceptual ontology, a network of all possible concepts in which all the possible relationships obtaining between them are specified. Behind it is the dream of a perfect language, a language in which each concept refers to some ‘real’ thing in a ‘real world’, and that, as these things exist independently of us and our language, we can use the concepts denoting them as the Archimedean point of translation between the imperfect vernaculars we normally use. Thus we can, for instance, posit a concept ‘father’ and describe its relationship with the concepts ‘son’ and ‘daughter’, and the concepts of ‘husband’ and ‘wife’, and, extending the network, more concepts such as ‘brother’, ‘sister’, ‘uncle’, ‘aunt’ and so on. A complete multilingual ontology would contain concepts onto which all the words with all their senses in all the natural languages could be mapped. There are a few problems, though. How can we design a formal definition language without using a natural language? If dictionaries cannot agree about the number and meanings of the senses a word has, how do we identify concepts? If words have an identifiable meaning only in virtue of the other words in whose context they occur, how then can we know what concepts mean in isolation? Do concepts always have the same meaning and reference regardless of the context in which they recur? Should we view a phrase such as civilised world as a combination of two concepts or as a single concept? The EuroNet project took, as its conceptual ontology, a slimmed-down version of the Princeton WordNet, which has not been designed as a language-independent ontology, but as an online English language dictionary. The word senses described in this WordNet dictionary are given the status of concepts. There are now, we are told, compatible WordNets for the other languages involved, the ensuing ensemble amounting to a multilingual conceptual ontology. It is a nice idea, but does it work? As far as I know, no translator actually uses it. Translation does not work that way. TELRI advocated a different approach. Instead of inventing language-independent ontologies, it looked at the issue of translation equivalence from the perspective of real language data, i.e. of parallel corpora containing original texts in one language and its translations in other languages. This meant replacing the top-down approach of systematic linguistics by the bottom-up approach of parole-linguistics. Translation equivalence is not something that exists independently of translations, and of the translators who carry them out. A translation equivalent is not something a translator discovers but something she or he invents. If there

JB[v.20020404] Prn:14/05/2007; 10:00

F: BCT8PR.tex / p.3 (189-220)

Preface

has never been a text translated from language A into language B, let’s say from Samian, spoken in the north of Sweden and Norway, into Romantsch, spoken in some remote Swiss valleys, it is up to the first translator to find the Romantsch equivalents for the Samian words. The question then is whether they will be accepted by other people fluent in both languages. They might accept some choices but reject others. Subsequent translations will come up with different equivalents. There is nowhere an Archimedean point to determine correctness. Even if at some point some philanthropic publisher would fund a Samian-Romantsch dictionary, this dictionary could be based on nothing but the existing translations. Bilingual dictionaries are, in the best of cases, repositories of the evidence gleaned from a comparison between original texts and their target language versions. There is nothing else that would tell us about translation equivalence. But unfortunately bilingual dictionaries record more faithfully the preconceptions of their compilers than the empirical evidence. This individual competence may be fully appropriate at times, but also can be mistaken. A sufficiently large parallel corpus would give us the most common, and therefore the most acceptable, equivalents for a source language expression. One of the achievements of TELRI was the compilation of a multilingual parallel corpus, consisting of more than a dozen translations of Plato’s Republic, many of which were aligned on the sentence level. Some teams also began to develop software aiming at alignment on the lexical level. Much remains to be done, though. Little funding is available for this kind of bottom-up research into translation equivalence. It is still the case that much more money is invested in ontology building. But there can be little doubt that the bilingual dictionary and the machine translation lexicon of the future will be based on the evidence extracted from parallel corpora. Instead of a language-neutral ontology of single concepts in isolation, it will be words embedded in their unique contexts that will resolve semantic ambiguity, machine translation’s worst enemy. However this approach has its price. Compiling parallel corpora takes a lot of effort, and extracting translation equivalents from them will take even more. Yet it is the only workable alternative to the traditional machine translation philosophy. The contributions in this book discuss various aspects of this new and exiting enterprise. While they do not provide ready solutions, they offer much food for thought. I hope that by making them available to a wider public the corpus approach top translation equivalence will find new friends. Wolfgang Teubert



JB[v.20020404] Prn:15/02/2007; 12:54

F: BCT801.tex / p.1 (46-114)

Automatic extraction of terminological translation lexicon from Czech-English parallel texts ˇ Martin Cmejrek and Jan Cuˇrín We present experimental results of an automatic extraction of a Czech-English translation dictionary. Two different bilingual corpora (119,886 sentence pairs computer-oriented and 58,137 journalistic corpora) were created. We used the length-based statistical method for sentence alignment (Gale and Church 1991) and noun phrase marker working with regular grammar and probabilistic model (Brown et al. 1993) for dictionary extraction. Resulting dictionaries’ size varies around 6,000 entries. After significance filtering, weighted precision is 86.4% for computer-oriented and 70.7% for journalistic Czech-English dictionary.

Introduction The primary motivation for our research was to create a translation lexicon of terminology of a particular discipline. Many disciplines lack relevant dictionaries or the dictionaries are obsolete because of the rapid development of the discipline. We assume that the fundamental part of the translation lexicon could be generated from the parallel corpora of texts translated up to now automatically and afterwards it could be manually edited. We have followed the work in the field of automatic sentence alignment (Gale and Church 1993) and combined the method of automatic extraction of a translation dictionary based on a model of word-to-word translation probabilities (Brown et al. 1993, Wu and Xia 1994) with a noun phrase extractor. The paper has five sections. In the first section, we will describe the material from which the translation lexicon was extracted, that is the English-Czech corpus. Section 2 is devoted to the statistical alignment of corresponding paragraphs and sentences in this corpus. In Section 3, we will focus on the noun phrases. The noun phrases in the aligned pairs of sentences were marked using regular grammarbased tools. This output was used for the training of the statistical model of translation. We will deal with this model in Section 4. This procedure resulted in a

JB[v.20020404] Prn:15/02/2007; 12:54



F: BCT801.tex / p.2 (114-164)

ˇ Martin Cmejrek and Jan Cuˇrín

translation lexicon, and this lexicon was processed afterwards by several automatic filters. This will be shown in Section 5.

English-Czech parallel corpora Work in the field of automatic sentence alignment (Gale and Church 1993) and automatic extraction of translation dictionaries (Brown et al. 1993, Wu and Xia 1994) has exploited very large corpora of parallel texts from parliaments in bilingual countries such as Canada and Hong Kong. The first two (Gale and Church, Brown et al.) used the Canadian Hansards English-French Corpus; the third used the HKUST English-Chinese Corpus. These corpora are very large (around 2 mil and 0.4 mil pairs of sentences) and primarily contain highly equivalent, literal, and tight translations. The situation in our country is different. We lack such a good source of large bilingual data. We used a smaller corpus of texts taken from a particular discipline, a computer-oriented corpus. The corpus consists of operating system messages from IBM AIX and of operating system guides for IBM AS/400 and VARP 4. The translations are literal and tight. In most cases, sentences are translated sentence by sentence. It means that there is a one-to-one correspondence between an English sentence and a Czech sentence. On the other hand, it is a typical feature of this kind of text that the majority of operating system messages and a majority of sentences from guides do not have a verb. This corpus contains 119,886 pairs of sentences. We also have access to data from Reader’s Digest Výbˇer magazine. Thirty to sixty percent of the articles in this magazine have been translated from English to Czech. The translations in Reader’s Digest are mostly very free. This corpus contains 58,137 pairs of sentences. The experiments were also carried out on this corpus, and the results were compared to those obtained from computer-oriented corpus. Both corpora are automatically morphologically tagged by BH tagging tools (Hajiˇc and Hladká 1998).

Statistical alignment of paragraphs and sentences in English-Czech parallel corpora For the subsequent training procedure, we needed to automatically identify matching sentences between both languages. There are two main approaches, lexical and statistical. Many works use one of them or combine both of them. Lexical approaches are based on bilingual dictionaries while the statistical ones use simple probabilistic models usually based on the length of aligned sentences, mostly on

JB[v.20020404] Prn:15/02/2007; 12:54

F: BCT801.tex / p.3 (164-258)

Automatic extraction of terminological translation lexicon

the number of characters. English-Czech dictionary was available, so we decided to use the length-based approach described by (Gale and Church 1993): Pr(Le ⇔ Lc | Te , Tc ) ≈ Pr(Le ⇔ Lc | le , lc ), where Te and Tc are the corresponding paragraphs; Le and Lc are obviously 0 .. 2 sentences from the respective paragraphs; le , lc are lengths of Le and Lc . We considered six types of matching sentences: 1–1, 1–0, 0–1, 1–2, 2–1, 2–2. (Paragraphs were also aligned on the journalistic corpus.) The following table briefly summarises results of the automatic paragraph and sentence alignment. computer-oriented

journalistic corpus

1245780 1089813

959583 860757

88790 88790

19567 24874

#sentences (English) #sentences (Czech)

120743 121295

70872 67856

#aligned sentences

119886

64736

types of alignment: 1–1 0–1 (En/Cz) 1–0 (En/Cz) 1–2 (En/Cz) 2–1 (En/Cz) 2–2

117450 73 36 1397 882 48

#words (English) #words (Czech) #paragraphs (English) #paragraphs (Czech)

accuracy

96%

98% 0% 0% 1% 1% 0%

37039 5311 4454 9501 7342 1089

57% 8% 7% 15% 11% 2%

85%

Noun phrases identification We aim for a terminological dictionary, that is a dictionary also containing translation equivalents consisting of more than one word. For example, a single word “typewriter” in English is translated by a two-word noun phrase “psací stroj” in Czech. On the other hand, the noun phrase “construction worker” which consists of two words in English corresponds to a single word “stavbaˇr” in Czech. We suppose that terminology can be primarily found in noun phrases. Therefore our approximation consists on focusing on noun phrases only. Thus our approach does not cover phrasal verbs like set up, turn on/off, boot up, etc. I would like to reiterate here that many messages of the operation system do not have a verb at all. The identification of noun phrases is based on a simple regular grammar. Grammar rules can be easily modified. Our grammar identifies the most typical



JB[v.20020404] Prn:15/02/2007; 12:54



F: BCT801.tex / p.4 (258-351)

ˇ Martin Cmejrek and Jan Cuˇrín

noun phrases that consist of nouns, adjectives, and some auxiliary words. Only continuous sequences of words are considered to be noun phrases. Since parsing of a sentence is not always ambiguous, we chose a combination of noun phrases so that the phrases do not overlap; they cover the maximal count of words in a sentence, and the number of noun phrases in a sentence is the minimal. Once the noun phrases are identified, we can further process the data before we start the automatic extraction of the translation lexicon. The marked noun phrases are converted into basic form, that is definite and indefinite articles are removed from English phrases, and the head of a Czech noun phrase is transferred into nominative. In addition, Czech verbs are transferred into infinitive. The idea is to concatenate words that build the potential noun phrase into one string, that is to consider these constructions as single “words”. This will be the input for a standard statistical training procedure. Once the data are prepared, the translation dictionary training procedure can start.

Noun phrase grammar English

Czech

NP → N NP → A N NP → A A N NP → N Rof N NP → A N Rof N NP → N Rof A N NP → DT× N NP → DT× A N NP → DT× N Rof N NP → DT× A N Rof N NP → DT× N Rof A N NP → N Rof DT N NP → A N Rof DT N NP → N Rof DT A N NP → N Rof DT N NP → DT× N Rof DT N NP → DT× A N Rof DT N NP → DT× N Rof DT A N

NP → A* NP → A* N* NP → A* A* N* NP → A* N$ 2 NP → A* A$ N$ 2 NP → A* N* N$ 2 NP → E1 N* NP → A* E1 NP → A* E2 NP → A* E2 E2 NP → A* E3 NP → A* E3 E3 ... NP → A* E4 E4 E4 E4 NP → A* E5 NP → A* E5 E5 NP → A* E5 E5 E5 NP → E5 E5 E5 E5 E5

N N2 A Rof DT En

→ → → → → →

noun noun in accusative adjective preposition ‘of ’ definite or indefinite article non-translated English word in corresponding Czech sentence (n is the number of words in English phrase. Example: File Manager window / Czech: okno File Manager)

JB[v.20020404] Prn:15/02/2007; 12:54

F: BCT801.tex / p.5 (351-463)

Automatic extraction of terminological translation lexicon

Examples of selected noun phrases Computer-oriented corpus: En: Cz: En: Cz: En:

The device driver indicates a hardware failure of equipment. Ovladaˇc zaˇrízení zjistil technickou závadu pˇrístroje. &_device_driver_# indicates _hardware_failure_of_equipment_# . &_ovladaˇc_zaˇrízení_# zjistit &_technická_závada_pˇrístroje_# . The QUSRSYS library must install successfully before the INZSYS process is automatically started. Cz: Pˇred automatickým spuštˇením procesu INZSYS musí být úspˇešnˇe nainstalována knihovna QUSRSYS. En: &_QUSRSYS_library_# must install successfully before &_INZSYS_process_# is automatically started . Cz: pˇred &_automatické_spuštˇení_# &_proces_INZSYS_# muset být úspˇešnˇe nainstalovat &_knihovna_QUSRSYS_# .

Journalistic corpus: En: Just then, they saw cowboys coach Eddie Sutton walk toward the court with a man pushing a kid in a wheelchair. Cz: V tu chvíli zahlédli, jak na hˇrištˇe pˇrichází jejich trenér Eddi Sutton s mužem, který na invalidním vozíku vezl malého chlapce. En: just then they saw &_cowboys_coach_# Eddie Sutton walk toward &_court_# with a &_man_# pushing &_kid_# in &_wheelchair_# . Cz: v ten &_chvíle_# zahlédnout jak na &_hˇrištˇe_# pˇricházet jeho &_trenér_# Eddi Sutton s &_muž_# který na &_invalidní_vozík_# vézt &_malý_chlapec_# . En: Sutton explained that Scott had lost part of his leg to bone cancer. Cz: Sutton hráˇcum vysvˇetlil, že Scott ztratil cˇ ást levé nohy kvuli rakovinˇe kosti. En: sutton explained that scott had lost &_part_# of his &_leg_# to &_bone_cancer_# . Cz: sutton &_hráˇc_# vysvˇetlit že scott ztratit &_ˇcást_levé_nohy_# kvuli &_rakovina_kosti_# .

Translation dictionary training We implemented models 1 and 2 described in (Brown et al. 1993) of sentence translation probability and used iterative EM algorithm for maximizing the likelihood of generating the Czech translations from the English text. Model 1 le lc ε Prt (c |e ) = t ci ej lc le + 1 i=1 j=0 Model 2 Prt,a (c |e ) = ε

le a1 =0

...

le lc t ci eai a ai i, lc , le alc

i=0



JB[v.20020404] Prn:15/02/2007; 12:54



F: BCT801.tex / p.6 (463-524)

ˇ Martin Cmejrek and Jan Cuˇrín

Model 1 is based on a word-by-word translation probability t (c |e ) only and approximates the probability of translating the English sentence e into the Czech sentence c. Model 2 extends Model 1 by using word alignment probabilities a ai i, lc , le . The EM algorithm results in a probabilistic dictionary which assigns translation probability to every pair of Czech and English words which have been found together in corresponding sentences.

Significance filtering and the evaluation of results It is necessary to “clean up” the probabilistic dictionary by filtering out most of the translations to produce a useful dictionary. The principle of significant filtering is to find a combination of just a few filtering criteria that affects the quality of the representative sample of the dictionary (we included and examined about 4% of all entries) in the best way. This combination is used to filter the whole dictionary. Let us define following indicators: Recall = #marked correct translations , #marked translations Precision = #marked correct translations , #marked translations Share = total of probabilities of correct translations. We have established these criteria:

criterion

Description

Frst(n) Thd(p)

First n translations selected. Only translations accounting for the top of the treshold p are retained.

MC(m,n)

Works only for entries with occurence lower than m. Translations having count higher than n are excluded if they have not been selected as phrases. Translation probabilities for each entry are recomputed.

MPr(p)

Translations with the translation probability lower than p are excluded. Translation probabilities for each entry are recomputed. Works only for entries selected as phrases. Translations with translation probabilities lower than p are excluded if they have not been selected as phrases. Translation probabilities for each entry are recomputed.

MPr’(p)

JB[v.20020404] Prn:15/02/2007; 12:54

F: BCT801.tex / p.7 (524-645)

Automatic extraction of terminological translation lexicon

We have tested these combinations of filtering criteria: combination of criteria

computer-oriented corpus recall/precision/share (%)

Thd(0.85) ∼ input dictionary 100.0/38.0/63.6 Frst(1) 56.6/88.1/88.1 Thd(0.7) 94.1/49.8/69.8 Thd(0.7) → MPr(0.08) 82.4/60.6/73.1 Thd(0.7) → MPr(0.05) → 81.3/62.1/73.8 MPr(0.05) Thd(0.7) → MPr(0.05) → 81.3/68.5/78.4 MPr(0.05) → MPr’(0.3) MC(700,1800) → Thd(0.7) → 84.2/72.6/83.7 MPr(0.05) → MPr(0.09) → MPr’(0.3) Thd(0.7) → MC(700,1800) → 83.8/74.8/84.9 MPr(0.05) → MPr(0.09) → MPr’(0.3)

journalistic corpus recall/precision/share (%) 100.0/10.5/26.0 57.3/61.7/61.7 96.7/13.6/30.4 73.4/47.3/55.9 81.3/38.7/49.7 80.9/41.7/54.4 81.7/51.5/63.6 81.3/51.4/63.9

Sample of the resulting English-Czech dictionary – computer texts: manage [177] spravovat} (0.47), rˇídit (0.31), správa (0.22) managed [21] rˇídit (0.36), spravovat (0.27), program (0.18), server/400 (0.18) management* [37] management*, rˇízení* (0.22) manager’s maintenance operating* [10] operating (0.77), podrobnˇejší informace* (0.23) manager* [76] manager* (1.00) manager software operating* [13] operating (0.34), nalézt (0.32), SC19 (0.18), program* (0.16) manages [14] rˇídit (1.00) managing [87] rˇízení* (0.36), správa* (0.27), spravující stroje* (0.22), spravující stroj* (0.16) managing system* [13] rˇídicí systém* (1.00) manual* [105] manual* (0.44), manuál* (0.36), pˇríruˇcka* (0.21) manually [130] ruˇcnˇe (0.79), manuálnˇe (0.21) manuals* [11] pˇríruˇcky* (0.57), knihovna* (0.21), vyhledávání informací* (0.21) manual installation* [13] ruˇcní instalace* (1.00) manual installation process* [22] proces ruˇcní instalace$* (0.44), ruˇcní instalace* (0.41), proces* (0.15) manual install display* [10] obrazovka manual install* (0.60), objevit (0.40) manual IPL* [14] IPL (0.54), manuální (0.46) manual mode* [26] režim manual* (1.00) manufacturer* [10] výrobce* (0.83), zaˇrízení IBM* (0.17) many [404] mnoho (0.87), kolik (0.13) map [12] mapovat (0.51), AS/400 (0.28), datové typy* (0.21) map* [31] mapa* (0.68), map* (0.32)



JB[v.20020404] Prn:15/02/2007; 12:54



F: BCT801.tex / p.8 (645-760)

ˇ Martin Cmejrek and Jan Cuˇrín

mapped [22] mapovat (1.00) mapping* [45] mapování* (0.45), macintosh (0.30), pˇriˇrazení* (0.25) maps [19] mapy* (0.56), instalaˇcní (0.22), jeho (0.22) maps* [10] mapy* (0.62), aplikace* (0.38) margins* [13] okraje* (0.85), rˇádek* (0.15) mark [19] oznaˇcit} (1.00) mark* [18] oznaˇcit (0.54), znaˇcka* (0.46) marked [43] oznaˇcit (0.83), oznaˇcený (0.17) marketing representative* [62] obchodní zástupce* (1.00) marks [13] uvést (0.40), klíˇcové slovo* (0.40), uvozovky* (0.20) mask* [35] maska* (0.59), maska podsítˇe* (0.41) master* [13] master* (1.00) master installation list* [50] hlavní instalaˇcní formuláˇr* (1.00) match [177] odpovídat (0.87), souhlasit (0.13) match* [46] odpovídat (0.41), odpovídající protˇejšek* (0.31), shoda* (0.28) ˇ matched [11] odpovídat (0.23), za (0.23), nalezený (0.18), další pˇríkazy* (0.12), splnovat (0.12), vyhovˇet (0.12) matches [56] odpovídat (0.85), souhlasit (0.15) matching* [13] odpovídající* (0.63), odpovídat (0.37) material* [30] materiál* (1.00) materials* [11] materiály* (0.60), materiál* (0.40) matrix [16] matice* (1.00) max [41] max (0.79), maximálnˇe (0.21) maximum* [137] maximum* (0.52), maximálnˇe (0.48) maximum length* [18] maximální délka* (0.72), maximální délka parametru* (0.28)

Sample of the resulting English-Czech dictionary – journalistic corpus. rachot* [11] crashing (0.61), its (0.39) rada* [33] never (0.44), make (0.30), kids* (0.27) radˇeji [136] rather (0.53), better (0.29), prefer (0.18) radio* [18] radio* (1.00) radit [82] told (0.67), how (0.33) radnice* [13] city hall* (0.55), day* (0.45) radost* [67] joy* (0.74), happiness* (0.26) radovat [24] joy* (0.53), well (0.47) rady* [26] current article text* (0.58), advices* (0.42) rajˇcata* [11] tomatoes* (0.57), tomato (0.43) raketa* [10] rocket* (1.00) raketoplán* [11] shuttle* (1.00) rakety* [10] missiles* (0.59), missile* (0.41) rakev* [10] husband* (0.54), coffin* (0.46) rakovina* [96] cancer* (1.00) rakovina plic* [16] lung cancer* (1.00) rakovina prsu* [22] breast cancer* (0.74), cancer* (0.26)

JB[v.20020404] Prn:15/02/2007; 12:54

F: BCT801.tex / p.9 (760-855)

Automatic extraction of terminological translation lexicon

ralph [41] ralph (0.81), jaymee (0.19) ramena* [59] shoulders* (0.81), arm* (0.19) rameno* [70] shoulder (0.83), right (0.17) ranˇc* [22] ranch* (0.75), years* (0.25) raul [17] raul (0.85), dad* (0.15) ravussin [10] fat calories* (0.26), she’s (0.26), sanchez (0.25), carbohydrates (0.23) razit [14] quickly (0.55), wrong (0.45) rád [345] love (0.66), loved (0.19), don’t (0.15) rádius* [21] radio* (0.71), disc* (0.29) rámec* [11] companies* (0.55), endometrial (0.45) rána* [53] hole* (1.00) ráno* [57] morning* (0.69), wound* (0.16), girls* (0.15) rány* [19] wounds* (0.53), blows* (0.47) reader’s [10] german (0.34), again (0.22), american (0.22), those (0.22) reader [20] digest (0.39), reader’s (0.32), year* (0.15), nearly (0.15) reagovat [108] respond (0.69), react (0.31) reakce* [26] reaction* (0.71), something* (0.29) recept* [13] recipe* (0.57), prescription* (0.43) reeves [13] reeves (0.82), scott’s (0.18) reid [11] reid (0.84), golf balls* (0.16) reklama* [23] advertising* (1.00) rentgen* [10] rays* (0.50), x (0.50) republikáni* [11] gingrich (0.60), showing (0.20), divorce* (0.20) respektovat [10] respect* (0.53), got (0.47) respirátor* [12] respirator* (1.00) restaurace* [51] restaurant* (1.00) rezervace* [16] park* (0.67), elephants* (0.33) režisér* [10] spielberg (0.59), director* (0.41)

Conclusion The reported experiments are, to our knowledge, the first demonstration of the methods mentioned above for Czech and English parallel corpora. The results of automatic paragraph and sentence alignment on the computer-oriented corpora reach a similar quality (96%) as those achieved on Canadian Hansards. Results on fiction corpora are worse (85%) because of the lower quality and non-literality of translations. The results of the dictionary extraction for the computer-oriented corpora are of unexpectedly high-share (weighted precision) rates about 85%, and for the terminology dictionary (containing only noun phrases), they are even better, 87%–91%. Soon, the results of this work will be used in practice for translation purposes.



JB[v.20020404] Prn:15/02/2007; 12:54



F: BCT801.tex / p.10 (855-882)

ˇ Martin Cmejrek and Jan Cuˇrín

References Brown Peter F., S. A. Della Pietra, V. J. Della Pietra, and Robert L. Mercer. 1993. “The Mathematics of Statistical Machine Translation: Parameter Estimation.” In Computational Linguistics 19(2): 263–331. ˇ Cmejrek Martin. 1998. “Automatická extrakce dvojjazyˇcného pravdˇepodobnostního slovníku ˚ MSc. Thesis, Institute of Formal and Applied Linguistics. Charles z paralelních textu.” University, Prague. 82 pp. (in Czech). Cuˇrín Jan. 1998. “Automatická extrakce pˇrekladu odborné terminologie.” MSc. Thesis, Institute of Formal and Applied Linguistics. Charles University, Prague. 89 pp. (in Czech). Gale William A. and Kenneth W. Church. 1993. “A Program for Aligning Sentences in Bilingual Corpora.” In Computational Linguistics 19(1): 75–102. Hajiˇc Jan and Barbora Hladká. 1998. “Tagging Inflective Languages: Prediction of Morphological Categories for Rich, Structured Tagset.” In Proceedings of Coling/ACL’98. Montreal, Canada. Wu Dekai and Xia Xuanyin. 1994. “Learning an English-Chinese Lexicon from a Parallel Corpus.” In Association for Machine Translation in the Americas, Oct. 94: 206–213. Columbia, USA.

JB[v.20020404] Prn:15/02/2007; 13:03

F: BCT802.tex / p.1 (46-111)

Words from Bononia Legal Corpus R. Rossini Favretti, F. Tamburini, and E. Martelli1

The analysis of special multilingual corpora is still in its infancy, but it may serve a particularly important role for the directions it offers both in cross-linguistic investigation and in the selection of the most typical features of text types and genres. To exemplify the information which can be obtained from corpus evidence, the paper reports on an on-going corpus-driven research project, named Bononia Legal Corpus (BOLC). The main aim of BOLC is to build multilingual machine readable law corpora. Data are at present limited to English and Italian, but an extension is envisaged to include other languages. Before the first sample, a preliminary pilot corpus was constructed to consider European legislation and create a conceptual framework to be used as a first-level experience. In the paper, Sections 2 and 3 describe the corpus design and formatting as well as the corpus access tools. Sections 4 and 5 discuss two case studies and analyse two semantic areas which can be seen as two ends of the same variational continuum. At one end, we consider the words contratto and contract, which through the extension of international transactions and circulation may be supposed to have acquired transnational traits. At the other, we focus on a semantic area which may be expected to present translation problems for the differences existing in the two socio-institutional systems. Reference is made to the English words tax and duty and to the Italian words tassa and imposta.

.

Introduction

The use of computer-based text corpora can be considered one of the most significant developments in linguistic research in the last decade. Text processing has opened wide perspectives in the investigation of data for scientific purposes. It has become a major concern to approach linguistic data through large corpora of naturally occurring language, attaining insights into different levels of language . R. Rossini Favretti took charge of Sections 1, 4, 5, and 6; F. Tamburini and E. Martelli took charge of Sections 2 and 3.

JB[v.20020404] Prn:15/02/2007; 13:03



F: BCT802.tex / p.2 (111-160)

R. Rossini Favretti, F. Tamburini, and E. Martelli

description. On the one hand, the approach has been facilitated by the developments in hardware technology and by on-line access to textual resources. On the other, it has taken advantage of computational techniques for the retrieval and statistical processing of the data. Corpus linguistics has had an important impact on different aspects of linguistic research, and statistical tabulation has proved to be a basic starting point not only for quantitative but also for qualitative analysis of different types of language. A great number of general corpora were constructed, and relevant results have been obtained. In our opinion, corpus evidence may serve a particularly important role in the analysis of special corpora for the directions it offers in the investigation of large samples of texts and in the selection of the most typical features of text types and genres. The paper reports on an ongoing corpus-driven research project carried out at the University of Bologna. The main aim of the project – named Bononia Legal Corpus, or BOLC – is to build multilingual comparable machine-readable law corpora. It is an interdisciplinary project, and John Sinclair has played a crucial role as consultant. Work was begun in 1997, and, if everything goes according to plan, carrying out the project will take five years, 1997–2001. Data are at present limited to English and Italian, but an extension is envisaged to include other languages. As to the size of the corpus, we set 10 million words as the smallest target for each component. English and Italian legal texts were chosen as representative of two different legal systems and of differences existing between the common law system developed in England and the civil law system, based on the Roman law, developed in Italy. Before the first sample, a preliminary pilot corpus was constructed to consider European legislation for the transnational dimension which is implied in the coexistence and cooperation of different nationalities. It was directed at creating a conceptual framework to be used as a first-level reference. We chose to refer to secondary Community legislation and, in particular, to “Directives” and “Judgments” as they may be implemented by domestic legislation and may produce direct legal effects in member states. They are seen as text types on either side of the border between parallel1 and comparable2 corpora. As the texts are to be representative . A “parallel corpus” has been described as “a bilingual or multilingual corpus that contains one set of texts in two or more languages” (Teubert 1996: 245). According to Teubert, it may contain 1) only texts originally written in language A and their translations into languages B (and C . . . ); 2) an equal amount of texts originally written in languages A and B and their respective translations; or 3) only translations of texts into the languages A, B, and C, whereas the texts were originally written in language Z. . The term “comparable” is used to describe corpora in two or more languages that have a similar composition and can be compared because of their common features.

JB[v.20020404] Prn:15/02/2007; 13:03

F: BCT802.tex / p.3 (160-212)

Words from Bononia Legal Corpus

of contemporary legal language, the documents chosen were issued in the period 1968–1995. Reviewing briefly, the research is aimed at providing contrastive information on meaning and usage to guide lexicon builders and at indicating the standards of accuracy and detail required of future lexicons to be effective tools for translation and other applications. In this paper, Sections 2 and 3 describe the corpus design and formatting as well as the tools used to access corpus data. Sections 4 and 5 discuss two case studies on the basis of the analysis carried out in the pilot corpus now available – about 18 m.w. We consider two semantic areas which can be seen as two ends of the same variational continuum. At one end, we will consider the English word “contract” and the Italian word “contratto” which, through the extension of international transactions and circulation, may be supposed to have acquired transnational traits. On the other, we will focus on a semantic area which may be supposed to present translation problems for the differences existing between the two socio-institutional systems. Reference will be made to the English words “tax” and “duty” and to the Italian words “tassa” and “imposta”.

. Corpus design and formatting The BOLC pilot corpus consists entirely of European Community documents, mainly directives and judgments. The documents exist in English and Italian and cover the production from the founding of the European Community to March 1995 for the Italian documents and to July 1996 for the English documents. It is important to underline that the Italian documents are a translation of the English ones, because the European Community draws up its original documentation only in English and French. We collected approximately one hundred and ten megabytes of electronic text for each language, divided as shown below: 2232 Directives: 6 500 000 words, 1798 Direttive: 5 800 000 words, 4472 Judgments: 13 700 000 words, 4471 Sentenze: 12 300 000 words. The retrieved documentation was not directly usable because there was much additional information mixed with the essential text and numerous orthographic errors. So a great deal of work was required to eliminate all that was unnecessary and inessential and to correct the mistakes. Many reference tags, multiple blanks between words, blanks between words, and punctuation marks were removed to standardise the document formatting and to save space. The documents were



JB[v.20020404] Prn:15/02/2007; 13:03



F: BCT802.tex / p.4 (212-269)

R. Rossini Favretti, F. Tamburini, and E. Martelli

coded in SGML ISO-Latin-1 to make the corpus platform independent. The problem was that the original documents contained a lot of characters, especially accents in Italian, which are correctly displayed in a DOS computer but not on different ones. The SGML coding is an international standard for multilingual documents, correctly handled by different computers. In the earlier Italian documents, there were incorrectly written words, some others without accents, and so on. We solved this problem by comparing each word with an electronic dictionary, augmented with all the Italian verb conjugations, inserting all the requested accents, and fixing most of the remaining errors. Finally the single documents were joined together in four subcorpora and then indexed to be correctly handled by the corpus access tools. . Corpus access tools . Corpus data retrieval Nowadays there is an increasing need for large corpora, both to investigate changes in everyday language – such as “monitor corpora” – , that foresee no finite size but a flow of information and linguistic evidence filtered through devices, to create an exact picture of the real up-to-date language (Sinclair 1991) – and to analyse extremely specialised linguistic features. In order to manage this amount of data, we need adequate computational procedures that have to be general – they have to accept different approaches to mark-up, tokenisation, languages, etc. – flexible – they must allow corpus maintenance and adaptation – user friendly, and, last but not least, they have to be extremely fast. In response to these needs, O. Mason (1996) has devised CUE (Corpus Universal Examiner), a set of computer programs able to address all the requirements of a modern corpus retrieval application. The first version of CUE was written in C++ for UNIX systems, using the publicly available library Xforms (Zhao and Overmars 1995; Reichard and Johnson 1996) for the interface design. It involves complex indexing schemes (inverted index), fast procedures for the retrieval and access of data, and compression methods (Huffman coding) to reduce the amount of space needed to store the corpora. The main problem with this application was that it followed the standalone application paradigm. This meant that only the workstation that stored the corpora would have immediate access to them. Even if a complete Networked File System were provided, the application would run only on UNIX machines. When we started the BOLC project, it was immediately clear that having only one station with corpus access did not meet our needs and that we had to provide a different access method for users. The decision was to transform the standalone version of CUE into a client-server application in such a way that the server machine can provide corpus access across our Local Area Network. Moreover,

JB[v.20020404] Prn:15/02/2007; 13:03

F: BCT802.tex / p.5 (269-310)

Words from Bononia Legal Corpus

Figure 1. Client/Server structure of JCUE, developed at CILTA

we had to address a different problem, the multi-standard nature of our client workstation. At CILTA, we currently have Windows-based PCs, Macintoshes, and UNIX workstations. It was not conceivable to develop and maintain a different client application for each kind of operating system/hardware platform pair. The natural, and unique, solution to such a problem was to develop the CUE client side in Java, obtaining, in theory, complete portability among different systems without any further effort. Figure 1 shows the scheme of the new version of CUE (called JCUE), developed at CILTA. The server side was derived from the original CUE release. It is written in C++ and runs on a Sun Ultra Enterprise 450 with 512 MB of memory and 20 GB of disk space supporting the Solaris 2.7 operating system. It was implemented following the concurrent server model so that it can accept multiple queries from different client machines at the same time. Once a new client makes a request to activate the service, a new copy of the server program is created; it remains active after the client closes the connection. It is important to note that, for security reasons, the client has to provide authentication – as a legal JCUE client program – and the user, who is trying to access this service, has to provide passwords. In this way, we can restrict the use of some corpora to particular users or research teams. The most complex work was to divide the standalone application into a server side and a client side, providing a complete set of operations needed to retrieve data from the network. We developed a scheme similar to Remote Procedure Call technique, building a client-and-server-module interface to the network communication protocol. Figure 2 outlines the methods. These modules transform the request and the data from the client side in string codes that are sent across the network using the standard BSD socket support. Using a similar scheme, they transform the data retrieved by the server in a similar way and send it back to the client.



JB[v.20020404] Prn:15/02/2007; 13:03



F: BCT802.tex / p.6 (310-360)

R. Rossini Favretti, F. Tamburini, and E. Martelli

Figure 2. Communication structure for JCUE package

The client side was completely redesigned using Java (version 1.2) and is currently working on Windows 95/NT PCs, Macintoshes, and Sun-Solaris UNIX workstations. We faced a number of problems using Java, mainly due to the differences among the implementation of the Java runtime machine on different architectures. This is why we decided to develop the client in the first, widely implemented, version of Java. We also developed an X-Window version of the client for UNIX machines, directly derived from the original CUE package. . Source document extraction For an in-depth analysis of parallel corpora it is often not sufficient to examine only the concordances produced using a retrieval procedure. Sometimes, in order to clarify the relationship among words from different languages, it is necessary to examine the entire document that contains a determinate concordance, even if features that furnish the extended concordance context are available. Moreover, this kind of analysis is often carried out using separate programs that align parallel document texts. In order to satisfy these needs, we developed a system for document identification and a separate client-server application for the document retrieval. This application, that we called Corpus Document Extractor (JCDE), behaves in a similar way as JCUE package. A server, written in C++, runs on the station that contains the corpus data while a Java client, that communicates with the server across the network, interfaces the document retrieval procedure from every remote station (Windows 95/NT PCs, Macintoshes, UNIX workstations). Using this client/server application the user can retrieve the documents contained in the corpora, specifying only the document identification string.

. The terms “contratto” and “contract”: Translation equivalences To illustrate the information which can be obtained about the syntactic and semantic structures of the terms under investigation, an example, the term “contratto” was selected from the Italian subcorpus and used as the search node.

JB[v.20020404] Prn:15/02/2007; 13:03

F: BCT802.tex / p.7 (360-528)

Words from Bononia Legal Corpus

The selection of the term was determined by the relevance of the contract as a legal device. The contract, it has been argued, may be considered as the legal cornerstone of all transactions in business and consumer life. The law of contract is deeply embedded in the business practices of different countries. Different legal systems may vary substantially on a number of matters owing to historical, institutional, or commercial reasons, but in recent times with the rapid expansion of trade and business, attempts have been made to limit the effect of dissimilarities in the contract law of different legal systems. A process of “internationalition” may be assumed, in spite of the deep-rooted divergencies still existing between the systems of common law and civil law. To identify the collocates of the term “contratto” the concordances were automatically selected from 4642 citations: n anticipo sull ’ aiuto relativo al ti o non siano comunque conformi al ne finale sull ’ aggiudicazione del r ) , relativa alla risoluzione del loro perdere l ’ aggiudicazione del nio successivo alla conclusione del l danno o chieda la risoluzione del ne a garanzia dell ’ esecuzione del tta tabella . la caratteristica del esa a seguito della risoluzione del di due anni dopo l ’ estinzione del sostanza , che la comunicazione del ommissione nell ’ inadempimento del mento di diverse somme in forza del tto giorni dopo la stipulazione del e constatate nell ’ adempimento del i applichino fino alla scadenza del to soltanto dopo la conclusione del e gli obblighi che ha in virtu’ del te o esecutore contemplato da detto e l ’ onere pecuniario ( diritto di ere in considerazione in materia di nimento dei diritti connessi con il o e relative all ’ esecuzione di un utiva di competenza contenuta in un to secondo cui la conclusione di un romesse dall ’ aggiudicazione di un , per il 30 settembre 1978 , di un e concernente l ’ esecuzione d ’ un 1 : se la clausola contenuta in un

contratto contratto contratto contratto contratto contratto contratto contratto contratto contratto contratto contratto contratto contratto contratto contratto contratto contratto contratto contratto contratto contratto contratto contratto contratto contratto contratto contratto contratto contratto

, anticipo che le veniva versato di fornitura . 2 . quando : a ) , sono prese da detto stato . le ed alla condanna al risarcimento d ’ appalto per la costruzione d d ’ appalto iniziale ; h ) quand per inadempimento della contropa garantito ) condizioni particola di agente ausiliario e la precar di locazione - vendita mediante . 4 . il presente articolo lasci Statoil non e’ " necessaria " , per una colpa commessa all ’ a di lavoro o a causa della sua di in questione " . 20 gli artt . 1 non siano imputabili ne’ a colpa . Se necessario e’ possibile as di ammasso . 2 ) l ’ operazione d ’ agenzia . articolo 19 le par abbia trasferito il suo diritto ) applicato sul risone prodotto di lavoro e’ quella che caratter di lavoro , compreso il mantenim di lavoro , le disposizioni dell scritto di concessione esclusiva di ammasso di formaggi e discipl di appalto di lavori pubblici fi di compravendita di latte intero di fornitura di mangimi stipulat di concessione di licenza , seco

As a following step, the term “contract” was selected from the English subcorpus, and these concordances were automatically selected from 5449 citations: . An important characteristic of es with which they have concluded d to state first of all whether " f that training takes place under . An important characteristic of

a a a a a

contract contract contract contract contract

for the employment of auxiliary s for the supply of animals or seme for the supply of beer concluded of apprenticeship concluded under for the employment of auxiliary s



JB[v.20020404] Prn:15/02/2007; 13:03



F: BCT802.tex / p.8 (528-565)

R. Rossini Favretti, F. Tamburini, and E. Martelli

t of obligations which arose from a h the flexon - italia undertaking a usion and termination of the agency sferor resulting from an employment wing entry into force of the export ontract : 4 . Criteria for award of erning indemnity for termination of cluded on the grounds of freedom of public works at issue by a private thorities who have awarded a public if necessary , adjust the research nce by the other party to the sales counterclaim arising from the same . 7 . Criteria for the award of the the agency or branch concluding the ch a list in the state awarding the s by expressly stipulating that the d ) the date of commencement of the e the date of the conclusion of the nsidered suitable to tender for the or admittance to participate in the is rights and obligations under the elgian law , the dissolution of the uired to do so if it is awarded the roof of Fiat ’ s strong position in

contract contract contract contract contract contract contract contract contract contract contract contract contract contract contract contract contract contract contract contract contract contract contract contract contract

of employment or an employment re for the cleaning of the establish Article 13 1 . Each party shall b or employment relationship and ar , shall be the condition preceden : 5 . Number of tenders received between the principal and the com of the parties to the Collective and had failed to publish a notic or have held a design contest sha to the new situation with the app under which the goods were to be or facts on which the original cl . 8 . Other information . 9 . Dat is situated ( a ) 3 . The address may be required of contractors es should be governed exclusively by or employment relationship ; ( e . 3 for the 1971 / 72 wine - grow in question . However , such a me that , during the three previous without the franchisor ’ s appr by the court , on the ground of t , to the extent that this change negotiations . ( 721 et seq . ) .

If we begin by examining the environment of the term “contract”, we notice that “contract” appears 1) as a headword, 2) as a modifier of a noun group, or 3) as a single-word term, often preceded by a determiner. Let us consider the first position to the left of the node (designated N–1). We find two kinds of collocates: grammar words and full lexical words. Both in Italian and in English concordances, we notice a high occurrence of the article – both definite and indefinite – often preceded by a preposition, in N–2 position. “Of ” and “di” dominate the pattern. In each of the tables, if we look at N–3 position, we notice the occurrence of a noun. A regular pattern can be identified in the following noun groups where processes inherent in the commencement, performance, and conclusion of the contract are expressed: award of (the) contract breach conclusion commencement dissolution execution performance publication rescission signature stipulation

aggiudicazione del contratto inadempimento conclusione inizio scioglimento esecuzione adempimento pubblicazione estinzione firma stipula, stipulazione

JB[v.20020404] Prn:15/02/2007; 13:03

F: BCT802.tex / p.9 (565-673)

Words from Bononia Legal Corpus

suspension termination

sospensione risoluzione

A noun group emerges as particularly relevant: noun + di [+ determiner] + contratto noun + of [+ determiner] + contract where the noun is a derived nominal. The subjective value of terms denoting the contract is constant: 1. (a) la conclusione del contratto 1. (b) il contratto è concluso 2. (a) the conclusion of the contract 2. (b) the contract is concluded In the collocations provided in the tables, a number of equivalences may be identified in the lexicalization of the contract procedures, but a difference emerges, even from a superficial glance, in the conceptual extension of the terms “contratto” and “contract”. In a number of concordances, corpus evidence suggests two different senses for “contract” which have their translation equivalents in Italian, in 1) “contratto” and 2) “contratto d’appalto”. A striking feature in the tables is that various kinds of lexically specific information is associated with “contract” in: 2. (a) the conclusion of the contract and in: 3. the award of the contract The nature of the contract, in its most salient and typical components, is strictly tied to the collocate, particularly, in 3, to the word “award”. “Award” is a far more important collocate (610) in English than “aggiudicare” (55) and “aggiudicazione” (7) are in Italian. To illustrate this point let us consider the following citations selected automatically from our corpus: onclusion of a contract following its he grounds on which it decided not to 2 . Where the contracting authorities he grounds on which it decided not to ating to the contract provide for its either require the concessionnaire to umber of contracts awarded ( where an as part of a procedure leading to the to participate in procedures for the

award award award award award award award award award

, the powers of the body responsi a contract in respect of which a a contract by restricted procedur a contract in respect of which a at the lowest price tendered , th contracts representing a minimum has been split between more than of a service contract the estimat of contracts may be made by lette

. CPC reference number . 4 . Date of rticle 16 m ) : 13 . Criteria for the 1976 coordinating procedures for the e scope of the law procedures for the icle 40 , information relating to the ng coordination of procedures for the

award award award award award award

of of of of of of

the contract . 5 . Criteria fo the contract . Criteria other public supply contracts ( 6 ) public works contracts other t contracts . 3 . As regards ind public works contracts ( 89 /



JB[v.20020404] Prn:15/02/2007; 13:03



F: BCT802.tex / p.10 (673-787)

R. Rossini Favretti, F. Tamburini, and E. Martelli

25 And 26 ( d ) the criteria for the nation of national procedures for the on of suppliers or contractors and of NS Article 28 For the purposes of the have a fair opportunity to secure the n . Article 7 For the purposes of the commencement of the procedures of the has been committed during a contract contracting authority : 2 . ( a ) The articipating in the relevant contract contracting authority : 2 . ( a ) The the contracting authority . 2 . ( a ) that law as regards : ( a ) contract nders before deciding to whom it will a contracting authority , who wish to

award award award award award award award award award award award Award award award award

of the contract if these are not of public supply contracts ; Wher of contracts , contracting entiti of public contracts by the contra of contracts , but does not conta of public contracts by the contra of the contract ( s ) ( if known procedure falling within the scop procedure chosen : ( b ) Form of procedure the opportunity to make procedure chosen : ( b ) Where ap procedure chosen . ( b ) Where ap procedures falling within the sco the contract . For this purpose i works contracts to a third party

“Contract” may occupy different positions in the verbal co-text of “award”, but it is always present in its role structure. At this point, it is worthwhile considering the patterns in both languages. Let us examine the concordance of the limited examples of “aggiudicazione” in Italian: delle nuove forme contrattuali di aggiudicazione degli appalti e introdurre c opo di coordinare le procedure di aggiudicazione dei contratti di appalto di lavori da dare in appalto e l ’ aggiudicazione del contratto sono due opera 1 . Laddove il criterio per l ’ aggiudicazione del contratto sia quello del di un contratto in seguito all ’ aggiudicazione dell ’ appalto , i poteri de di un contratto in seguito all ’ aggiudicazione dell ’ appalto , i poteri de di appalti ; considerando che l ’ aggiudicazione di contratti relativi a dete

In Italian “aggiudicazione” and “appalto” are important collocates of the term “contratto”, but in a number of examples, they occur without “contratto” as a collocate. As far as we can ascertain in our corpus, “contratto” and “appalto” are not necessarily “mutually expectant words”. The following concordance of “appalto”, automatically selected from 728 citations, may illustrate this point: er le forniture cui si riferisce l successivo alla conclusione dell UARE TALE TRASFORMAZIONE QUALORA L calcolo del valore di stima dell alitativa e di aggiudicazione dell alcolo dell ’ importo stimato dell al quale sarà stato aggiudicato l he cos tituiranno l ’ oggetto dell seguito all ’ aggiudicazione dell per partecipare ad una procedura d VERSIA SORTA DA UN BANDO DI GARA D purché le condizioni iniziali dell separabili dall ’ esecuzione dell

’ ’ ’ ’ ’ ’ ’ ’ ’ ’ ’ ’ ’

appalto appalto APPALTO appalto appalto appalto appalto appalto appalto appalto APPALTO appalto appalto

, relativo agli ultimi tre eserc iniziale . 4 . In tutti gli altr GLI VENGA AGGIUDICATO . ARTICOLO : - nell ’ ipotesi di appalti un e che esse non prevedono la poss è : - se trattasi di appalto di : 6 . a ) Data limite di ricezio ; b ) l ’ avviso deve indicare c , i poteri dell ’ organo respons o ad un concorso di progettazion DELL ’ ADMINISTRATION DES PONTS non siano sostanzialmente modifi iniziale , siano strettamente ne

’ AGGIUDICAZIONE DEL CONTRATTO D ’ APPALTO PER LA COSTRUZIONE DELL ’ ISTITU . c ) Eventualmente , forma dell ’ appalto che è oggetto della gara . 3 . a NECESSARIE NEL CORSO DELLA GARA D fferenti e l ’ aggiudicazione dell MPRESE CHE PARTECIPANO ALLE GARE D lo di gara relativo al contratto d

’ ’ ’ ’

APPALTO appalto APPALTO appalto

, COMPRESA LA DECISIONE FINALE S possano aver luogo simultaneamen O ALLE QUALI SONO AGGIUDICATI AP n . 4 del progetto relativo all

JB[v.20020404] Prn:15/02/2007; 13:03

F: BCT802.tex / p.11 (787-885)

Words from Bononia Legal Corpus

IONE , A TRATTATIVA PRIVATA , DELL ’ ATA IN GRADO DI AGGIUDICARE UN NUOVO CCIANO O MENO PARTE INTEGRANTE DI UN usole contrattuali di un determinato ori all ’ impresa titolare del primo ditore che desideri partecipare a un ente - Riserva di una frazione di un catrici e che intendono stipulare un le amministrazioni aggiudichino un onsiderare un accordo quadro come un di automazione del gioco del lotto ◦

APPALTO APPALTO APPALTO appalto appalto appalto appalto appalto appalto appalto Appalto

PER LA REALIZZAZIONE DELL ’ IMPI . PER I MOTIVI GIÀ ESPOSTI IN PR DI LAVORI PUBBLICI . 3 . L ’ ART , di prescrizioni tecniche che m , a condizione che i nuovi lavor pubblico di lavori può essere in pubblico alle imprese situate in di lavori con un terzo , ai sens mediante procedura negoziata sec ai sensi dell ’ articolo 1 , par non riguardante attività che imp

All these patterns: 4. l’aggiudicazione del contratto d’appalto 5. l’aggiudicazione degli appalti / dell’appalto 6. l’aggiudicazione del contratto find their translation equivalence in: 3. the award of the contract In English, it is the process expressed by the verb “award” which is associated with the peculiar typology of contract 2. What can be argued, in the present connection, is the fact that in all the English examples of the corpus it is in the collocates such as “award” and “tender” that we find the lexical information which is associated, in Italian, with “contratto d’appalto” or “appalto”. A second notable feature which emerges in the comparative analysis of the tables of “contratto” and “contract” is the way in which the contract type is specified through pre-modification (N–1) in English and post-modification (N+1 and N+2) in Italian: 7. agency contract 8. contratto d’agenzia Examples of post-modification may be found also in the English subcorpus, but pre-nominal modification prevails in English whereas post-nominal modification prevails in Italian. If we look at the syntactic environments of the words “contratto” and “contract”, a further difference between the syntactic structures of the two languages is illustrated by the class shift taking place when “contract” occurs as modifier: 9. contract negotiations 10. negoziazioni contrattuali The word “contrattuale” has a high occurrence (490) in the Italian examples, and “contract” is its translation equivalent in English: del dipendente di ruolo e quella , contrattuale , dell ’ agente temporaneo , ia cambiamento , dovuto a cessione contrattuale o a fusione , della persona f a quelli operati mediante cessione contrattuale oppure mediante fusione , que



JB[v.20020404] Prn:15/02/2007; 13:03



F: BCT802.tex / p.12 (885-927)

R. Rossini Favretti, F. Tamburini, and E. Martelli

carico della commissione una colpa contrattuale di cui essa deve rispondere . ri o carenze nel suo comportamento contrattuale , come un ritardo nell ’ appr in fatto di responsabilita’ extra contrattuale . quanto al problema della pr ziarsi sulla responsabilita’ extra contrattuale della comunita . 4 . la const n materia di responsabilita’ extra contrattuale , il trattato assoggetta la c attribuisca importanza alla forma contrattuale - acquisto o leasing - nemmen 68 - competenze speciali - materia contrattuale - concessione esclusiva - lit nto dell ’ obbligazione in materia contrattuale . 19 e ’ vero che questa norm dots ’ . 9 la nozione di materia contrattuale serve quindi di criterio per te della prima dell ’ obbligazione contrattuale di consegnare alla Rewe - zen , al di fuori di qualsiasi obbligo contrattuale , conceda speciali agevolazio ssicurazione avente base puramente contrattuale non rientra quindi , ratione ri , nonche’ in materia di diritto contrattuale. qualsiasi disposizione cont che non si ricollega alla materia contrattuale di cui all ’ art . 5 , punto ai assunto alcun obbligo di natura contrattuale nei confronti del subacquiren i ) , che lo statuto ha una natura contrattuale e che , percio’ , una clausol ata la liberta’ della negoziazione contrattuale dei diritti sancita dalla pre stione sub 1 : se l ’ obbligazione contrattuale , secondo la quale il concess ) , da un lato , nella sua prassi contrattuale , imposto alle sue contropart nave * in tonnellate * del prezzo contrattuale ( 1 ) ( 1 ) l ’ equivalente peso pari al 90 % del quantitativo contrattuale , a prescindere dal fatto che senza raggiungere il quantitativo contrattuale , l ’ importo dell ’ aiuto vi

This may be traced back to the different formation of noun groups in the two languages. In English, most noun groups consist of two or more nouns. In Italian, they predominantly consist of a noun either preceded or followed by one or more adjectives. This can have an important bearing on our analysis of right and left collocates. If we go on in our analysis and consider the first position to the right of the node (N+1), we find prepositions as predominant collocates. The preposition of (821) and the preposition di (1386) prevail, followed by a noun in N+2 position: contratto + di + noun contract + of + noun Another notable feature, in English, is the occurrence of the preposition for (217) when the noun is preceded by the definite article. When for is associated with a determiner and a noun, the noun is usually qualified by a prepositional phrase: contract + for + determiner + noun + of + noun A constant distinction is drawn between phrases like: 11. a contract of employment and phrases like: 12. a contract for the employment of auxiliary staff Such distinction has no equivalent in Italian: un contratto + di + noun [+ di + noun]

JB[v.20020404] Prn:15/02/2007; 13:03

F: BCT802.tex / p.13 (927-1021)

Words from Bononia Legal Corpus

In the cross-language analysis, we can say that syntactic differences play a more important role than lexico-semantic ones. It remains to be seen whether these results have a general value or are limited to the terms under scrutiny.

. Translation equivalents of the terms “tax” and “duty” . The term “tax”: What the English subcorpus shows To exemplify a situation where cross-language equivalence cannot be assumed, we will refer to the tax law and analyse, as a second case study, the word “tax”. Through the word “tax”, a situation is referred to which can be considered common both to England and to Italy and can be assumed to apply, with the extension of our corpus, to other European countries as well. In all countries, taxes are levied on income and expenditure by central and local governments, but different categories are employed in their definitions. It is our hypothesis that some of the main categories may emerge from interlinguistic comparison. As a first step, we will consider the following concordance of the word “tax”, selected automatically from our corpus where there are 7722 citations altogether: ve rise , for the purposes of turnover purposes of the rules on value - added ut as long ago as 1967 , only turnover ssary steps to permit the remission of ates of the Portuguese motor - vehicle taxes - common system of value - added ly Section 10 ( 2 ) of the value added for the special purposes of the income led a Member State may not refuse that nal legislation for qualifying for the , by granting exemptions from turnover nting the common system of value added iple , goods acquired free of turnover Article 1 1 . Exemption from turnover eria laid down by law , which give the that a system of road tax in which one eedings instituted by H . Lennartz , a the Commission has not challenged the

tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax

, to a new immovable property comp , by agreement with one of his emp , which at that time was applicabl , in accordance with the procedure , which increases sharply as from - duties or charges which cannot b act 1972 , which reduces the taxab acts , hereby rules : Community la advantage on the basis of suppleme advantage in question . ISSUE 1 in and excise duties in respect of th and amending Directive 77 / 388 / and excise duties in the course of and excise duty on imports shall a authorities no discretion and make band comprises more power - rating consultant in Munich , concerning differential between sparkling win

with the rule that there should be no ions , be justified in an area such as inding - up entails in company law and n of the programme of harmonization of e shall be entitled to deduct from his f his business , where the value added taxation Whereas a Community system of down by Member States until Community e chargeable event shall occur and the ance with the cumulative multi - stage orities of the Member States where the ates of a common system of value added

tax tax tax tax tax tax tax tax tax tax tax tax

discrimination - ( EEC Treaty , ar law , it must be observed in this law . The legislation of other Sta legislation pursuant to Article 99 liability the value added tax due on the goods in question or the co reductions on imports has proved n rules are adopted . The exemption shall become chargeable at the tim system has constantly given rise t warehouse is authorized ; ( b ) co Whereas a system of value added ta



JB[v.20020404] Prn:15/02/2007; 13:03



F: BCT802.tex / p.14 (1021-1074)

R. Rossini Favretti, F. Tamburini, and E. Martelli

On inspecting the concordances we observe that “tax” tends to occur either followed or preceded by a noun or a noun group. Like “contract”, it occurs 1) as a modifier, 2) as a headword, and 3) as a single word term. In a particularly high number of examples, it occurs as a modifier in a noun group. As its top ten collocates in N+1 position, we find: provisions (337) system (196) purposes (165) authorities (132) burden (101) legislation (97) advantages (93) arrangements (81) exemptions (65) exemption (58)

In the examples where the term “tax” occurs as a headword, it is associated with pre-nominal (N–1) or post-nominal (N+1) modification. N–1 position may be occupied: – by a noun turnover (605) income (102)

– by an -ed modifier value-added (664)

– by an -ing form withholding (12)

In the examples where the word “tax” is not associated with pre-modification, N–1 position is occupied: – by a preposition of (588) for (165) to (157) from (71)

– by an article the (1294) a (324)

On the right, where a noun does not occur in N+1 position, the position is often occupied by a preposition, and “tax” is qualified by a prepositional phrase: on consumption (49)

JB[v.20020404] Prn:15/02/2007; 13:03

F: BCT802.tex / p.15 (1074-1174)

Words from Bononia Legal Corpus

The occurrence of “tax” without modification tends to concentrate in instances where the term is either preceded or followed by a comma or by connectives: duty and tax turnover tax and excise duty The examples suggest that “tax”, in its singular form, presents three different senses: 1) a general, indefinite one, in the first instances, when followed by a noun and used as modifier; 2) a general collective one, in the second group of instances, when it is not associated either with post-modification or pre-modification; 3) a specific one, when it is preceded by a modifier in N–1 position. There is a hyponymic relation between 3 and 2, which may be exemplified by such pairs as “turnover tax” and “tax”. . The term “duty” In the concordance of “tax”, “duty” appears as a significant collocate. “Duty” collocates with “tax”, but the lexical environments of the two words are different. Their most prominent collocates do not overlap, as the concordance below automatically selected from 5705 citations, illustrates: adopted for the imposition of excise having equivalent effect to a customs other than products subject to excise having equivalent effect to a customs e Commission objections regarding the inks , the real value of the rates of products , both net and inclusive of ng prices , both net and inclusive of nal excise duty , the specific excise nd the sum of the proportional excise ing an effect equivalent to a customs which may be : - either an ad valorem hich has the characteristics of stamp fix the amount of the specific excise ffect of the increase in the rates of beer - export refund - countervailing the application of the anti - dumping ove that the adjustment of the excise with the imposition of anti - dumping - Luxembourg Economic Union , excise l measures introducing a differential t the Member States to impose capital ing an effect equivalent to a customs xemption from turnover tax and excise ion to the bound duty , an additional otnote ( a ) concerning an additional es . 4 . Where necessary , the excise

duty duty duty duty duty duty duty duty duty duty duty duty duty duty duty duty duty duty duty duty duty duty duty duty duty duty duty

’ . 4 the appeal lodged by gb , and The application of any quant , paragraph 1 shall not apply to s , contrary to Articles 12 et seq . - free importation of the instrume and the wider objectives of the Tr and tax _ the estimated average gr and tax , whether published or not and the turnover tax levied on the and the turnover tax , in such a w but is in reality intended to offs calculated on the basis of the max charged on the acquisition of buil levied on the cigarettes under com on spirits on 7 September 1977 by on imports . Case c - 152 / 89 . I on ball - bearings and tapered rol on beer leads to over - taxation o on products assembled or produced on beer is levied in Belgium and L on coal imported from the open mar on an interest - free loan granted on exports , as prohibited in trad on imports in international travel on sugar , corresponding to the ch on sugar . This footnote provides on cigarettes may include a minimu



JB[v.20020404] Prn:15/02/2007; 13:03



F: BCT802.tex / p.16 (1174-1290)

R. Rossini Favretti, F. Tamburini, and E. Martelli

hether the charge is in the form of a duty or tax or in the form of an equali ctual increase of the rate of customs duty or from a rearrangement of the tar

Pre-nominal and post-nominal modifications prevail in N–1 and N+1 positions, but its collocates are different if compared to “tax”: dumping (716) customs (617) excise (598) definitive (308) free (296) imports (285) rate (259) provisional (257) subject (160) products (141) Terms like “dumping” or “customs” do not collocate with “tax”, nor does “turnover” collocate with “duty”. Through the term “income tax”, direct taxes are exemplified whereas through “excise duties” indirect taxes are exemplified. Duty is a tax levied on commodities, transactions, or estates rather than on persons. It is an indirect tax. On closer inspection of the collocates of “tax” and “duty”, we see that in the first group of examples where “tax” occurs, reference is primarily made to direct taxation, while in the second group of examples where “duty” occurs, reference is primarily made to indirect taxation. In English, a primary distinction is drawn between direct and indirect taxation. In this distinction, a deviant example can be found in the occurrence of “VAT” and “value-added tax”, a tax paid on the supply of all goods and services in the U.K., introduced in 1973 to harmonize the British tax system with that of the other European Community countries. The occurrence may be explained by the general character acquired by the tax and by the superordinate value that the term “tax” holds. . A cross-linguistic comparison If we consider the data of the Italian subcorpus, we find significant similarities and differences in the translation equivalents. As to the first meaning of “tax”, for instance, it will be observed that a class shift is implied as the adjective “fiscale” (1696) appears to be its translation equivalent in Italian, collocating with such words as “sistema”, “carico”, “franchigia”, “deposito”, “esenzione”, “evasione”, etc.. As we have seen, this may be traced back to the different composition of noun groups in English and Italian: oppure il diritto a tale agevolazione fiscale spetti solo nel caso in cui l ’ reclami rivolti all ’ amministrazione fiscale , sia i ricorsi giurisdizionali

JB[v.20020404] Prn:15/02/2007; 13:03

F: BCT802.tex / p.17 (1290-1374)

Words from Bononia Legal Corpus

e d ’ appello , l ’ amministrazione vacanze e sottraendone l ’ anticipo assimilati ad essa sotto l ’ aspetto nda dei casi , nella stessa categoria icolare all ’ efficacia del controllo ueva quindi interamente il suo debito conseguenza , al sorgere di un debito bro in cui e’ autorizzato il deposito diritto delle societa’ e del diritto he il cantisani , nella dichiarazione ato che il divieto di discriminazione e autoveicoli importati in franchigia are il rischio di evasione o di frode la " tax evasion " , cioe’ alla frode mente il principio dell ’ imposizione . secondo le disposizioni della legge ituire tributi che non abbiano natura contraria al principio di neutralita’ in modo apprezzabile sul futuro onere ev ’ essere raffrontato con l ’ onere n e sottoposto ad alcun provvedimento lett . b ) , del codice di procedura le era volta a disciplinare il regime questione relativ... al diverso regime ezionistico di un determinato sistema nati all ’ esportazione in un sistema e i vini importati ad un sovraccarico implicante un determinato trattamento

fiscale fiscale fiscale fiscale fiscale fiscale fiscale fiscale fiscale fiscale fiscale fiscale fiscale fiscale fiscale fiscale fiscale fiscale fiscale fiscale fiscale fiscale fiscale fiscale fiscale fiscale fiscale fiscale

ha riconsiderato la sua posizio e gli oneri sociali a carico vo , e di respingere il ricorso pe , doganale o statistica . b ) i o , ai sensi dell ’ art . 36 de , presentando pero le sue rimos in fatto d ’ imposta sulla cifr ; » . 4 ) all ’ articolo 14 e’ . altre legislazioni riconoscon dei redditi per il 1977 , aveva di cui all ’ art . 95 del tratt sarebbe un mezzo necessario , i . in particolare , non e provat . 30 e opportuno osservare che nello stato membro destinatario , il mutuatario puo’ dedurre da , ma siano istituiti specificam inerente al sistema comune di i , devono essere fornite indicaz pu’ ridotto effettivamente sopp o di effetto equivalente che ne ( livre des procedures fiscales in modo tale da farlo rimanere per le autovetture usate import nazionale ; orbene , risulta ch volto a finanziare il controllo atto a proteggere la birra di p , l ’ analogo prodotto importat

As far as meanings 2 and 3 are concerned, a parallel can be drawn between the occurrences of “tax” in the English subcorpus and of “imposta” in the Italian one. In a high percentage of cases, “tax” finds its counterpart in “imposta”. “Imposta” like “tax” is used as a superordinate, but if we consider the collocates of “imposta”, we notice relevant differences in the collocations of the two terms. Let us have a quick scan through the concordance of “imposta” (4209): la legge olandese relativa all ’ na esauriente delle fanchigie dall ’ l bene e , di fatto , gravato dall ’ cliente e’ registrato ai fini dell ’ dei prelievi rispetto a crediti d ’ rescrive il metodo di calcolo dell ’ gricoli . la parte " mobile " dell ’ le egli e’ registrato ai fini dell ’ relativa alle imposte diverse dall ’ one per determinare l ’ aliquota d ’ ssoggettare detta retribuzione all ’ la natura protezionistica di quest ’ ocedimento c - 353 / 90 " 1 ) se l ’ , in via di principio , a debiti d ’ rci cedute da privati , qualora un ’ ettiva osti alla riscossione di un ’ la struttura che le aliquote dell ’ 2 1 . le operazioni sottoposte all ’ ta’ di capitali . articolo 5 1 . l ’ hanno la facolta’ di riscuotere l ’

imposta imposta imposta imposta imposta imposta imposta imposta imposta imposta imposta imposta imposta imposta imposta imposta imposta imposta imposta imposta

sulla cifra d ’ affari ha previs sull ’ entrata e dai diritti d soltanto in base al valore aggiu sul valore aggiunto ; - l ’ oper analoghi ai quali gli stati memb di conguaglio da applicare nei l contemplata dall ’ articolo 10 s sul valore aggiunto e destinati sulla cifra d ’ affari che grava applicabile ai redditi della mog nazionale sul reddito . di conse e accentuata dal fatto che essa sul consumo delle banane fresche sulla cifra d ’ affari all ’ imp del genere non venga riscossa su speciale sugli spettacoli e sugl stessa ; considerando che il man sui conferimenti sono tassabili e’ liquidata : a ) nel caso dell soltanto man mano che i conferim



JB[v.20020404] Prn:15/02/2007; 13:03



F: BCT802.tex / p.18 (1374-1463)

R. Rossini Favretti, F. Tamburini, and E. Martelli

, di conseguenza , gli sgravi dell ’ i e pronunciata per il rinvio dell ’ oganali dalla base di calcolo dell ’ a deduzione totale o parziale dell ’ ore dell ’ imposta si verifica e l ’ ge tributaria ; l ’ incidenza dell ’ e parte del sistema nazionale dell ’ ta sull ’ entrata col sistema dell ’ tuto , e calcolata in ragione dell ’ ne di determinare l ’ aliquota della

imposta imposta imposta imposta imposta imposta imposta imposta imposta imposta

sulla cifra d ’ affari e delle a sul valore aggiunto in italia al proporzionale riscossa sulle sig sul valore aggiunto . tuttavia , diventa esigibile all ’ atto del controversa sui redditi comuni e sull ’ entrata . * / 667 j0007 / cumulativa a cascata e , in seco sul reddito pagata dai genitori dovuta su altri redditi non esen

A further difference is to be pointed out. Position N–1 is generally occupied by a definite article, and “imposta” is generally modified on the right. N+1 and N+2 positions are generally occupied by post-modification. In English data we find: pre-modification + noun while in Italian data we have: [determiner] + noun + post-modification The different structure of the noun group plays a role which cannot be overlooked and which will be the object of further analysis. It is interesting at this point to compare “duty” with “tassa” as we might expect it to be its equivalent. But we see that the occurrences of “tassa” are definitely lower as the term occurs in 1398 citations. Some of them, selected automatically, are reproduced here: dazi doganali . per stabilire se una determinati casi , l ’ esonero dalla istica b ) addizionale del 5 % sulla le caratteristiche essenziali di una direttamente da paesi terzi , di una ssione , da parte della pbc , di una ’ italia , in merito alla stessa ’ ggio 1987 , dichiara : un sistema di propriamente detto , costituisce una a seconda questione nel senso che la stessi criteri , puo’ costituire una tta di un onere unico , denominato ’ versato e l ’ importo massimo della mposizione di un contributo , di una gge 16 gennaio 1985 , n . 13 , sulla ve modifiche di detta legge , di una etato nel senso che esso colpisce la al trattato cee , comprenda anche la ntroversia verte sul pagamento della pesanti e riduzione parallela di una

tassa tassa tassa tassa tassa tassa tassa tassa tassa tassa tassa tassa tassa tassa tassa tassa tassa tassa tassa tassa

abbia effetto equivalente a quello all ’ esportazione per le patate ( automobilistica - lussemburgo : ta del genere . a norma dell ’ artico destinata a scopi previdenziali . destinata a sovvenzionare lo smerc di sbarco ’ , un ricorso per ina di circolazione che , mediante l ’ di effetto equivalente ai sensi de di compensazione riscossa sui vini di effetto equivalente ad un dazio di presentazione in dogana ’ . l differenziale sulle autovetture di d ’ iscrizione o di un " minerval d ’ immatricolazione degli autovei d ’ immatricolazione sulle automob postale per la presentazione in do scolastica percepita in base alla scolastica richiesta ad un dipende sugli autoveicoli versata dai vett

If we consider the collocates, we find that the word “tassa” is modified by adjectives such as “automobilistica”, “postale”, “scolastica” and by noun groups such as “di circolazione”, “d’immatricolazione”, “d’iscrizione”. The reference to direct and indirect taxation is not made in the distinction drawn in Italian between “imposta”

JB[v.20020404] Prn:15/02/2007; 13:03

F: BCT802.tex / p.19 (1463-1528)

Words from Bononia Legal Corpus

and “tassa”. Different conceptual categories are applied in the two languages. “Tassa automobilistica”, which finds its equivalents in the corpus data both in “vehicle tax” and in “vehicle duty”, is something paid for a consideration of value. A payment is due in return for services. An outstanding feature of Italian tax law is the distinction made with regard to contributions levied on a person with or without regard to personal services or advantages conferred on that person by law. The word “tassa” occurs when the payment is meant as a counterpart of personal or general services.

. Conclusion The analysis should be extended to include other terms such as “charge”, “rate”, and “fee”. Work is in progress. Even limiting our consideration to the terms under scrutiny, we can say that through the analysis of the collocates, the legal framework of the tax law emerges in its main outlines showing, through the collocates, relevant differences between the systems of civil law and common law. On the one hand, corpus evidence suggests that collocation plays a fundamental role in the definition of words. On the other, this shows that, in a number of cases, the origins of linguistic differences are to be sought in institutional and historical traditions of different countries as extrinsic forces may play a part in the semantic determination of the words under scrutiny. This raises a number of questions, but as a partial conclusion of our study, we can say that by making such empirical information available corpus linguistics may provide the tools for semantic analysis. As the development of special corpora continues and provides a more adequate database upon which to address questions, they ought to play an increasingly important role in linguistic description. We think that more research should be conducted in this direction.

References Aijmer, K. and B. Altenberg (eds.). 1991. English Corpus Linguistics. London–New York, Longman. Baker, M., G. Francis, and E. Tognini-Bonelli (eds.). 1993. Text and Technology. In honour of John Sinclair. Amsterdam, Benjamins. Atkins, S., J. Clear, and N. Ostler. 1992. “Corpus design criteria.” In Literary and Linguistic Computing 7(1). Oxford, Oxford University Press: 1–16. Biber, D. 1983. “Representativeness in corpus design.” In Literary and Linguistic Computing 8(4). Oxford, Oxford University Press: 243–257. Hart, H. L. A. 1953. Definition and Theory in Jurisprudence. Oxford, Clarendon Press. Mason, O. 1996. Corpus access software: The CUE system, TEXT Technology 6(4): 257–266.



JB[v.20020404] Prn:15/02/2007; 13:03



F: BCT802.tex / p.20 (1528-1595)

R. Rossini Favretti, F. Tamburini, and E. Martelli

Reichard, K. and E. F. Johnson. 1996. Using XForms. Unix Review: 84. Rossini Favretti, R. 1993. “Estate e tenure come espressione del concetto di proprietà feudale.” In Hart, D. (ed.), Aspects of English and Italian Lexicology and Lexicography: 244–253. Roma, LIS. Rossini Favretti, R. 1998. “Using multilingual parallel corpora for the analysis of legal language: the Bononia Legal Corpus.” In Teubert, W., Tognini Bonelli, E. and Volz, N. (eds.), Proceedings of the Third European Seminar, Translation Equivalence, The TELRI Association e.V./Institut für deutsche Sprache, Mannheim; The Tuscan Word Centre: 57–68. Rossini Favretti, R. 1999. “Scientific discourse: intertextual and intercultural practices.” In Rossini Favretti, R., Sandri, G. and Scazzieri, R. (eds.), Incommensurability and Translation. Cheltenham, Edward Elgar. Sinclair, J. M. 1986. “First throw away your evidence.” In Leitner, G. (ed.), The English Reference Grammar: 56–65. Tübingen, Niemeyer. Sinclair, J. M. 1987. Looking up. London and Glasgow, Collins. Sinclair, J. M. 1991. Corpus, Concordance, Collocation. Oxford, Oxford University Press. Sinclair, J. M. 1995. “Corpus typology. A framework for classification.” In Melchers, G. and Warren, B. (eds.), Studies in Anglistics: 17–34. Stockholm, Almquist and Wiksell International. Sinclair, J. M. 1996. “Multilingual databases. An international project in multilingual lexicography.” In International Journal of Lexicography 9(3): 179–196. Stubbs, M. 1995. “Collocations and semantic profiles.” In Functions of Language 2(1): 23–55. Svartvik, J. (ed.). 1992. Directions in Corpus Linguistics. Berlin/New York, Mouton de Gruyter. Teubert, W. 1996. “Comparable or parallel corpora?” In International Journal of Lexicography 9(3): 238–264. Thomas, J. and M. Short. (eds.). 1996. Using Corpora for Language Research. London/New York, Longman. Zhao, T. C. and M. Overmars. 1995. Forms Library. A graphical user interface toolkit for X, http://bragg.phys.uwm.edu/xforms.

JB[v.20020404] Prn:31/05/2007; 9:27

F: BCT803.tex / p.1 (45-112)

Hybrid approaches for automatic segmentation and annotation of a Chinese text corpus Zhiwei Feng

This paper describes the hybrid approaches for automatic segmentation and annotation of a Chinese text corpus. Some experiment results are given. Hybrid approaches combine the rule-based method, the statistic-based method, and the automatic learning method. It is a good approach, and it can obviously improve the precision of segmentation and annotation of a Chinese text corpus.

In the processing of Chinese text corpora, two difficult problems must be resolved: one is the automatic segmentation of Chinese text, another is the automatic annotation of Chinese text. We shall discuss these difficult problems in this paper. 1. Automatic Segmentation: In a Chinese text corpus, the sentence is a continuum sequence of Chinese characters: there are no obvious delimiting markers (such as spaces in European languages) between Chinese words except for some punctuation marks. Therefore, word segmentation is essential in the automatic processing of a Chinese text corpus. The main approaches for automatic segmentation are as follows: –

Rule-based matching approach: – Maximum Matching method (MM method): Use a 6–8 Chinese character string as the maximum string; match the maximum string against the lexical entries in dictionary; if no match is found, one Chinese character is cut and until a corresponding word in the dictionary is found. The segmentation direction is from right to left. – Reverse Maximum Matching method (RMM method): Segmentation direction is from left to right. Our experiment shows that the RMM method is better than the MM method. – Bidirection Matching method (BM method): Compare the segmentation result of MM and RMM and then a correct segmentation can be obtained.

JB[v.20020404] Prn:31/05/2007; 9:27



F: BCT803.tex / p.2 (112-181)

Zhiwei Feng

– Optimum Matching method (OM method): In the dictionary, the order of entries was arranged by their frequency in the Chinese text, from higher frequency words to lower frequency words. The word with the highest frequency will be considered as optimum. – Association-Backtracking method (AB method): Use the association mechanism and backtracking mechanism for matching. –

Rule-based approach for dealing with the Ambiguous Segmentation Strings (ASS): There are two types of ASSs: – overlapping string: e.g., the overlapping segment. – combinative string: e.g., horse.

:

(graph) +

(at once):

(horse) +

(form),

becomes

(upper) = on the

To resolve these problems, various types of knowledge are necessary, various criteria are required. Part of speech (POS) information will be helpful. If we combine the POS annotation and automatic segmentation, the accuracy of segmentation will be increased remarkably. –

Hybrid approach (rule + statistics) for dealing with the Unregistered Words (URW): – URWs are mainly personal names, place names, institution names, new words. (Feng Zhiwei) – person name: e.g., – place name: e.g., (Tihany) – institution name: e.g., (TRADOS Company)

(diphacinone), the drug to kill the mouse. A new word is a kind of URW: e.g., Not all the URWs will be included in the dictionary. We use the hybrid approach to recognize the URW. –

–

Rule-based approach: The person name often appears in front of the title. For ” (Prof. Feng Zhiwei), (Feng example, in the noun phrase “ Zhiwei) is in front of the title “ ”(Prof.), so we can judge that (Feng Zhiwei) is a person name. Statistics-based approach: The Chinese character is used frequently as a family name, so we can judge that combined with the Chinese character must be a person name.

Ambiguous Segmentation Strings (ASS) and Unregistered Words (URW) are the primary difficult points for automatic word segmentation of Chinese text processing. The hybrid approach handles these difficult points best.

JB[v.20020404] Prn:31/05/2007; 9:27

F: BCT803.tex / p.3 (181-245)

Approaches for automatic segmentation of a Chinese text

2. Automatic Annotation: The automatic annotation for a Chinese text corpus is tagging the Chinese text with POS (Part Of Speech). The main approaches are as follows: –

Tagging with linguistic rules: For this approach, a serious problem is the POS ambiguity. In Chinese, POS ambiguity concentrates mainly on the frequently used words: verb, noun, adjective, etc. verb-noun ambiguity: verb-adjective ambiguity: noun-adjective ambiguity: adjective-adverb ambiguity: verb-preposition ambiguity: verb-adverb ambiguity: noun-verb-adjective ambiguity: noun-adverb ambiguity: other ambiguity:

37.6 % 24.3 % 10.4 % 4.55% 4.04% 2.27% 2.27% 2.02% 12.55%

The disambiguation of POS is based on the linguistic rules, especially the context. For example, “ ” (read as /bai/) is a word with adjective-adverb ambiguity; the disambiguation of “ ” can based on context: (The white swans play on the lake) POS - adjective, meaning - “white” (He makes a fruitless trip) POS - adverb, meaning - “in vain, for nothing”

–

The tagging precision rate is very low: 77%. Tagging with HMM (Hidden Markov Model): The processing procedure of POS tagging with HMM can be divided into the following steps: (1) Manually creating the training set, then manually annotating the training set and extracting the statistical data from this training set. (2) Constructing N-gram statistic model in accordance with the results from the statistical data of the training set: There are two kinds of parameters in the model: lexical probability (emission probability) and contextual probability (transition probability). (3) Generally the bi-gram model and tri-gram model are used in POS annotation. The higher the level of the model’s gram, the more exact the annotation.



JB[v.20020404] Prn:31/05/2007; 9:27



F: BCT803.tex / p.4 (245-314)

Zhiwei Feng

(4) Tagging the corpus with CLAWS (Constituent-Likelihood Automatic Word-tagging System) algorithm. The tag string with the maximum probability will be considered as the best result of POS tagging. Our results: Number of sentences 100 200 300 400 500 Precision rate 96.45 96.64 96.74 96.79 96.87

–

The tagging precision rate can reach 96.87%. Tagging with TBED (Transformation-Based Error-Driven) based on the Brill method: Eric Brill proposed the TBED method in the papers Transformation-based Error-driven Parsing (International Workshop on Parsing technologies, 1993a) and A Corpus-based Approach to Language Learning (Pennsylvania PhD dissertation in 1993b). According to the Brill method, we improved our approach. The tagging procedure is as follows:

(1) Initiating the text in the corpus and then obtaining the initial statistical result from the text. (2) Comparing the initial result with the manual tagged result (the correct result). (3) Generating the rule space of waiting to study in accordance with the rule template which was pre-defined. (4) Applying the rule space to annotate the corpus text to obtain the new effective rules. (5) Placing the new rules into the rule series and then the ordered rule series will be constructed. (6) Automatic learning of the rules according to the new ordered rule series. (7) Constructing the error matrix. (8) Dynamically tracing the learning process with the error matrix. (9) Optimizing rules. (10) The end of the automatic learning process. Our results: Number of sentences 100 200 300 400 500 Precision rate 94.89 95.07 95.09 95.14 95.16

The tagging precision rate can reach 95.16%. Because the initial statistical result was automatically obtained by a computer, the precision rate is not as high as the HMM approach.

JB[v.20020404] Prn:31/05/2007; 9:27

F: BCT803.tex / p.5 (314-389)

Approaches for automatic segmentation of a Chinese text

–

Hybrid (HMM + TBED) approach: (1) Initialization of algorithm with HMM. (2) Learning the rules with TBED. (3) Tagging the corpus with learned rules. Our results: Number of sentences 100 200 300 400 500 Precision rate 97.77 97.79 97.81 97.83 97.86

The tagging precision rate can reach 97.86%. The segmentation and annotation of the text corpus are two key tasks in Chinese text corpus processing. It is difficult to deal with the tasks with only a single method. The hybrid approach combines the rule-based method, statisticbased method, and automatic learning method: it can improve the precision of segmentation and tagging. After the segmentation and annotation of the text corpus, we shall parse the sentence and give the bracket to every phrase and sentence of the text. In this case, the text corpus shall become the tree bank. In this paper, we concentrate on the segmentation and annotation; we shall discuss the bracketing and parsing in another paper.

Appendix Original Text:

English translation: The sun has set. I look at the cloud. It likes as the mountain, as the sea. The mountain is red. The sea is also red. On the mountain there is not the tree, on the sea there is not the fish. Ah! I love the red mountain, I also love the red sea. I love the nice cloud.



JB[v.20020404] Prn:31/05/2007; 9:27



F: BCT803.tex / p.6 (389-433)

Zhiwei Feng

Segmentation:

Annotation (Tagging):

Bracketing:

Explanation of tags: Noun - n,Verb - v, Adjective - a, Number word - m, Pronoun - r, Measure word - q, Adverb - d, Mood word - y, Auxiliary word for grammar - u, Position word - f, Punctuation - w, Sentence - zj, Clause - fj, Separate sentence - dj, Noun phrase - np, Verb phrase - vp, Quantitative phrase - mp, Duplication form of verb - vbar.

JB[v.20020404] Prn:31/05/2007; 9:27

F: BCT803.tex / p.7 (433-457)

Approaches for automatic segmentation of a Chinese text

References Brill, E. 1993a. “Transformation-based Error-driven Parsing.” International Workshop on Parsing Technologies. Brill, E. 1993b. A Corpus-based Approach to Language Learning. Pennsylvania PhD Dissertation. Feng Zhiwei. 1995. Natural Language Processing By Computer. Foreign Language Education Publishing House, Shanghai. Feng Zhiwei. 1995. Foundation of Computational Linguistics. Commercial Press, Beijing.



JB[v.20020404] Prn:15/02/2007; 13:51

F: BCT804.tex / p.1 (45-109)

Distance between languages as measured by the minimal-entropy model Plato’s Republic – Slovenian versus 15 other translations Primož Jakopin

In this paper, a language model, based on probabilities of text n-grams, is used as a measure of distance between Slovenian and 15 other European languages. During the construction of the model, a Huffman tree is generated from all the n-grams (n = 1 to 32, frequency 2 or more) in the training corpus of Slovenian literary texts (2.7 million words), and appropriate Huffman codes are computed for every leaf in the tree. To apply the model to a new text sample, it is cut into n-grams (1–32) in such a way that the sum of model Huffman code lengths for all the obtained n-grams of new text is minimal. The above model, applied to all (16) translations of Plato’s Republic from the TELRI CD ROM, produced the following language order (average coding length in bits per character): Slovenian (2,37), Serbocroatian (3,77), Croatian (3,84), Bulgarian (3,96), Czech (4,10), Polish (4,32), Russian (4,46), Slovak (4,46), Latvian (4,74), Lithuanian (4,94), English (5,40), French (5,67), German (5,69), Romanian (5,76), Finnish (6,11), and Hungarian (6,47).

.

Introduction

Entropy is a topic at the very core of information theory, the theory which established a close connection between the elements of language, characters of written text, and calculus of electrical signals in communication as it was founded half a century ago by Claude E. Shannon (Shannon 1948). Entropy itself is a measure of indetermination for a given system of messages – the bigger its entropy, the more information required to describe it; a system where all messages are known in advance has zero entropy. When applied to text, entropy is usually connected to indetermination of the next, so far, unread letter of text. The bigger the entropy of text, the smaller the chance that a random reader will correctly guess

JB[v.20020404] Prn:15/02/2007; 13:51



F: BCT804.tex / p.2 (109-191)

Primož Jakopin

the invisible letter n+1 after having read the previous n letters of text. To the dismay of early scholars, computers were not available at the time, and so experiments were limited to small text samples with letter occurrences counted manually. As time went by, entropy slowly faded away from the limelight; one could say that it even became an old-fashioned topic and has only recently been rediscovered as a useful tool in many areas linked to text processing (for example Ratnaparkhi 1998). The enormous computing power now at virtually everybody´s fingertips has made feasible the memory- and CPU-intensive methods that have to handle large substring sets of a given text or even text corpus. . Minimal-entropy language model During the research aimed at estimating the upper bound of entropy for Slovenian literary texts (Jakopin 1999, http://www.ff.uni-lj.si/ hp/pj/disertacija), several language models for Slovenian have been constructed, aimed to confirm the interpolated entropy value of 2,2 bits per character. The larger of two statistically evaluated text corpora, 2.7 million words (60 works written by 41 authors, 46 original and 14 translations, dated from 1858 to 1996), served as a training corpus to obtain the parameters of the model which was later tested on the second, smaller corpus of 0.4 million words (52 works published between 1931 and 1988, the complete opus of a single author, Ciril Kosmaˇc). Both corpora together represent roughly 0,5–1% of the total Slovenian literary output. The best language model was based on the assumption that it is possible to generate a frequency dictionary of all character n-grams up to a given length and frequency from the training corpus and to use it as a coding data base for transmitting any new text of the same language as a file of codes of minimal total length, provided the sender and the receiver both possess the dictionary, the basic knowledge of a certain language in form of character string probabilities. . Dictionary The preparation of the dictionary of the above language model can be summarized in 4 steps: (1) compose a dictionary of all possible n-grams up to a given n which have a required frequency from the reference text corpus; (2) sort the dictionary in descending order on frequency; (3) generate a Huffman tree and compute Huffman code lengths for all the entries; (4) make a target dictionary, a lookup-table, abc sorted, of entries and their H. code lengths.

JB[v.20020404] Prn:15/02/2007; 13:51

F: BCT804.tex / p.3 (191-259)

Distance between languages

Table 1. Plato’s Republic: top 60 n-grams from the English translation with frequencies _ e t a o i n s h r e_ _t

129917 66081 55065 41621 41077 39275 38034 37917 35226 28956 23823 22567

d th l he _th u _a t_ the s_ f c

21776 20420 20111 16781 16728 16329 14653 13985 12618 12549 12542 12230

m d_ _the w , ,_ y in an he_ _i _o

12127 11724 11044 10875 10773 10728 10674 9912 9574 9466 9405 9210

_s p g _w nd n_ b er the_ _the_ at re

9113 8792 8743 8308 8063 7996 7710 7663 7383 7372 7127 6980

r_ nd_ _an y_ ha o_ _h is it f_ and on

6966 6655 6447 6439 6398 6124 5872 5827 5803 5704 5661 5643

Table 2. Top 60 English n-grams with Huffman codes and Huffman code lengths _ e t a o i n s h r e_ _t

000111 6 0000100 7 0100010 7 1010100 7 1010110 7 1100101 7 1101011 7 1101100 7 1111011 7 00110010 8 10000001 8 10001101 8

d th l he _th u _a t_ the s_ f c

10011011 8 11000000 8 11000100 8 000001001 9 000001010 9 000110011 9 001011111 9 001111111 9 010110000 9 010110010 9 010110100 9 011111000 9

m d_ _the w , ,_ y in an he_ _i _o

011111011 9 100001100 9 100101111 9 100111000 9 100111011 9 101000011 9 101000110 9 110010000 9 110101001 9 110111000 9 110111011 9 111010101 9

_s p g _w nd n_ b er the_ _the_ at re

111011011 9 111110001 9 111110100 9 0000011011 10 0001101101 10 0010000010 10 0010010110 10 0010100110 10 0010111000 10 0010111010 10 0011110010 10 0100000000 10

r_ nd_ _an y_ ha o_ _h is it f_ and on

0100000011 10 0100111000 10 0101001000 10 0101001010 10 0101010110 10 0111101101 10 1000010111 10 1000011101 10 1000100010 10 1000101110 10 1000110011 10 1000111001 10

To illustrate such a dictionary, data obtained from the English translation of Plato’s Republic, as published on TELRI CD ROM (Erjavec et al. 1998) are shown in Tables 1 and 2. In Table 1, the top 60 character n-grams with frequencies are given. As expected, the <Space> character (shown as an underscore) is quite on top with 129 917 occurrences in the book of 692 058 characters (18.8% of all). It is followed by letters e (66 081 occurrences or 9.6%), t, a, o, i, n, s, h, and r with the frequency of 28 956 or 4.2%. In the entire table, there are 22 single characters, including <Space> and comma, 29 bigrams, 6 trigrams, and 2 4-grams (_the and the_), and 1 5-gram (_the_). In Table 2, the beginning of the dictionary of n-grams with their respective Huffman codes and their lengths, still sorted in descending order on frequency (after step 3), is shown. The <Space> character on top has a Huffman code 000111, 6 bits long; the following 8 letters e, t, a, o, i, n, s, h have Huffman codes which are 7 bits long; the next 6 n-grams have 8-bit codes; the next 24 n-grams 9-bit codes, and the remaining 21 n-grams in the table have Huffman codes which are 10 bits long. Huffman coding has been selected as it guarantees the shortest total length of a coded text even for finite length of the latter.



JB[v.20020404] Prn:15/02/2007; 13:51



F: BCT804.tex / p.4 (259-351)

Primož Jakopin

. Algorithm Because the dictionary based on a suitable training corpus of the language in question has been prepared, it is now possible to proceed with the algorithm of the model. It cuts a new text on which it is applied into a sequence of character strings in such a way that the resulting file of Huffman codes, which represent the text, is the shortest possible, and it consists of 3 steps: (1) at the current position in new text, find the longest n-gram from the dictionary which fits and put temporary dividing point at its end in text; (2) move the dividing point backwards place by place until the number of bits per character for the combined 2 n-grams – the one from the current point to the dividing point and the longest fitting one from the dividing point on – will not be worse (bigger); (3) make the obtained dividing point the new current position and proceed with step 1 until the end of text is reached. To illustrate the algorithm, let us again take the dictionary obtained from all the n-grams up to n = 32 and with a frequency 2 or more (which occurred at least twice) of the English translation of Plato’s Republic and use it to cut the sentence: Tell_me,_then,_what_is_the_nature_of_this_faculty_of_dialectic? It is a typical discourse sentence from Plato’s work; spaces have again been replaced by underscores. To obtain the first dividing point, we need the dictionary n-grams from Table 3. Table 3. Substrings engaged in the first division of above sentence with H. code lengths T 13 Te 18 Tel 18 Tell 18 Tell_ 18 Tell_m 18 Tell_me 18 Tell_me, 19 Tell_me,_ 19 Tell_me,_t 20 Tell_me,_th 20 Tell_me,_the 20 Tell_me,_then 20 Tell_me,_then, 20 Tell_me,_then,_ 20

w9 wh 11 wha 13 what 13 what_ 13 what_i 16 what_is 16 what_is_ 16 what_is_t 18 what_is_th 18 what_is_the 18 what_is_the_ 19

_6 _w 10 _wh 11 _wha 13 _what 13 _what_ 13 _what_i 16 _what_is 16 _what_is_ 16 _what_is_t 18 _what_is_th 18 _what_is_the 18 _what_is_the_ 19

,9 ,_ 9 ,_w 13 ,_wh 14 ,_wha 17 ,_what 17 ,_what_ 17 ,_what_i 19 ,_what_is 20 ,_what_is_ 20 ,_what_is_t 21 ,_what_is_th 21 ,_what_is_the 21 ,_what_is_the_ 21

JB[v.20020404] Prn:15/02/2007; 13:51

F: BCT804.tex / p.5 (351-419)

Distance between languages

Letter T has a 13-bit long Huffman code, and in the dictionary, all subsequent character strings from the beginning of sentence up to Tell_me,_then,_ (Huffman code length 20 bits) can also be found. Therefore we put the temporary dividing point after the third space in the sentence. The longest following string in the dictionary, which fits after the third space in the sentence, is what_is_the_ with a Huffman code length of 19. So the temporary result is 1.44 bits per character (two strings with a total length of 27 characters [15 plus 12], coded by 39 bits [20 plus 19]). To make sure that the dividing point is optimal, the algorithm proceeds with a point one place backwards. This gives the strings Tell_me,_then, (20 bits) and _what_is_the_ (19 bits) with the same average value of 1.44 bits per character (again 39 divided by 27). The next iteration, however, with the strings Tell_me,_then (20 bits) and ,_what_is_the_ (21 bits) gives a poorer value of 1.52 bits per character (41 bits divided by 27). The iteration stops here; the temporary dividing point can be assumed to be correct, and the algorithm may proceed with the next dividing point. The sentence of the example would be cut into the following 5 character strings: Tell_me,_then,_|what_is_the|_nature_of_this_|faculty_of_|dialectic?| . Training corpus As already stated, the n-gram model for Slovenian was based on the literary corpus of 2.7 million words. The n-gram dictionary built from it was quite considerable in size: the number of different n-grams grows very fast as the frequency threshold is lowered as shown in Table 4. The length limit of strings considered, set at 32, was on the safe side, as verified in Figure 1, where length distributions are shown for different and all n-grams of the training corpus (length marked as d and probability as p). The distribution of different n-grams is asymmetric to the right with rapid rise to the maximum of 9 characters (nearly 12%) and then slowly tails off towards the length of 30. From the right, more important histogram, where all n-grams are taken into account, it is clear that the limit could safely be set at a lower figure of 20.

Table 4. Number of n-grams (n = 1–32) from 2.7 MW corpus for different minimal frequencies minimal frequency

different n-grams

all n-grams

10 5 2

1.454.567 3.427.898 16.401.724

120.517.900 133.038.634 164.019.971



JB[v.20020404] Prn:15/02/2007; 13:51



F: BCT804.tex / p.6 (419-476)

Primož Jakopin

Figure 1. Length distribution of different (left) and all (right) n-grams (frequency ≥2) from the 2.7 MW corpus

. Plato’s Republic After the n-gram model for Slovenian was built and tested, a question surfaced almost immediately – how would it fare if applied to some other language, be it of the same language family or a more distant one. In search of a suitable text in various languages, it was impossible to overlook the classic masterpiece, Republic by Plato, extensively researched during the TELRI I project (1995–1998) and available in several European languages on TELRI CD ROM (Erjavec et al. 1998). This set of translations in itself is a remarkable achievement of an inspiring and far-reaching initiative. The languages included (abbreviations are given in parentheses) are: Bulgarian (bg), Croatian (cr), Czech (cs), English (en), Finninsh (fi), French (fr), German (ge), Hungarian (hu), Lithuanian (lt), Latvian (lv), Polish (pl), Romanian (ro), Russian (ru), Serbocroatian (sc), Slovak (sk), and Slovenian (sl). All the texts have been, within limits, unified and transcribed into Latin. The Bulgarian translation on CD has already been transcribed with hacek characters ˇc, š, ž interpreted as ch, sh and zh. The Russian translation was given in Cyrillic coding and transcribed in a standard manner for this task. The texts in Czech, Slovak, and Polish languages have been stripped of their diacritic characters on vowels to obtain a more realistic behaviour of the model. The richness and variety Table 5. Plato’s Republic – translations on TELRI CD ROM 16 176 9.954.979 1.642.597

languages different characters characters words

JB[v.20020404] Prn:15/02/2007; 13:51

F: BCT804.tex / p.7 (476-564)

Distance between languages

Table 6. Character frequencies for all the 16 translations of Plato’s Republic combined

Table 7. Plato’s Republic translations – the most common letters with frequencies in percentage bg 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12.

a o e i t n v s r d k l

15.5 10.0 8.9 8.3 8.3 6.2 5.0 4.8 4.2 3.8 3.2 2.9

cr a i o e n t j d s r k u

11.9 10.5 10.3 8.9 6.0 5.0 4.7 4.4 4.4 4.3 3.8 3.6

cs o e a n t s v i l d m p

8.8 8.2 6.4 6.1 6.0 4.6 4.5 4.2 3.8 3.8 3.7 3.2

en e t a i o n s h r d l u

12.2 10.3 7.9 7.6 7.6 7.0 7.0 6.5 5.3 4.0 3.7 3.0

fi a i t n e s ä l o k u m

11.7 11.2 9.8 9.1 8.3 7.4 6.1 5.2 5.1 4.9 4.3 3.9

fr e s i t n a u r l o c d

15.2 8.9 7.6 7.2 7.1 6.8 6.7 6.0 5.7 5.4 3.6 3.6

ge e n i r s t d h a u l c

17.1 11.4 7.9 7.0 6.3 5.6 5.5 5.3 5.1 3.9 3.4 3.4

hu e a n t s l k z g m i o

10.4 9.3 6.8 6.6 6.5 6.2 5.6 5.0 4.2 4.0 4.0 3.9

lt i a s t e r k u n o m p

14.7 12.6 8.0 6.3 5.4 5.4 5.0 4.8 4.8 4.6 3.4 2.7

lv a i s t e u r n a¯ k m v

11.8 10.0 7.9 7.6 6.0 5.1 4.8 4.6 4.3 4.1 3.6 3.6

pl i a e o z n w t y s d c

9.7 8.6 8.3 7.4 5.7 4.9 4.4 4.3 4.1 3.9 3.9 3.7

ro e i a r c t n u s à l d

12.9 10.1 9.7 6.8 6.3 6.1 5.9 5.9 4.3 4.2 4.1 3.5

ru o e a t i n s v r l d m

11.6 9.4 8.9 7.5 7.1 6.3 5.6 4.5 3.7 3.7 3.5 3.5

sc a o i e n t s r d j m u

11.7 11.1 10.0 8.8 6.1 4.8 4.6 4.5 4.2 4.2 3.8 3.7

sk o a e n i t s v r k m d

9.9 9.6 8.6 5.5 5.5 5.5 5.4 4.2 4.0 3.9 3.7 3.4

sl a e o i n r t j s v m l

10.3 10.1 9.8 9.1 6.8 5.3 4.9 4.8 4.7 4.1 3.7 3.7

of the languages can be observed even on the basic statistical level; the frequencies of all the characters of all 16 translations combined as one file are shown in Table 6. To gain better insight into basic quantitative language features let us add some more easily derived data such as the top 12 letters for any of the 16 languages, which are presented in Table 7. While the vowels a, e, i, and o typically occupy the first positions in Slavonic languages, some other highly placed letters, such as t in English or n in German, also have a lot to tell. Quite interesting, and somewhat closer to a linguistic eye, are the lists of the most common wordforms, shown in Table 8. The footprint of a particular lan-



JB[v.20020404] Prn:15/02/2007; 13:51



F: BCT804.tex / p.8 (564-705)

Primož Jakopin

Table 8. Plato’s Republic translations – the most common wordforms

1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12.

bg

cr

cs

en

fi

fr

ge

hu

lt

lv

pl

ro

ru

sc

sk

sl

i da na se e ne za ot tova toj cˇ e po

i da se je ne u a što bi to za dakle

a se to je že v na by co i o tak

the and of to that is in it he said a I

ja on että niin se ei kuin hän mutta jos siis sen

de et à la que il les qui le ce en pas

und die der nicht sie das den er zu in ist wir

a az hogy Szókratész s nem és is Glaukón ha azt e

ir kad o tai yra ar iš jis taip kaip tik ne

un ir par vai t¯a ka bet kas tas to k¯a ar¯a

i to nie si¸e a w z˙ e z na tak jest co

s¸ i de sà nu în cà mai pe ce cu se ar

i ne cˇ to v eto a to kak tak že my ty

i da je se u ne to a što tako na bi

a sa Sokrates to je že Glaukon ako v na cˇ o aj

in je ne se v da ki ali to za tako pa

guage is even more visible from this table as well as an unusual connection between translations for quite different, yet geografically close languages, Hungarian and Slovak.

. Distance between Slovenian and other languages The model using the dictionary derived from the 2.7 million sample of Slovenian literary texts (which included Plato’s Republic) has been applied to the set of 16 texts, 16 translations of Plato’s work. The results and some additional data are summarized in Table 9. The table is sorted in descending order on the average number of bits per character achieved during the coding of every text by the model. Every line in the table has 8 entries: rank according to value in the last column; language name; translator; publication year of the edition, used for electronic version of the text (it is usually the same as the year of translation); first person of the team responsible for the transfer into electronic form; number of words in translated text; number of characters; and the average number of bits per character produced by the model. Unknown values are indicated by a hyphen. As could be expected from linguistic common sense, according to the model the two languages most close to Slovenian (coded to 2,37 bits per character by the model) are Serbian (3,77 bits per character) and Croatian (3,84), followed by Bulgarian (3,96), Czech (4,10), Polish (4,32), Russian (4,46), and Slovak (likewise 4,46 bits per character). It is interesting to notice that, at least according to this model, of the two most distant languages, Finnish (6,11) is closer to Slovenian than Hungarian (6,47).

JB[v.20020404] Prn:15/02/2007; 13:51

F: BCT804.tex / p.9 (705-747)

Distance between languages

Table 9. Minimal-entropy model for Slovenian, applied to translations of Plato’s Republic

1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16.

language

translation

year

electr. version

Slovenian Serbocroatian Croatian Bulgarian Czech Polish Russian Slovak Latvian Lithuanian English French German Romanian Finnish Hungarian

Jože Košar A. Vilhar, B. Pavlovi´c Damir Salopek – Radislav Hošek Wladyslaw Witwicki – Július Španár Gustavs Lukstinš Jonas Dumˇcius Paul Shorey – Karl Vretska Andrei Cornea Marja Itkonen-Kaila Szabó Miklós

1976 1983 1976 – 1993 1991 – 1990 1982 1981 – 1993 1982 1986 – 1984

Primož Jakopin Duško Vitas Marko Tadi´c Patrice Bonhomme ˇ František Cermák Michal Jankowski – Alexandra Jarošová Andrejs Spektors – – – Joachim Hohwieler Dan Tufis Anna Mauranen Tamás Váradi

words

characters

b/c

92.741 107.506 92.870 112.676 110.466 107.559 99.503 99.661 45.238 85.144 129.331 142.624 104.876 131.064 75.800 105.538

565.604 613.082 532.497 678.131 636.201 645.532 649.102 622.463 290.508 584.318 692.058 817.658 641.333 658.804 582.522 728.501

2,37 3,77 3,84 3,96 4,10 4,32 4,46 4,46 4,74 4,94 5,40 5,67 5,69 5,76 6,11 6,47

. Conclusion The paper presents a method which is based on quite a simple idea yet which relies heavily on the brute force of today’s powerful computers. The method has helped to quantify the distance between Slovenian and 15 other European languages and could be used, as suitable text corpora in different languages become more widely available, to make a cross-table where every language is compared to all the others. It would add a new point of view to the understanding of an old problem.

References Erjavec, T., A. Lawson and L. Romary (eds.). 1998. “East Meets West – A Compendium of Multilingual Resources.” TELRI, Institut für Deutsche Sprache, Mannheim. Jakopin, P. 1999. “Upper Bound of Entropy in Slovenian Literary Texts.” PhD Dissertation, Faculty of Electrical Engineering, University of Ljubljana. Ratnaparkhi, A. 1998. “Maximum Entropy Model for Natural Language Ambiguity Resolution.” PhD Dissertation, Dept. of Computer and Information, University of Pennsylvania. Shannon, C. E. 1948. “A Mathematical Theory of Communication.” Bell System Technical Journal 27: 379–423, 623–656.



JB[v.20020404] Prn:15/02/2007; 13:55

F: BCT805.tex / p.1 (46-162)

The importance of the syntagmatic dimension in the multilingual lexical database R¯uta Marcinkeviˇcien˙e This paper describes the idea behind a multilingual database (Muldi) designed to incorporate five constituent parts: monolingual and multilingual corpora, monolingual lexicons, lists of translation equivalents, and terminological records. The emphasis in Muldi is on the presentation, analysis, and description of syntagmatic information contained in lexical items. Types of translation equivalents as well as the problem of relationship between dictionary and corpus translation equivalents is also considered.

.

Introduction

Traditional dictionaries differ in the amount of paradigmatic and syntagmatic information they provide: the number of synonyms, antonyms, hypernyms, and hyponyms, versus idioms, phrases, word combinations, and collocations. Syntactic patterns and example sentences also belong to the syntagmatic dimension. Traditional dictionaries, since they are of a limited size and have other restrictions inherent to book technology, can include only selected information of both types. By contrast, electronic dictionaries or lexical DBs are devoid of the limitations of two-dimensional presentation; they are free to include more syntagmatic information of various types. Storage and the amount of contextual evidence is not a problem in multilingual lexical databases (DBs). Nevertheless, due to tradition or the existing views on language structure, paradigmatic information predominates even in some well-known multilingual eletronic lexical DBs, for example, EuroWordNet with its eleborate system of synsets and hierarchical word relations such as synonymy, antonymy, hyponimy, meronymy, etc. The aim of this paper is to demonstrate the value and the importance of the syntagmatic dimension in multilingual lexical DBs by presenting one possibility, that is, a particular multilingual lexical DB and the idea behind it.

JB[v.20020404] Prn:15/02/2007; 13:55



F: BCT805.tex / p.2 (162-196)

R¯uta Marcinkeviˇcien˙e

. Muldi (Multi Lingual Database Interface) During the first decade of Lithuanian independence (1990–1999), a great number of books in philosophy and social sciences were translated into Lithuanian to compensate for the 50 years of imposed Soviet ideology in these two areas. As a result, many neologisms appeared while previously used words changed their meaning. These changes, together with their source language counterparts, were recorded by the translators in the traditional manner in filing cabinets. Newly coined words, together with the old ones, formed a collection of terms that was to be made available to the public. For the sake of accuracy (since we can hardly talk about such a phenomenon as ‘terms of philosophy’), ‘terms’ will be called ‘keywords in philosophy and social sciences’. Multilingual lexical items were presented as isolated words and word combinations; therefore, they had to be suplemented with their immediate contexts for extraction of factual, linguistic, pragmatic, and other types of information. It was decided to give this ever growing collection of multilingual lexical items and the related corpus of texts the form of a lexical DB. This is how Muldi came into being. Muldi is the product of a joint venture by two universities, Twente University (The Netherlands) and Vytautas Magnus University (Lithuania), and it was designed by Jeroen Klumper in 1998 in Kaunas under the supervision of Dr. Jan Kuper and myself. Muldi is an object-oriented tool. This object-oriented character causes a relatively simple extendibility, an independent behaviour of various parts of the tool. From the lexicographic point of view, Muldi is designed to supply translators and other interested persons with lexicons in eight languages (Latin, Old Greek, French, German, English, Russian, Polish, and Lithuanian). These lexicons are also linked by translation equivalences, in most cases from all seven languages into Lithuanian, and also from one language into another so that a string of TEs in all eight languages can be formed. Muldi includes the following components: monolingual lexicons, monolingual corpora, terminological records, list of TEs, and, if available, parallel concordances. I will discuss the importance of each one of these components for the presentation of lexical items. In addition, the relationship between dictionary and corpus TEs will be elaborated. . Monolingual corpora The monolingual corpora form the point of departure for the construction of the DB since the keywords in philosophy in the text used are derived from them. Monolingual corpora usually are made of original texts. In the case of Lithuanian, however, texts of translations are of equal importance since they can be used both for later alignment and compilation of parallel corpora and as texts on their

JB[v.20020404] Prn:15/02/2007; 13:55

F: BCT805.tex / p.3 (196-247)

Syntagmatic dimension in the multilingual lexical database

own. As such, they can be compared for all possible types of analyses, including translationese. Texts translated into other languages are also useful for similar purposes (for obvious reasons Old Greek and Latin corpora contain only original texts). Monolingual corpora, if they meet requirements of equal size, design, and composition, form a comparable corpus. Even if they are not comparable, monolingual corpora complement the extracted lexical items since they contain sentences that give the context needed to expand a lexical item to the limits of the translational unit. It is repeatedly stated by a number of authors that the equivalent is established for a word as a constituent of a lexical and/or grammatical pattern, not for the isolated lexical items. In other words, translation units are usually phrases (Sinclair 1996b, Tognini-Bonelli 1996, Teubert 1996). Thus concordances can be obtained from monolingual corpora, and linguistic contextual behavior of a lexical item can be analyzed. In the absence of a parallel corpus, comparable monolingual corpora can be used by translators to find out whether their candidate for TE is a naturally used lexical item of the target language for the context pattern under consideration. . Monolingual lexicons The monolingual lexicons are lists of the keywords in each of the above mentioned languages. They can be extracted manually or semi-automatically from both monolingual and parallel corpora. If devoid of their equivalence relations, lexicons can be independent lists of words in eight languages: Latin, Old Greek, French, German, English, Russian, Polish, and Lithuanian. They differ both in size and content. At the moment there are ca. 18,000 entries or, rather, strings of equivalents with ca. 10,000 Lithuanian, ca. 7,000 English, 13,000 German, 9,000 Russian, 200 Polish, and 600 Latin keywords in philosophy. The initial lists of equivalents have been derived from parallel corpora manually; later on it will be possible to obtain them automatically from electronically stored parallel corpora. A closer look at the filing cabinet revealed that in most cases the lexicons consist of abstract nouns used in a wide range of contexts not limited to texts of philosophy. It did not happen by chance that abstract nouns required evidence about their contextual patterns. Abstract nouns denote fuzzy concepts and belong to the so-called empty lexicon (Sinclair 1996a: 113) that is unstable and depends heavily on lexical relationships and lexical contexts for their meaning. It manifests itself in the fact that abstract nouns, much more often than the concrete ones, form the basis for both metaphorical and non-metaphorical collocations (Lakoff 1980: 51, Kövecses 1986: 132, Alverson 1994: 41). These collocations, that is, words in their frequent and regular contexts, form translation units that influence considerably the choice of TEs. Moreover, nouns and not predicative parts of



JB[v.20020404] Prn:15/02/2007; 13:55



F: BCT805.tex / p.4 (247-313)

R¯uta Marcinkeviˇcien˙e

speech serve as anchor points in translation; therefore, they are primary and more important than the latter for comparative studies. Finally, translation of abstract nouns poses additional problems because of numerous cognates that are the best candidates for the so-called false friends of translators. . Terminological records The so-called terminological records (the concept of ‘terminological record’ is found suitable and taken from Sager 1990: 130–163) are based on information derived from the monolingual corpora. A multilingual DB would be deficient if it consisted only of two items – isolated lexical items and corpora. The users of such a DB, if they failed to find a suitable TE, would have to deal with a great amount of data derived from the corpora. It is unlikely that anyone would have enough time to digest a concordance of 50,000 lines, which is typical of mediumfrequency words in corpora a containing ca. 50,000 million words. Therefore an intermediate level, the so-called terminological record, has been designed to contain corpus-based lexical entries. These entries provide factual, pragmatic, and linguistic information as suggested by terminologists. Linguistic information, in addition to links to paradigmatically related nouns such as synonyms, antonyms, hypernyms, and hyponyms should present generalizations from concordances, permitting a considerable reduction of the material. In other words, linguistic information should also have a syntagmatic dimension. Actually it should present the user with a contextual/collocational or usage pattern/profile as it reveals itself in a corpus of texts in the original language (see Hanks 1998 for a fine example of a usage profile). This pattern or profile belongs to the realm of lexical syntax, the field in which most of the existing traditional dictionaries are deficient due to their objective limitations and restrictions (cf. Teubert 1996: 241, Sinclair 1999). A corpus should be large enough as well as representative enough to allow such generalizations of the context. If properly derived, a contextual profile can help the user to find a suitable equivalent in the target language. The size of a terminological record also poses a problem. If too big, it could be as time-consuming for a user as the concordance itself. Whatever a contextual profile might contain (the most relevant syntactical patterns together in with lists of collocates, complete phrases, or idioms) no matter how long or short it might be, it should give a very clear idea about a word in context. . Parallel corpora Parallel corpora in Muldi are original philosophical texts aligned to a paragraph and/or sentence level with their translations into one or several languages. Though

JB[v.20020404] Prn:15/02/2007; 13:55

F: BCT805.tex / p.5 (313-356)

Syntagmatic dimension in the multilingual lexical database

not readily available as monolingual corpora, parallel corpora form a very important part of a multilingual DB. It has been observed by many authors that parallel and multilingual corpora can offer a rich source of translation equivalence and thus contribute considerably to bilingual and multilingual lexicography. Parallel corpora are especially helpful in the case of context-dependent translation equivalence. Traditional bilingual dictionaries are very limited in this respect except for those providing numerous phrases with the entry word. Nevertheless, the number of contextual uses of an entry word usually does not exceed several dozen even in the case of frequently used polisemous words. Therefore parallel corpora are of paramount importance for the analysis of translation equivalence especially since they contain two parts – the SL text, on one hand, which reveals the context of a translation unit and motivates the choice of an appropriate TE, and the TL text which displays a variety of recurrent or unique TEs. It has been rightly observed that “. . . a parallel corpus of a reasonable size contains more knowledge about translational equivalence than any bilingual desk dictionary” (Teubert 1996: 249). The implication behind traditional bilingual lexicography is that it is inefficient and does not provide a full range of possible translation equivalents. Besides, those given as TEs in a dictionary cannot be used in a wide range of contexts. Researchers report that dictionary TEs seldom coincide with real TEs found in parallel corpora, because prototypical equivalents are not usable in most of the contexts. They are helpful to understand a lexical unit in the SL, but, in order to translate it, other TEs have to be used. It seems that the relationship between translation equivalents given in a bilingual dictionary and those derived from parallel corpora demands a closer look. . Translation equivalents Translation equivalence is a very loose term. It denotes a scale of resemblance or a degree of equivalence that hardly ever approaches 100% (Teubert 1996: 247). The list of TEs presents one or several possible target language alternatives for a source language lexical item. TEs range from the so-called systemic, prototypical, canonical, that is, context-free dictionary equivalence, to context dependent or contextual equivalents. Since these two extreme points on the scale of translation equivalence do not exhaust the types of possible TEs, it is useful to refer to the typology of TEs given by translation theorists. Since linguistic theory of translation is based on the comparison of two texts, one in the source and the other in the target languages, equivalence is understood as the relationship between two texts and not two languages. Typology of TEs is also based on the level of a text. It is obvious that textual equivalence differs from the linguistic equivalence that exists on the level of comparative studies



JB[v.20020404] Prn:15/02/2007; 13:55



F: BCT805.tex / p.6 (356-407)

R¯uta Marcinkeviˇcien˙e

of two languages. The latter takes into account the relationship between two systems and not their particular manifestations in a specific text. Thus the theory of translation equivalence, to the degree that it takes systemic relationships into consideration, can be equally helpful. Parallel corpora are collections of texts, therefore, TEs obtained from them can be treated as remaining on purely textual level. Nevertheless, if they are big enough and if lexical items under consideration are used frequently, cross-linguistic generelizations can safely be made. They outweigh text-dependent or translator-subjective cases and are useful for bilingual or multilingual lexicography. Very few translation theorists take into consideration the systemic correspondences that are usually given in bilingual dictionaries. Recker’s theory, based on regular correspondences between the source language and the target language, is an exception (a difference in terminology has to be noted here, in Soviet linguistics, the generic term ‘correspondences’ is used instead of ‘equivalence’; the latter is reserved for a specific type of correspondence). These corrrespondences can be classified into 1) equivalents, 2) alternative or contextual correspondences, and 3) all kinds of translation transformations. Equivalents are said to be independent of context, constant and regular correspondences, coinciding with systemic or dictionary equivalents, having parallel grammatical structures. Terms and nonpolysemous words belong here. Equivalents are language phenomena, while correspondences and transformations are considered as belonging to the textual level. Translation transformations occur due to various kinds of text-dependent translation procedures such as generalization, concretisation, logical developments of concepts, compensations (Recker 1974, also Koptjevskaja-Tamm 1989). This typology of TEs is more exhaustive than a mere juxtaposition of systemic and contextual or dictionary and corpus equivalence. Besides regular and independent equivalents on one end of the scale and translation transformations on the other, it includes the intermediate type of TEs, that is, contextual translation correspondences. The choice of contextual correspondencies is caused by structural differences between the two languages and is governed by contextual factors. Still, they are regular and predictable for the same pair of languages and the same contextual situations; therefore, their inclusion into bilingual dictionaries, together with the immediate context, is most desirable. Thus collocations become an important issue in the grounding of translation equivalence. All three types of correspondences are closely connected with the boundaries of the unit of translation. Equivalents appear on the level of isolated lexical items; the choice of correspondences is governed by the immediate context, while transformations occur due to the translator’s attempts to make up for the inevitable loss of information on the level of the whole text. If both SL and TL nouns are stripped of their immediate contexts, some of the particularly contextdependent TEs might seem inappropriate. Nevertheless, they can fit some contexts

JB[v.20020404] Prn:15/02/2007; 13:55

F: BCT805.tex / p.7 (407-512)

Syntagmatic dimension in the multilingual lexical database

better than systemic or dictionary equivalents. Transformational correspondences (sometimes also called functional or communicative) are caused by much wider contexts than collocations, sometimes by the message and style and genre of the translated text, sometimes by the subjective preferences of the translator. Being irregular, text-dependent, and unpredictable, they are of minor importance for comparative studies and lexicography. As it will be demonstrated later, the three types of correspondences fully cover the factual data. How do the three types of correspondences that show up in a parallel corpus relate to dictionary equivalents? Some researchers claim that “The dictionary (prototypical) equivalent of a source language word (in one particular) meaning is assumed to cover the conceptual core of the former; therefore, it is intended to convey the definition of the source language word meaning given in a monolingual dictionary. The prototypical equivalent is unusable in many contexts” (Jarošová 1997: 73). According to this view, lexical units that are given as TEs in traditional bilingual dictionaries seem to serve as explanations of different meanings of the isolated word under consideration and not as ‘real life’ correspondences in the TL. We tried to investigate the relationship between dictionary and corpus TEs with the help of a group of six lexical items chosen for the comparison of their dictionary and textual TEs. Table 1 shows TEs from the English – Lithuanian parallel corpus of Orwell’s novel 1984 for each of the six items. The “total” column gives the total number of the occurences in the corpus. The “multiple occurences” column reveals how many of the occurences are translated in the same way more than once (the number of times is given next to the item); the “single occurences” column shows how many TEs occur only once. The final column indicates those cases where no equivalent is discernable because the translator uses a completely different structure or avoids using a TE for some other reason. These TEs, given in bold, coincide with the dictionary equivalents. Such a presentation of TEs allows one to see the number of dictionary equivalents that coincide with either multiple or single occurences of corpus TEs. As can be seen from the table above, four abstract nouns, namely, consciousness, brain, thought, and understanding are predominantly translated with the help of their dictionary equivalents. Their non-dictionary correspondences belong to the type of translation transformations that are irregular and quite accidental. Knowledge occupies an intermediate position since its non-dictionary equivalents are contextual correspondences synonymous to the dictionary equivalents. In the case of the translation of mind, few of the equivalents listed in the bilingual dictionary were found in the translated sentences. The analysis of one novel parallel corpus has shown that dictionary TEs of mind are neither numerous (only three nouns out of ten) nor very frequent. Mind is a typical specimen of the so-called ‘rich translation equivalence’ as opposed to ‘basic translation equivalence’ usually recorded in the dictionaries



JB[v.20020404] Prn:15/02/2007; 13:55



F: BCT805.tex / p.8 (512-533)

R¯uta Marcinkeviˇcien˙e

Table 1. Corpus occurences lexical items

total

multiple occurences

single occurences

omissions

consciousness

22

sąmonė 16

sąmoningumas, mintis, įprotis, pajautimas, suvokimas

1

brain

16

smegenys 12, protas 2

korifėjus, galva

0

knowledge

23

žinojimas (žinoti) 7, žinios 4, išmanymas 2, mokėjimas 2

orientuotis, patylėti, patirtis, sąmonė, supratimas, suvokimas

2

mind

99

sąmonė 39, galva 14, mintis 13, protas 5, vaizduotė 2

atmintis, mąstysena, pasaulėži¯ura, smegenys, širdis

12

thought

144

mintis 113, mąstymas 13, galvoti 10

manyti, dingtelėti, prisiminti, užsimiršti

4

understanding

11

supratimas (suprasti) 9

neišmanymas

1

(Dickens and Salkie 1996: 553). Rich translation equivalence implies a wider range of translation strategies in comparison with those few that are given in bilingual dictionaries. In our opinion, the difference between dictionary and corpus TE reveals itself not only in quantity but also in quality, that is, the type of TEs. If transformational correspondences prevail among non-dictionary TEs, they can be safely excluded as accidental. Such a SL lexical item under consideration is said to belong to the group of ‘basic translation equivalence’. Contextual correspondences, dominating numerous non-dictionary TEs, are, however, a sign of rich translation equivalence and demand special attention. It is useful to resort to the linguistic theory of translation and typology of TEs. Thus it can be concluded that the discrepancy between bilingual dictionaries and corpus data appears only in the case of lexical items which, according to Dickens and Salkie (op. cit. 555), deserve special treatment since they exhibit a wider range of TEs than the one given in a dictionary. This special treatment in a multilingual DB could consist of a long list of collocations of the SL containing mind, for example, that would illustrate multiple choices for translation. In other words, greater prominence should be given to the reflection of syntagmatic relations wherever necessary and possible.

. Concluding remarks Muldi has been designed to combine multilingual lexicons, lists of translation equivalents, terminological records, and both monolingual and parallel corpora.

JB[v.20020404] Prn:15/02/2007; 13:55

F: BCT805.tex / p.9 (533-618)

Syntagmatic dimension in the multilingual lexical database

Incorporation of corpora that are used for extraction of evidence is the most conspicuous innovation of this tool. Exploitation of corpora can help bilingual and multilingual lexicography to reach a level of quality that was not possible without corpora. Muldi can be useful because it is open-ended, that is, any of its components can be updated and extended. Parts of it are quite autonomous and can also be used independently of the status and quality of the total product. It is very convenient to be able to restrict oneself to those parts of the DB that are readily filled with a considerable amount of data. One can take one language lexicon, supplemented with a monolingual corpus, or two or more related lists of equivalents and a parallel corpus, or parallel corpora can be searched for equivalents ignoring the TEs given in the lists. Finally, terminological records can be used as electronic dictionaries without switching over to lists or corpora. Thus, Muldi can be viewed as both an interim product for a lexicographer or translator, that is, a half-way dictionary, and a product in itself.

References Alverson, H. 1994. Semantics and Experience. Universal Metaphors of Time in English, Mandarin, Hindi, and Sesotho. Baltimore and London: The John Hopkins University Press. Dickens, A. and R. Salkie. 1996. “Comparing Bilingual Dictionaries with a Parallel Corpus.” EURALEX ’96 Proceedings, ed. by M. Gellerstam et al., 551–559. Göteborg: University of Göteborg. Hanks, P. 1998. “Enthusiasm and Condescension.” EURALEX ’98 Proceedings, ed. by T. Fontenelle et al., 151–166. Liège: University of Liège. Jarošová, A. 1997. “Parallel Corpora and Equivalents not Found in Bilingual Dictionaries.” Proceedings of the Second European Seminar “Language Applications for a Multilingual Europe,” ed. by R¯uta Marcinkeviˇcien˙e and Norbert Volz, 69–76. Mannheim/Kaunas: IDS/VDU. Koptjevskaja-Tamm, M. 1989. “Linguistic Translation Theory in the Soviet Union (1950– 1980’s): a Review.” Reports from the Institute for Interpretation and Translation Studies, University of Stockholm. Stockholm: University of Stockholm. Kövecses, Z. 1986. Metaphors of Anger, Pride, and Love. A Lexical Appraoch to the Structure of Concepts. Amsterdam/Philadelphia: John Benjamins. Lakoff, G. and M. Johnson. 1980. Metaphors We Live By. Chicago/London: The University of Chicago Press. Recker, J. I. 1974. Teorija perevoda i perevodˇceskaja praktika. Moscow: Meždunarodnyje otnošenija. Sager, J. C. 1990. A Practical Course in Terminology Processing. Amsterdam/ Philadelphia. Sinclair, J. 1996a. “The Empty Lexicon.” International Journal of Corpus Linguistics 1(1): 99–120. Sinclair, J. 1996b. “An International Project in Multilingual Lexicography.” International Journal of Lexicography 9(3): 177–196. Sinclair, J. 1999. “The Computer, the Corpus and the Theory of Language.” Lingua 1(99): 24–32.



JB[v.20020404] Prn:15/02/2007; 13:55



F: BCT805.tex / p.10 (618-627)

R¯uta Marcinkeviˇcien˙e

Teubert, W. 1996. “Comparable or Parallel Corpora?.” International Journal of Lexicography 9(3): 238–265. Tognini-Bonelli, E. 1996. “Towards Translation Equivalence from a Corpus Linguistics Perspective.” International Journal of Lexicography 9(3): 195–217.

JB[v.20020404] Prn:15/02/2007; 13:58

F: BCT806.tex / p.1 (47-111)

Compiling parallel text corpora Towards automation of routine procedures Mihail Mihailov and Hannu Tommola The aim of the research project running at the Department of Translation Studies of the University of Tampere is to collect a Russian-Finnish parallel corpus of fiction. The corpus will be equipped with efficient search and analysis tools. The texts of the corpus will be stored as ordinary text files. Each text will be registered in a Microsoft Access database and supplied with a description. Automated parallel concordancing is being developed for the corpus. The program will find the keywords in text A (Russian), then look for possible translation equivalents of the keywords in language B (Finnish), and then search for the portion of text B (Finnish) where most of the keywords in question can be found.

General It is difficult to surprise anyone with a text corpus: corpus research has “come of age” (Svartvik 1992). There are text corpora of various sizes and types being developed which represent one language, a pair of languages, a certain type of discourse, the history of language, spoken language, etc. Some corpora of the English language are already arriving at the 500 million word benchmark, and one billion words in a corpus does not sound fantastic any more. Text corpora are now being developed not only for the major European languages (English, German, French) but also for less widespread languages (Swedish, Norwegian, Finnish). Since the 60’s and 70’s text corpora development has become much easier. Electronic texts in a large variety of languages can be obtained on the Internet; during the last ten years, scanning and OCR technologies have been considerably improved; new hardware is much faster; software is more user-friendly and more reliable, and memory resources are incomparable with the “good old days” when an IBM PC AT 286 with 1 MB RAM and 40 MB hard disk was considered a powerful and fast computer. Associations like TELRI or ELRA are helping linguists from various countries to unite their efforts in collecting language resources in electronic form. However, developing corpora of the spoken language continues to be a problem. Because speech recognition systems still remain experimental

JB[v.20020404] Prn:15/02/2007; 13:58



F: BCT806.tex / p.2 (111-168)

Mihail Mihailov and Hannu Tommola

prototypes, most of the work in compiling these corpora is done manually. But even here things look much better: there is now enough capacity on disks for huge sound files; the recording quality is much better, etc. The number of corpus-based projects is rapidly growing while the number of scholars who are skeptical about this innovation is declining at the same speed. Text corpora are being used in most current lexicographic projects. Applied linguistic research is another field where text corpora are welcome as an inexhaustible source of empirical information, a polygon for testing various linguistic tools – spell-checkers, OCRs, machine translation systems, NLP systems, etc. At the same time, corpora are also quite useful for theoretical, “armchair linguistics” (Fillmore 1992) as well. Although text corpora are also helpful in language instruction and applied linguistics in general, the main users of text corpora are lexicographers. For compiling dictionaries, researchers need much empirical data in order to build word lists and to collect examples. The data collected from a large corpus is more reliable than what can be obtained from dictionaries. Even if dictionaries are used as a source, it would be much safer to consult a text corpus as well in order to detect new, unregistered meanings, examples of usage, etc. While text corpora are used quite widely for compiling monolingual dictionaries, it is still a problem to use text corpora in bilingual lexicography. Of course it is possible to use two separate corpora, but it would be useful to have parallel texts and tools for looking up words and their translations in parallel contexts.

The project The aim of the ParRus research project running at the Department of Translation Studies of the University of Tampere is to collect a bilingual corpus of parallel texts (Russian and Finnish, initially). The texts will be Russian classical fiction texts and their translations into Finnish. The corpus will not be very large (not necessarily more than about five million running words), but it will be equipped with efficient search tools for the analysis of parallel texts. At present we have a substantial corpus of Russian prose (4.5 million words), and we have started to collect their Finnish translations and to modify the software for running the parallel text corpus. The above mentioned text corpus of Russian prose has been equipped with a set of tools for building word lists and concordances.

JB[v.20020404] Prn:15/02/2007; 13:58

F: BCT806.tex / p.3 (168-224)

Compiling parallel text corpora

Structure of the corpus The present task is to collect Russian fiction texts and their Finnish translations. As a result, we shall have authentic Russian texts (with “normal” Russian language) and translated Finnish texts. The language of the translations will probably be influenced by the Russian originals, that is, it will be different from authentic Finnish prose: the translators might have been under the influence of the language from which they were translating. Therefore, many linguists do not consider translations as a source of “linguistic evidence”. Grammatical forms, syntactic patterns, and word frequencies in the Russian subcorpus will be more or less representative of standard Russian language. The same will not necessarily be true of the Finnish subcorpus. The grammar, sentence structure, and vocabulary of the translations may be influenced by the source text. This means that the corpus will be “asymmetrical”, centered on the Russian language. It would have been interesting to make an attempt to develop a “symmetrical” corpus – Russian-Finnish + Finnish-Russian. Such a corpus would make it possible to check if a phenomenon found in translations is valid for original prose and vice versa. However, such a task appears to be highly complicated for many reasons. It is possible to make a representative corpus of Russian fiction texts and to “mirror” it in Finnish, because the major Russian works of fiction – both classical and modern – have been translated into Finnish. Some books have even been translated several times. The situation with Finnish literature is quite different. Not many Finnish authors have been translated into Russian, and, unfortunately, sometimes these translations are of poor quality. Some important works have never been translated. Today the situation is changing. Many more translations are being published in Russia, among them some from Finnish. For example, during the last five years, many novels by Mika Waltari (who was considered anti-Soviet in the times of the USSR) have been translated into Russian. Still it seems that a Finnish-Russian corpus would be more difficult to collect; it would be much smaller and quite different in structure, and it would be difficult to align with the Russian-Finnish Corpus. As a point of interest, there might be similar problems in compiling an English-Russian + Russian-English parallel corpus: there are many more translations from English into Russian than vice versa; cf. also the Norwegian experience in extending the bilingual English-Norwegian Parallel Corpus (http://www.hf.uio.no/iba/prosjekt/ENPCmanual.html).



JB[v.20020404] Prn:15/02/2007; 13:58



F: BCT806.tex / p.4 (224-278)

Mihail Mihailov and Hannu Tommola

The principles of text collecting The main principles of collecting electronic texts for a text corpus of fiction are different from those for a text corpus of language or a text corpus of mass-media. Entire texts must be collected rather than extracts. The main idea of sampling is to achieve proportional representation of different kinds of language. In this case, the statistical information acquired from the corpus would be more or less reliable. However, it cannot be 100% reliable, because it is impossible to represent everything that occurs in the language. The main argument against using sampling in fiction corpora is that the users – notably in translation studies – might be interested in specific authors rather than in the facts concerning the language per se (besides, fiction is not “the whole language,” anyway). So it is possible to make parts of the corpora representative, while it is quite impossible to achieve representativeness for the entire corpus. Only approximate balance can be reached. Complete balance is impossible, because novels are longer than short stories; there were periods when numerous important works of fiction were published, and periods when short stories were preferred to novels, etc.

Maintenance of the corpus The basic idea in developing ParRus is to separate the texts from the tags. Usually the corpus software is “anti-intellectual” – all the program can do is to find strings of characters, display them, or perform calculations on them. So, corpora developers have to make explicit all relevant information. This approach seems to have several drawbacks. The two most important are as follows: 1) The texts become unreadable because of the tags. 2) Changes in tagged corpora are always a problem. In the ParRus Corpus the texts are “clean”. They are stored as ordinary text files. All relevant information is registered in a Microsoft Access database. The database is used for data processing as well. Users can obtain concordances for specified word(s) or word combination(s). They can also use the word list for querymaking. It is easy enough to specify context size (in sentences) and comparison mode for the main and the second search key (whole word/start of word/end of word/any part of word) as well as the second search key position (same sentence/next word). The first version of the software was designed for a monolingual corpus. However, it was not difficult to adapt it to a bilingual corpus.

JB[v.20020404] Prn:15/02/2007; 13:58

F: BCT806.tex / p.5 (278-278)

Compiling parallel text corpora

Figure 1. Form for registering and indexing original texts and translations

Figure 2. The Search interface



JB[v.20020404] Prn:15/02/2007; 13:58



F: BCT806.tex / p.6 (278-332)

Mihail Mihailov and Hannu Tommola

While the approach for corpus compiling that we are using is in itself complete, it could still be extended further. We are planning to add lemmatizing routines to the program which will make it possible to build another index – a grammatical one. This will make searches for grammar forms possible as well.

Parallel concordancing Perhaps the most difficult and the most interesting part of the project will be to find out if automated parallel concordancing is possible. The starting point was the idea that although translators change much in the translation in comparison to the original text – they may join or split sentences, change clauses into phrases, omit or add words, use broader or narrower equivalents – still they translate something literally. All words cannot be skipped or translated with an unexpected equivalent, otherwise we would get an entirely new text. The words which in most cases are assumed to be translated literally we shall call keywords. So, we presume that if equivalents for more than half of the keywords of extract A from the original were found in extract B of the comparable size from the translation, then extract B is likely to be the translation of extract A. What word classes shall be keywords? We may have to exclude prepositions, conjunctions, pronouns, etc. We also have to reject words with very broad meanings (e.g. idti “to go”). Some words are frequently used as parts of idioms and are therefore unpredictable in translations (bog “god”). From what is left we also have to exclude words having high-frequency homonyms. For example, we cannot include the Russian noun peˇcen’ “liver” into the Russian-Finnish glossary of keywords, because its Finnish equivalent (maksa) is homonymous in many forms with the verb maksaa “to pay”. Another criterion is word frequency. First, very frequently used words may cause problems because they are everywhere. Second, words occurring only once also have to be excluded: there are too many of them. The most useful words for our research are those with a frequency in the range of 2 to 6. Most of these words and some of the more frequently used words will be included in the list of keywords. Together with Finnish dictionary equivalents, they will form the core of the system.

The system will work as follows: 1) Extract A from the original will be split into words. 2) The keywords will be selected, and the weight of the sample will be calculated. (Each keyword will be assigned a weight of 1 to 3, depending on frequency; the less frequent the word, the more the weight. The weight of the sample will equal the sum of the weights of the keywords).

JB[v.20020404] Prn:15/02/2007; 13:58

F: BCT806.tex / p.7 (332-371)

Compiling parallel text corpora

Text A

Text B

Search for coocurrence of keywords in Text B

Morphological analysis of text A

Keywords A

Keywords B

Keywords dictionary

Figure 3. Search for parallel contexts

3) The Finnish equivalents B1 , B2 , . . . , Bn for the keywords will be looked up. 4) The contexts for each keyword will be looked up and checked against other keywords. For each context, the weight will be calculated. If the weight of context Bx is more than 60% of the weight of extract A, Bx will be considered a translation of A and will be presented to the user. If our hypothesis is true, the program will be able to find parallel places if a) the context is long enough (at present we cannot tell what “long enough” means); b) enough keywords are found; c) the translation is close enough to the original.

Experiments To find out whether the proposed approach is valid, three experiments were conducted.

Experiment 1 A one-page extract from Dostoyevski’s Zapiski iz podpol’ja (Notes from the Underground) and the corresponding part of the Finnish translation by Esa Adrian were taken. A list of pseudostems (e.g., sairas, sairaa- and sairai- for Finn. sairas “ill, sick” to find also all inflected forms, as sairasta, sairaan, sairaita, etc.) for



JB[v.20020404] Prn:15/02/2007; 13:58



F: BCT806.tex / p.8 (371-395)

Mihail Mihailov and Hannu Tommola

Figure 4. A sample keyword list

Figure 5. Parallel concordance

keywords and their Finnish equivalents was compiled manually (see Figure 4). The program was tested with quite satisfactory results (see Figure 5).

Experiment 2 A search for parallel positions for the phrases from the analyzed Russian extract was performed on the entire text of two Finnish translations (by Adrian and by Valto Kallama). The results did not change.

JB[v.20020404] Prn:15/02/2007; 13:58

F: BCT806.tex / p.9 (395-446)

Compiling parallel text corpora

9%

3% 2% 1 occurrence 2 to 6 occurrencies 51 %

35 %

7 to 20 occurrencies 21 to 60 occurrencies More

Figure 6. Word Frequencies in Dostoyevski’s “Notes from the Underground”

Experiment 3 A keyword dictionary was compiled. It was based on medium frequency words (more than 6 occurrences). However, as expected, the results of searches based on such a small word list (about 200 keywords) appeared to be unreliable. In most cases, nothing was found. However, no incorrect matches were produced, either. This failure can be easily explained by the frequency distribution of the words (see Figure 6). The words used for compiling the keywords dictionary were only a bit more than 10% of the entire word list. To make the keyword search reliable, keywords should be chosen from among words with frequencies from 2 to 20, which would comprise about 40% of the total word list.

Applications of the parallel text corpus The parallel text corpus would be useful in the fields of contrastive studies, translation studies, and bilingual lexicography. It would make it possible to check how a word is actually translated, which is sometimes quite different from what is expected according to the dictionaries. It will be easy to find translations of quotations. It would also be quite possible to monitor usage of certain grammatical forms or constructions and ways of translating them into another language.

References Fillmore, C. 1992. “ ‘Corpus linguistics’ or ‘Computer-aided armchair linguistics’. ” In Svartvik (ed.). (1992): 35–60. Svartvik, J. 1992. “Corpus linguistics comes of age.” In Svartvik (ed.). (1992): 7–13. Svartvik, J. (ed.). 1992. Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82. Stockholm, 4–8 August 1991. (= Trends in linguistics 65). Berlin: de Gruyter. The English-Norwegian Parallel Corpus. http://www.hf.uio.no/iba/ prosjekt/



JB[v.20020404] Prn:31/05/2007; 9:28

F: BCT807.tex / p.1 (46-151)

Data-derived multilingual lexicons John McH. Sinclair This paper first appeared in Arcaini (ed.) 2000: La Traduzione (IV). Quaderni di Libre e Riviste d’Italia, 43; Roma: Ministerio per i bene e le attività culturali. For this publication it has been lightly revised, and the bibliography updated.

This is an interim report on a group of research and development initiatives that are all heading in the same direction, but taking different routes. Their origins are varied, and they are of different vintages. The common aim is the development of multilingual lexicons which can claim to be “data-derived”, in one way or another. Each has a set of unique features to offer, and each will develop along a different timescale and use different resources. Eventually, the three may be harmonised in a single approach which will provide the basis of a powerful, semi-permanent lexicon (but the role of such a device in applications must be carefully defined and circumscribed – see the concluding comments below). Everyone who works with machine-oriented lexicons is aware of their serious problems of credibility. Very expensive to compile, their performance in applications is dismal, so dismal that their fundamental strategies must be questioned and their accountability must be improved. By “accountability” is meant their ability to provide a satisfactory analysis of open text. There are many schemes which are attractive conceptually but whose utility cannot be assessed because, for one reason or another, they cannot be put to work on a range of sample text material. Mostly this unfortunate position arises because they are unfinished – either in breadth of coverage of the vocabulary of a language or in the absence of a working link between the abstractions and the realisations. It has long been normal practice to develop lexicons in the absence of data verification, but there are now signs that evidence from text corpora will be incorporated in the lexicons of the future. This is a step forward, welcome despite the posturing that sometimes goes with it,1 and I will return to it later in the paper. . Scholars who until recently have preferred to ignore the large text corpora that have been available for many years find it hard to perform the necessary volte-face with dignity. For

JB[v.20020404] Prn:31/05/2007; 9:28



F: BCT807.tex / p.2 (151-189)

John McH. Sinclair

The three relevant lines of work that form the group of my interests are: 1) translating monolingual dictionaries: The Bridge Dictionary Project 2) analysing parallel corpora in different languages: the TELRI project and the Bonomia corpus 3) analysing comparable corpora in different languages: the Council of Europe Project I will briefly describe these three ventures and then relate them to the aims of improving multilingual language reference resources.

.

The Bridge Dictionary Project

This is a means of exploiting an existing resource that is already quite sophisticated. The Cobuild range of dictionaries have two unique features relevant to this argument – they are “corpus-driven” in that unlike all other dictionaries to date they attempt to make the entries and the dictionary structure reflect the evidence found in large text corpora, and the definitions are written in ordinary English sentences (Sinclair 1987). The first of these features guarantees that explicit links are maintained between the corpus and the definitions, examples, etc.; since the links are managed by lexicographers, they contain a wealth of human judgement, and so are not formal links, but they embody a very large amount of careful observation and assessment of corpus evidence. The second feature of a Cobuild dictionary, full-sentence definitions, makes them readily translatable, as against the fragments, abbreviations, and arcane conventions of traditional lexicography. A translatable dictionary such as Cobuild is related to data via a large amount of human endeavour (approx. one person-century, in this case) and is thus more readily available, semi-digested, for people or computers. It is, however, only example, the organisers of a workshop at a large NLP conference in Granada, Spain in May 1998 put in the blurb an attempt to rewrite history as follows: The divergence of corpus usages from lexical norms has been studied computationally at least since the late Sixties, but only recently has the availability of large on-line corpora made it possible to establish methods to cope systematically with this problem.

This is a strange statement, since corpora of adequate size have been available since the late sixties. Corpus usage has been studied since the early sixties (e.g. Sinclair et al. 1970). However the studies of divergence mentioned above (without references) made no detectable impact on the concepts and methodology of lexicon building. An emerging branch of research is now involved in studies and experiments on corpus-driven linguistics.

Corpus-driven linguistics is not emerging; it has been flourishing for many years, but was systematically ignored until recently by the mainstream of computational linguistics.

JB[v.20020404] Prn:31/05/2007; 9:28

F: BCT807.tex / p.3 (189-244)

Data-derived multilingual lexicons

indirectly related to language data; the effect of human intervention is not open to scientific investigation and so puts a ceiling, albeit a high one, on the ultimate applicability of the work to the elaboration of formal systems. A one-volume dictionary that claims coverage of a whole language has to compress and prioritise a great deal, and at present, it is a process that has to be carried out by highly-trained lexicographers. Later dictionaries from Cobuild compressed and selected still more, until in 1990 the Students Dictionary was published, carrying about 31 000 senses. It is fair to assume that this book contains the elements of a “core vocabulary” for English. In these dictionaries, in almost every case, each definition is one sentence; very occasionally the information is spread over two sentences to avoid complex constructions, because this small dictionary was produced for learners who might have very little English at the start. The sentence form was devised, against a centuries-old tradition, because of the perceived needs of learners. Whereas a native speaker may manage to decode the unusual language of normal dictionary definitions (though it is suspected that many do not), a user who is in the process of learning the language in question will have great difficulty; but if the definitions are presented in fairly simple sentences (and without the special terminology and phraseology that they usually contain), the foreign learner can use his or her knowledge of the language acquired so far and will have at least a better chance of success. At the time these decisions were being taken, there was no thought of translating the dictionaries; they were compiled as monolingual works. Soon after first publication in 1987, however, interest grew in the application of corpus methodology to bilingual lexicography. Also, in the rapidly expanding field of information science, teams began to experiment with the automatic extraction of linguistic information from dictionaries (Fontenelle 1995). The full-sentence definitions proved to be particularly useful in this work. The sentence is a safe unit for translation; it may not be the smallest viable unit nor the ideal one, but it is a unit that is familiar to translators, whether human or mechanical, and it has a predictable value in both the source and target languages. A dictionary whose definitions are written in ordinary sentences of the language has a head start in translation over one whose definitions are less explicit and less conformant to the norms of the general language. Thus the idea was born in 1988 of making a new kind of dictionary, neither a monolingual one (since the definitions would be in the source language of the user) nor a bilingual one (since there would be only one headword list, that of the target language). The language chosen was Brazilian Portuguese, and the pioneering work was done by members of staff of the Cultura Inglesa in Rio de Janeiro.



JB[v.20020404] Prn:31/05/2007; 9:28



F: BCT807.tex / p.4 (244-293)

John McH. Sinclair

The rationale and methodology is discussed in Baker and Kaplan (1994), and the dictionary was called a “Bridge” dictionary. It kept the explanatory quality of a monolingual dictionary but made the definitions accessible to an early learner; only rarely was a translation equivalent, the mainstay of the bilingual dictionary, used; in fact only when it appeared that the two languages were so close in their semantics and usage that an explanation would amount to pure verbosity. Shortly after this project began, it was pointed out by Prof. Helmut Schnelle that the full-sentence definitions had another quality besides translatability – they were in a form that was capable of logical regimentation (Schnelle 1995), and therefore formalisation. In principle, from such dictionaries, lexicons could be derived automatically. Compilers had been given no restrictions on their use of language in the definitions, but were given a lot of guidance about the needs of the user (the notion of a pre-determined wordlist that definitions were confined to was rejected because the results of compiling under such restrictions were felt to be unsatisfactory). The output of this process is a sublanguage which is parseable by a simple parser (Barnbrook and Sinclair 1994). At this time, the normal lexicons for machine applications were still regarded by many as capable, with development, of giving a satisfactory performance for unrestricted manifestations of languages; but they were extremely labour intensive, and so were expensive and lengthy to undertake. The prospect of deriving information from a Bridge dictionary was attractive, and a feasibility study to examine this was commissioned by the EU (Sinclair et al. 1994). That study established that an automatic procedure could be devised that would derive lexical information from the full-sentence definitions and express the information as formal statements. Meanwhile other teams were beginning to work on Bridge translations, and the prospect arose of multilingual work of a type that has since been called the “hub and spoke” model (Martin 1998). In this model, one language is designated the “hub”, and the other languages all relate bilingually to the hub language. So two languages on “spokes” are related indirectly to each other via the hub. For the Bridge project, the hub language had to be English simply because the original headword list was worked out for English, and not for any other feature of the language. The entry condition for entering the Bridge project is to provide translation equivalents (TEs) in a new language for the 31 000 senses of the original dictionary. These can immediately be matched with definitions in any of the participating languages, providing an instant one-way bilingual dictionary. For example, a list of TEs in Italian can be linked, through the English headword list, to the existing definitions in Portuguese, and a Brazilian will have a serviceable small Portuguese-Italian dictionary for negligible effort; a similar “instant” dictionary will be available to other participating languages. If the Italian team then offers

JB[v.20020404] Prn:31/05/2007; 9:28

F: BCT807.tex / p.5 (293-348)

Data-derived multilingual lexicons

translations of the definitions in Italian, then the other half of the bilingual, the Italian-Portuguese half, can be derived, using the Portuguese TEs and the Italian headwords. It can be seen from this example that as the number of participating languages grows, a synergy is created of great potential. If Italian were the twentieth language to join, then nineteen new bilingual dictionaries would be derivable for the cost of a brief edit. If Italian were the hundredth language to join . . . The prospect is very appealing, having almost the charm of getting something for nothing. There are some problems, but none as yet that threatens the logic of the project. Of the problems, some are a consequence of having English as the hub. Whatever language found itself in that position there would be a cultural bias introduced, and the design for development of the Bridge project contains provisions for extending the headword list as corpus information becomes available for other languages in order to achieve a better balance. Another problem concerns the examples, which are taken straight from The Bank of English, the corpus at Birmingham. For some languages, there is pressure to provide translations of these, because some may be difficult for early learners; but whether these translations can be reliably used in language pairs where English is not one of the languages is very doubtful. A superior, but more costly procedure would be for each language to provide examples for its own headword list. Other problems concern the commercial pressures that affect the uniformity of the bridge concept. At the beginning, it was hoped that the prospect of cheap dictionaries, especially for language pairs that had little independent commercial appeal, would ensure that the minimum conformity would be maintained; but market pressures are dictating otherwise in some markets, for example, in countries which have a high standard of English in the community, the smallest Cobuild dictionary is considered too small, and parallel projects are translating the larger books, leaving a substantial editing job to match the 31 000 definitions. Also Cobuild has recently published a new edition of the Students Dictionary, which does not exactly match the original, and new publishing projects are obviously attracted to the latest version – it does not seem to make commercial sense to maintain consistency with an older text unless one can look ahead to the benefits that would accrue eventually. Despite these and other problems, work goes ahead in several languages; the Bridge project is a recognised activity of TELRI (Trans-European Language Resources Infrastructure), and progress reports appear regularly in TELRI publications (see Bibliography). At present, the aim is to achieve publication of a number of translations, after which the next stage will be to study the operation of the hub-and-spoke model to achieve synergy. All this effort is within the area of paper dictionaries for humans. In a parallel development, it is planned to return to the question of making multilingual



JB[v.20020404] Prn:31/05/2007; 9:28



F: BCT807.tex / p.6 (348-400)

John McH. Sinclair

machine lexicons, building on the earlier project. Again, the principle is straightforward, because if all participants keep closely to the job of translating the definitions, then (a) parallel parsers for the language of definition in other languages would be fairly easy to write, and (b) the 31 000 senses would need only slight adaptation for each language. Here, then, is a simple way of building basic formal lexicons for an unlimited set of languages, able to absorb additional material derived from corpus research. It will fit into familiar structures such as HPSG (head-driven phrase structure) and TFS (type-feature structure) but does not require all the formal complexity that they generate. As a stand-alone system, it is worth building and testing, but it also has a role as one of the three linked ventures outlined in this paper. The co-ordination of these will be dealt with later.

. Analysing parallel corpora: TELRI and Bonomia The second line of research involves a combination of parallel corpora and bilingual dictionaries, providing a powerful platform for lexical research. Parallel corpora are corpora related to each other through the translation process; either all except one are translations of the one, or all are translations of an original which is not represented. The corpora are analysed in a very specific way when the goal of analysis is to retrieve information like translation equivalence; the analysis tries to match, or align, segments of parallel texts so that it becomes fairly easy to isolate a word or phrase in each of the texts which are likely to be translations of each other. Ten years of work has produced a number of aligners, software programs that attempt to align parallel segments, but there are still many problems to be overcome before the techniques realise their potential.2 Recently, Teubert (1996) pointed out that the deficiencies of aligners might be overcome by combining them with simple bilingual dictionaries, which have complementary deficiencies. A bilingual dictionary which merely offers a number of possible TEs without comment is not very helpful in normal use, but coupled to an aligner it might improve the speed and accuracy with which the system isolates the particular translation equivalent for the context. Most alignment software is founded on the hope that translators maintain the same sentence and paragraph boundaries in their translations; this was a reasonable hope with the early parallel material that came to hand – the Proceedings of the Canadian Parliament, in French and English – but it cannot be generally . An up-to-date survey is to be found in Veronis (ed.) 2000.

JB[v.20020404] Prn:31/05/2007; 9:28

F: BCT807.tex / p.7 (400-443)

Data-derived multilingual lexicons

assumed. The stylistic conventions and norms of different languages affect such matters as the length and complexity of sentences and paragraphs, and within languages, the conventions of different genres provide further reasons for varying the position of sentence and paragraph boundaries. For example, the meaning of a sentence boundary in legal language is inherently doubtful, and some styles of legal writing make as few sentence breaks as possible.3 When the University of Bologna launched its project to build a parallel corpus of Italian and English legal texts (Favretti 1998), it faced the task of marking up a large amount of text – over 20 million words in the first instance – with sentence and paragraph boundaries by hand, ensuring that they fell at equivalent points in the two texts. Such a task was plainly a ridiculous waste of time and effort, and would permanently hamper development of the corpus. At this time, the TELRI project was reviewing the whole area of parallel corpora and alignment software, with the object of providing a multilingual research platform for all the languages of Europe; the progress of this work can be traced through the TELRI Newsletters and Seminar Reports, marked by the publication of a CD-ROM (Erjavec, Lawson, and Romary 1997). This work included an extensive survey of alignment software, including demonstrations of most available packages; it also provided opportunities for new proposals for software to manage parallel corpora and for a number of novel studies of parallel texts. If sentence and paragraph boundaries cannot be relied upon, what can take their place? The search is on for alternative linguistic features, which may not be as vulnerable to alteration during translation, and some bundle of which may provide a sufficiently detailed and reliable dissection of the corpora for individual words and phrases to be compared. This raises the question of why one should align parallel corpora in the first place; it has been taken for granted for a decade that alignment is the only agenda. Returning to the beginnings, parallel corpora were seen as a resource because they contained the results of the translation process; and if this information could be retrieved, there would be great benefit for the understanding of translation and the automation of it. Each act of translation is interesting in itself and worth study, and there is a substantial literature of commentary on individual acts. The potential of parallel corpora was to be able to examine many instances of related acts, for example, where the same word or phrase in one language is translated differently on different occasions. Alignment was the first answer to the question of how, . This discussion takes no account of spoken transcriptions, where sentence and paragraph boundaries are added in transcription, and where change of speaker and pauses are the only likely boundaries in the sound wave.



JB[v.20020404] Prn:31/05/2007; 9:28



F: BCT807.tex / p.8 (443-498)

John McH. Sinclair

given millions of words in each language, stretches of text that were translations of each other might be isolated. Assuming that the very simple, recurrent and recognisable signals of sentence and paragraph boundaries survived the translation process, parallel texts could be marked up in such a way as to identify parallel sentences, within which it could be shown that individual TEs could be determined with a good chance of success. At this time, corpora were still fairly small, and a lot of repetitive human effort went into all kinds of annotation and mark-up, so it was understandable that the methodology of alignment was adopted for translations which were not as strictly parallel at sentence and paragraph level than the original material – it only required some “preprocessing”, which was fashionable anyway. No-one seemed to stop and consider that the preprocessing of corpora for alignment was in fact – alignment. Nowadays, all preprocessing is suspect, because it is normally a euphemism hiding a lot of low-level human labour that shores up inadequate software; but a distinction can be made. Automatic parsing, for example, often requires that its input takes the form of a string of well-formed sentences, and therefore most naturally occurring text has to be tidied up in manual preprocessing; since the two activities of assigning sentence boundaries and parsing are significantly different, this is an understandable, if not a desirable, procedure. But in alignment the manual and the automatic processes have identical goals; so it is clear that the one just covers for inadequacies in the other. There is an alternative way of seeing the problem. Given that the aim is to find expressions in parallel texts which are translations of each other, it may not be necessary to align the texts at all, but rather to pursue each TE as it is required. Such a strategy is in line with a general trend – rather than prepare analyses in advance, which with slow computers and poor software was necessary, it is regarded as preferable to devise software which will do its analysing in real time and only needs to be applied when and where it is found necessary. Let us assume for a moment that a text and its translation are exactly the same length in words, although this does not happen in life. If word w in text A is at position n, whereabouts is its translation likely to be in text B? The intuitive answer is it will not be far from position n in that text, and that it is likely to be in the same sentence as the word at position n. So by just counting along the texts, we should be able to get near to TEs. There are snags. First of all, the text and its translation will not be the same length; the difference in length, counted in words, can be substantial. But it is a simple matter to relate texts of different word length together and for each position n in one text find position n in the other. Also, position n may be at or close to a sentence boundary, and just possibly the TE may lie in an adjacent sentence; it may be preferable to open a window

JB[v.20020404] Prn:31/05/2007; 9:28

F: BCT807.tex / p.9 (498-548)

Data-derived multilingual lexicons

measured in words without taking sentence boundaries into consideration; with a little piloting, the size of the window can be optimised. Once the researcher is reasonably sure that a short stretch of text B contains the TE of word w, then a straightforward link can often be made using a simple bilingual dictionary, regardless of whether or not it offers several senses, or several TEs per sense, or both; the chances of another word in the window matching any of the TEs offered are very low. If there is no match found, then the problem is no different from that faced by an aligner – such variables as pronominal substitutions and word-to-phrase and phrase-to-word translations have to be taken into account. The provisional name for this approach to harnessing the information in parallel corpora is FYP – Find Your Place. It is intended solely for the retrieval of records of translation decisions for use in other procedures, although it may be helpful for other applications, or to provide confirmation of alignment in difficult cases. The Bonomia Corpus Project has as an early aim the examination of the translation of words and phrases for key legal terms, because, although standard TEs exist for English words like contract, tax, they are not used in every instance, and contextual features can be found which help to determine the translation favoured. FYP software is being developed for use in this project. There are some reservations about the reliability and value of this approach to acquiring multilingual information. Neither translators nor lexicographers are guaranteed to be consistent, and the bilingual dictionaries available at present are not based on corpus analysis, so there may be gaps and discrepancies that are difficult to sort out. However, the output from this process is non-final, and the weaknesses just transfer some problems elsewhere in the system.

. Analysing comparable corpora: The Council of Europe Project Some ten years ago, the corpus holdings of a number of institutions representing different European languages had become sufficiently large to be used as comparable corpora, in that they were all designed to be samples of the current state of their language. With help from the Council of Europe, exploratory work was started to see how the verbal environment of a word might affect its translation. Seven languages were involved: Croatian, English, German, Hungarian, Italian, and Spanish. Each of the tasks carried out by the project teams involved retrieving a concordance to a number of words and exchanging the concordance for one that purported to be of the nearest TE in another language. Concordances were limited to 150 lines, and a bilingual with translation experience provided a translation



JB[v.20020404] Prn:31/05/2007; 9:28



F: BCT807.tex / p.10 (548-598)

John McH. Sinclair

for each instance of the node word in the concordance. Reports were prepared on patterns found in this process, and the overall conclusion at the end of 1991 was that much of the patterning observed was capable of automatic recognition and therefore might be helpful in machine-assisted translation. In 1994, a small invited seminar in Malvern, England brought together researchers from most of the original teams and some others and reviewed the situation; from this seminar came a detailed report (Sinclair et al. 1996). The implications of this report are that, while largely computationally tractable, the job of programming a computer to recognise and distil relevant patterns from the wide range of realisations is onerous and will require very large amounts of data to be processed. Furthermore, the program will have to learn as it gathers experience in order to reduce gradually the number of cases where the immediate environment does not supply adequate clues to the meaning. Some pilot programming has been done, which tends to confirm the points made above. Of course, when only a few pairs of words have been described in terms of their cross-language environments, it is not possible to assess the synergy that will arise when many words have been so described. But it is not strictly accurate to talk of pairs of words, because in most cases the relationship of a given word w in language A to TEs in language B is one-to-many; any one of these TEs, when examined as a whole, produces a similar pattern in the opposite direction, and so on – hence, a network of translation relationships is rapidly set up. The principal advantage of this method of retrieving multilingual information is that, compared with the other two, much more of the procedure is automatable. It can be made largely independent of the decisions of lexicographers or translators and is replicable and, in the main, “language-independent”, in that it can be tuned to any pair of languages. The serious block to progress in this area is the labour of organising the raw material from which the computer can build equivalence networks.

. Co-ordination of the approaches Of the three approaches that have been presented here, the Bridge dictionaries can provide multilingual lexical information most quickly and cheaply; the drawbacks are likely to be size and coverage. As to size, there are larger Cobuild dictionaries that share the same full-sentence definitions, and at least one of these has been translated (Zettersten 1998); the aspect of size that correlates with the number of headwords can in principle be extended considerably. The problem area is that, no matter how ingenious the phrasing, it is not possible to reflect all the observable patterning in a normal sentence; hence, there will be areas of inexplicitness.

JB[v.20020404] Prn:31/05/2007; 9:28

F: BCT807.tex / p.11 (598-653)

Data-derived multilingual lexicons

On the other hand, when the Bridge information is co-ordinated with the other approaches, it will provide an excellent check on their automated or semiautomated routines. For example, in the FYP strategy for parallel corpora, if there is some doubt in a particular case, reference to the Bridge definitions may solve the problem. The analysis of parallel corpora will provide the bulk of the contextual information on which the multilingual networks will be built. Instead of the labourintensive method of arriving at translations in the Council of Europe project, the FYP software and other processing routines should provide tentative contextual templates which can act as initial hypotheses for the analysis of comparable corpora; where they are confirmed, it can be assumed that any distortion caused by the use of translated texts has been taken into account. It should also be borne in mind that a dictionary is a text in its own right with an elaborate structure; a set of translated dictionaries form a unique kind of parallel corpus. Because of their special properties, the Bridge texts can offer a convenient stepping-stone between “sublanguages” and open text. The properties of the English version include: – – –

being a restricted language for which there exists a comprehensive parser having a large general vocabulary – some 31 000 defined senses having a form for which pilot formalisation procedures have been worked out.

. Conclusion The approaches to multilingual lexical semantics outlined above show both a range of different approaches and an opportunity to bring them together so that they can each benefit from each other. Progress is slow because funding is intermittent in all three ventures, but they offer vital information about languages and translation that originates in the examination of extensive corpus material and is not available from any other source. Surveying the field as a whole, it is clear that progress is held back and operational results are depressed by three serious weaknesses, which will merely be outlined at this time. (1) Monolingual descriptive inadequacies. Bilingual and multilingual services should be built on a framework of sound monolingual descriptions – otherwise the problems are confounded. The low performance of monolingual descriptive systems confronted with the job of analysing ordinary communication is noted with regret. In particular, in the context of this investigation, the lack of identification of operational units of meaning leads to the predicament that translators, whether human or machine, do not have a reliable base



JB[v.20020404] Prn:31/05/2007; 9:28



F: BCT807.tex / p.12 (653-733)

John McH. Sinclair

and so cannot be expected to bear the full blame for poor operational efficiency. In such a situation of defective descriptions and inadequate strategies, the human beings tend to do better, as is the case in translation today. (2) There is a tendency to accord the status of quasi-axioms to positions from earlier theory which are at best unsubstantiated by current standards of investigation – for example, the indispensability of an aligner or the orthodoxy about ambiguity or anaphora in language structure; these impede our ability to derive insights from the data now available. (3) Whatever lexical information is retrieved from these processes and stored in a lexicon, it will never be capable of describing open text. It is now clear that no semi-permanent (“stable”) lexicon is functionally adequate for assigning lexical meaning to units of ordinary text, because of the capacity of text to create syntagmatic meanings that cannot be precisely anticipated and may in some instances contrast sharply with stored lexical information. Another resource must be created in the form of a device which will accept as one input the information from a stable lexicon and will identify text-created meanings as they occur, treating them as a parallel input. It will then reconcile its two inputs and output a lexico-semantic interpretation. The new device can be called a Textual Sense Assigner (TSA). It can be seen that information from the cross-language semantic networks that are created by the third approach can be readily distributed into either the stable or the diagnostic parts of the TSA; in fact, the conventional lexicon appears from this perspective to be little more than a collection of recurrent information which is easier to store than to recreate on each occasion, although it may not all be relevant every time. Experience in practice will tell if it is necessary to have such lexicons at all.

References Bridge Dictionaries Collins Cobuild Student’s Dictionary; Bridge Bilingual Portuguese. 1995. London: HarperCollins. ˇ Anglicko-Ceský Výkladový Slovník. 1998. Nakladatelství Lidové Noviny. Angleško-slovenski slovar BRIDGE. 2000. Ljubljana: DZS. Cobuild New Student’s Dictionary (Thai). 2001. Bangkok: Amarin. Semi-Bridge Engelsk Dansk. 1999. Copenhagen: Politikens Forlag. Cobuild English-Chinese Language Dictionary. 2000. London: HarperCollins. Englannin opiskelijan sanakirja. 2001. Helsinki: Otava. Baker, M. and R. Kaplan. 1994. “Translated! A new breed of bilingual dictionary.” In Babel, Vol. 40, No. 1. Barnbrook, G. and J. Sinclair. 1994. “Parsing Cobuild Entries.” In Sinclair et al.

JB[v.20020404] Prn:31/05/2007; 9:28

F: BCT807.tex / p.13 (733-817)

Data-derived multilingual lexicons

Cobuild Students Dictionary. 1990. London: HarperCollins. Favretti, R. 1998. “Using Multilingual Parallel Corpora for the Analysis of Legal Language: The Bonomia Legal Corpus.” In TELRI-S3. Erjavec, T., A. Lawson, and L. Romary (eds.). 1997. East meets West – A Compendium of Multilingual Resources: Mannheim, The TELRI Association. (CD-ROM). Fontenelle, T. 1995. Turning a Bilingual Dictionary into a Lexical-Semantic Database: Unpublished Doctoral Thesis, University of Liège, Faculty of Philosophy and Letters. Martin, W. 1998. “Linking in bilingual lexicography and the hub-and-spoke model.” Paper given at IDS/TWC Conference on multilingual lexical semantics. Schnelle, H. 1995. “The Logic of Cobuild-type Dictionary-Semantics.” In TEXTUS, Vol. VIII No. 2. Sinclair, J. (ed.). 1987. Looking Up. London, HarperCollins. Sinclair, J., S. Jones, and R. Daley. 1970. English Lexical Studies. London, Office for Scientific and Technical Information (OSTI). (= OSTI Report 5060). Sinclair, J., M. Hoelter, and C. Peters (eds.). 1994. The Languages of Definition: Luxembourg, European Commission. (= Studies in Machine Translation and Natural Language Processing 7). Sinclair, J., J. Payne, C. Hernandez (eds.). 1996. Corpus to Corpus: A study of Translation Equivalence. IJL, Vol. 9, No. 3, Autumn 1996. TELRI-N (Trans-European Language Resources Infrastructure): Newsletter 1ff. (1995ff.). Edited by Eva Hajicová and Barbara Hladka. Prague, Institute of Applied and Formal Mathematics, Charles University, and Mannheim, Institut für Deutsche Sprache. [Electronic version: http://www.telri.de]. TELRI-S1 (Trans-European Language Resources Infrastructure): Proceedings of the First European Seminar “Language Resources for Language Technology,” Tihany, Hungary, September 15th and 16th 1995. Edited by Heike Rettig in collaboration with Julia Pajzs and Gabor Kiss. Budapest and Mannheim 1996. TELRI-S2 (Trans-European Language Resources Infrastructure): Proceedings of the Second European Seminar “Language Applications for a Multilingual Europe,” Kaunas, Lithuania April 17th–20th 1997. Edited by R¯uta Marcinkeviˇcien˙e and Norbert Volz. Kaunas and Mannheim 1997. TELRI-S3 (Trans-European Language Resources Infrastructure): Proceedings of the Third European Seminar “Translation Equivalence,” Montecatini Terme, Italy, October 16th–18th 1997. Edited by Wolfgang Teubert, Elena Tognini Bonelli and Norbert Volz. The Tuscan Word Centre and Mannheim/TELRI Association 1998. Teubert, W. 1996. “Deutsch-Franzöische Verständigung. Ein Übersetzungswerkzeug für das 21. Jahrhundert.” In Sprachreport 2/96, 7–11. Veronis, J. (ed.). 2000. Parallel Text Processing. Dordrecht: Kluwer. Zettersten, A. 1998. “Bridging the Gap: The Danish Connection.” In TELRI-S3.



JB[v.20020404] Prn:15/02/2007; 14:09

F: BCT808.tex / p.1 (46-110)

Bridge dictionaries as bridges between languages Hana Skoumalová

Bridge dictionaries are a new sort of dictionary for learners of English. They are based on the monolingual Cobuild learners’ dictionaries, and they are partly translated – they contain translated definitions and translation equivalents. This paper shows the possible ways of exploiting Bridge dictionaries for creating new bilingual or multilingual dictionaries. One possible way is to extract corresponding translation equivalents, edit them, and make a new printed dictionary. As both sides of such a dictionary were originally created as translations from English, the dictionary requires quite a lot of lexicographic work. Another possibility is to create an electronic version of the dictionary “as is”. For this purpose, it is necessary to convert the dictionary first into SGML format and define its DTD. This format can then serve as a standard for future Bridge dictionaries and adding new language modules to existing dictionaries would be quite easy.

Introduction Bridge dictionaries are a new sort of dictionary for learners of English. They are based on the monolingual COBUILD learners’ dictionaries, and they are partly translated – they contain translated definitions and translation equivalents. Bridge dictionaries can be used as a source for new bilingual dictionaries by combining the translation equivalents in the target languages. In my paper, I want to discuss the perspectives of such an approach and the problems that can occur. The idea of creating new bilingual (or even multilingual) dictionaries from Bridge dictionaries is described in Sinclair (1995), but as far as I know, this is the first attempt to create such a dictionary. I have at my disposal the full electronic version of the English-Czech COBUILD dictionary and letters A and G from the English-Lithuanian version of the dictionary. I experimented with alignment of the two dictionaries and with

JB[v.20020404] Prn:15/02/2007; 14:09



F: BCT808.tex / p.2 (110-195)

Hana Skoumalová

extracting the correspondent entries, but I also could examine the completeness of Czech translation equivalents.

Format of the input The entries in the printed Bridge dictionaries have the format similar to entries in any other dictionary: bridge /brıdŠ/, bridges. 1 COUNT N A bridge is a structure built over a river, road, or railway so that people or vehicles can cross from one side to the other. ♣ Bridge je konstrukce postavená pˇres rˇeku, silnici nebo železnici, aby lidé nebo vozy mohli pˇrejíždˇet z jedné strany na druhou. ♥ most. . . . the little bridge over the stream. 2 COUNT N Something that is a bridge between two groups or things makes it easier for the differences between them to be overcome. ♣ Co je bridge mezi dvˇema skupinami nebo ˇ vˇecmi, usnadnuje pˇrekonávání rozdílu mezi nimi. ♥ most. We need to build a bridge between East and West. 3 COUNT N The bridge of a ship is the high part from which the ship is steered. ♣ Bridge lodi je vysoké místo, odkud se kormidluje. ♥ mustek. 4 UNCOUNT N Bridge is a card game for four players. ♣ Bridge je karetní hra pro cˇ tyˇri hráˇce. ♥ bridge.

Every entry starts with the headword, its pronunciation, and morphological information. The meanings are numbered and they contain grammatical information, English definitions, and examples. In addition, they contain translations of the definitions to the target language and translation equivalents of the headword. The translated definitions contain the English headword placed according to word-order rules of the target language. The electronic versions of the dictionaries are tagged, so it is possible to distinguish automatically various parts of lexical entries, as headwords, pronunciation, English definitions, translated definitions, translated equivalents, examples, etc. The main purpose of the tagging, however, is not to show the structure of the dictionary but to instruct the printer about new lines, font changes, etc. This can cause problems during the processing, because not all authors employ the prescribed format. [EB] [LB] [HW]bridge [PR]/br*!id!z/, [IF]bridges. [LE] [MB] [MM]1 [GR]COUNT N [DT]A [HH]bridge [DC]is a structure built over a river, road, or railway so that people or vehicles can cross from one side to the other. [KD][HH]Bridge [DC]je konstrukce postavená pˇ res ˇ reku, silnici nebo

JB[v.20020404] Prn:15/02/2007; 14:09

F: BCT808.tex / p.3 (195-295)

Bridge dictionaries as bridges between languages

železnici, aby lidé nebo vozy mohli pˇ rejíždˇ et z jedné strany na druhou. [KQ]most. [XB] [XX]...the little bridge over the stream. [XE] [ME] [MB] [MM]2 ............

In the above example, we can see a part of the entry bridge. The mark-up of the original Cobuild dictionary was preserved; only several new tags were added for the Czech part of the dictionary. The new Czech tags are [KD] for translated definition, [KQ] for translation equivalents and [KR] for translated register notes. The structure of translated definitions is the same as in English – even the same tags are used ([HH] for the quoted headword, [DC] for continuation of the definition, etc.). In the Lithuanian dictionary, however, a different structure of the translated definitions was introduced: [DT]You can use [HH]a [DC]or [HV]an [DC]instead of the number ’one’. [AL][AH]A arba [AH]an galima vartoti vietoj skaiˇ ciaus ’vienas’. [AT]Vienas

The tags [AH] mark the next word as an English headword. The tag [DC] for continuation of the definition is not used – the space has the meaning of closing tag. This can cause problems in parsing and further processing of the dictionary, as we can see in the following example: [QQ][QS]See also [QH]current affairs, [QH]state of affairs. [AZ]Ži¯ ur˙ eti [AH]current [AH]affairs, [AH]state [AH]of [AH]affairs.

These two lines define cross-references in the dictionary. In the English version, it is transparent that current affairs and state of affairs are two targets of the cross-reference. In the Lithuanian translation, however, it is difficult to discover how many targets there are and where a boundary runs between them. To avoid this sort of problem, it seems reasonable to insist that authors should follow the original English mark-up in the translated parts as well.

Extraction of the translations for printed dictionary If we want to create a new printed dictionary from the two Bridge dictionaries, it is necessary to extract the corresponding translations, that is, lines starting with translation tags ([KQ] for Czech, [AT] for Lithuanian, or any other tag chosen by authors of a new Bridge dictionary for other languages): [KQ]neurˇ citý ˇ clen. [AT]Nežymimasis artikelis [KQ]neurˇ citý ˇ clen s významem ’jeden’ (bˇ ehem jedné hodiny). [AT]Vienas [KQ]za, po. [AT]Per.



JB[v.20020404] Prn:15/02/2007; 14:09



F: BCT808.tex / p.4 (295-383)

Hana Skoumalová

[KQ]takový. [AT]Nežymimasis artikelis [KQ]nevázanost, bezstarostnost. [AT]Laisvai, nesivaržant [KQ]faux pas, bota. [AT]Apsirikimas. [KQ]bˇ ežet tryskem, (rychlým) cvalem, cválat. [AT]B˙ egti šuoliais. [KQ]rychlý, prudký, (inflace) pádivý. [AT]Lekiantis. [KQ]zahradník, (allotment, hobby) zahrádkáˇ r. [AT]Sodininkas. [KQ]jít, jet. [AT]Eiti, važiuoti. [KQ]kladný hrdina, klad’as. [AT]Geras veik˙ ejas. [KQ]gymkhána. [AT]Sporto varžybos, lenktyn˙ es. [KQ]gynekologie, ženské lékaˇ rství. [AT]Ginekologija.

Since there can be more than one translation equivalent on every line, it is necessary to split the entries on the left side of the dictionary and sort them alphabetically. Then we can offer the result to lexicographers for further editing. The list of entries that we get is imperfect in many respects. Some of the imperfections are technical ones and can be improved by better tagging, but there are also problems caused by the fact that the left side of the dictionary was created by translation. Some of the problems are listed below: – –

Entries contain only the bare words, without grammatical information, without collocations, valence frames, register notes, usage examples, etc. The translated equivalents may differ in category. This can happen if the English headword is usually used as a part of a collocation and one of the translations is a translation of the whole collocation while the other translation is just a translation of the bare headword. In the English-Czech-Lithuanian dictionary, this was the case of the headword abandon in the collocation with abandon: En: If you do something with abandon you do it in a carefree way. ˚ ♥ nevázanost, Cz: ♣ Jestliže nˇeco dˇeláme s abandon, dˇeláme to bezstarostným zpusobem. bezstarostnost. Lt: ♣ Jei kažkas daroma with abandon tai atliekama ner¯upestingai. ♥ Laisvai, nesivaržant.

–

–

–

The Czech translation equivalents are nouns, while Lithuanian equivalents are adverbs. One of the translations is an idiomatic expression, like the Czech word bota (gaffe), whose main meaning is shoe. After splitting the left-hand side of the dictionary, we can get very strange correspondences. The expressions on the two sides belong to different registers. For example, the Czech expression kladný hrdina is neutral, while the word klad’as (goody) is colloquial; on Lithuanian side, however, we get only one translation geras veik˙ejas. The English words have a broader sense than their translations. For example, the word go is translated as jít (walk) or jet (ride) in Czech and as eiti (walk) or važiuoti (ride) in Lithuanian. In the single-entry dictionary, we get incorrect correspondences jít–eiti, važiuoti and jet–eiti, važiuoti.

JB[v.20020404] Prn:15/02/2007; 14:09

F: BCT808.tex / p.5 (383-457)

Bridge dictionaries as bridges between languages

–

–

–

–

–

In one or both languages, there is an explanation of the meaning rather than a translation. For example, this is the case of articles, which do not exist in all languages. If we get a multi-word expression on the left-hand side, the sorting program cannot distinguish the headword and can place the entry incorrectly. For example, the expression ženské lékaˇrství (gynecology) should be sorted under l rather than under ž. Multiple attributes belonging to one headword are separated by commas, and they are considered separate headwords. In the above example, this is the case of the Czech entry bˇežet tryskem, (rychlým) cvalem (to gallop). The headword is the verb bˇežet which is modified with adverbials tryskem or cvalem. The meaning of the headword may be restricted by an expression in parentheses which is either in English or in the target language: zahradník, (allotment, hobby) zahrádkáˇr (gardener) (inflace) pádivý (galloping). We should be able to distinguish the two cases. The choice of entries is driven by the “pivot” language, that is, English.

The last problem is worth a more thorough analysis – we meet two sources of problems here. The first one is that the dictionary contains many English words that are difficult to translate or whose translations are not good candidates for entries in a dictionary. The reasons are the following: – –

–

The word does not have any correspondent in the target language. For example, this is the case of articles. The word is translated as a multi-word expression (e.g., gallop – bˇežet cvalem – b˙egti šuoliais). Such an entry can occur on the left side of a dictionary if the translation to the target language is one word, but if both languages need multi-word expressions, this entry is unnecessary. The word is a very specific British term (e.g., gymkhana) whose translation should not occur on the left side in the dictionary.

Most of such entries will probably be deleted in the printed version of the new bilingual dictionary. On the other hand, some frequent words of the target languages may be missing: – –

–

Expressions formed with help of support verbs in English like grow old – zestárnout in Czech, give shout – vykˇriknout, etc. Because English is a language that can express attributes not only by adjectives but also with the help of nouns and genitives, many adjectives are missing in the Czech part of the dictionary: dámský – ladies’, cenový – price (Adj), etc. Multi-word expressions in English, like tennis player – tenista; editor in chief – šéfredaktor, etc.



JB[v.20020404] Prn:15/02/2007; 14:09



F: BCT808.tex / p.6 (457-525)

Hana Skoumalová

–

–

For languages which exploit verb aspects, usually only one representative of the whole “nest” is used on the right side. In the reverse dictionary, the other variants are missing and have to be added. Some frequent words of the target language mainly occur in original texts, but in translations, their synonyms are used. Thus in the Czech version of the Bridge dictionary frequent words such as aféra – scandal or uzávˇerka – deadline are missing.

In order to ensure that the resulting dictionary will be well-balanced, the lexicographers who edit the new bilingual dictionary will have to do a thorough revision of the entries on the left side, merge duplicated entries, delete some unnecessary entries, and add omitted words. In my sample dictionary of the English letters A and G, I got ca 4,700 entries for Czech-Lithuanian and ca 3,500 entries for Lithuanian-Czech dictionary from ca 1,500 English headwords (2,700 meanings). The whole COBUILD contains more than 16,000 headwords (30,000 meanings), so the estimated size of the raw Czech-Lithuanian dictionary is 52,000 entries and the size of the raw LithuanianCzech dictionary is 39,000 entries. However, it is now impossible to judge how many entries will be deleted and how many will have to be added.

SGML/XML format As mentioned above, some difficulties in the processing of the input data can be solved by improved tagging of the source. We also need a tool for parsing the input and for alignment of the dictionaries. A good solution is to convert the dictionaries to SGML or XML format and then use existing tools for further processing. After conversion to SGML, we can do the following: – – – – – – –

create canonical DTD of Bridge dictionaries; check existing dictionaries against the DTD and correct errors; add new tags that were not necessary for printing but are necessary for alignment; mark all cross-references in the structure; return the corrected input data; create tools for alignment; create an interface for browsing through the dictionaries (this will be described in the next section).

As was said before, we have to create DTD for Bridge dictionaries and to check the existing dictionaries against it. The content of an entry of the original COBUILD dictionary is described in Sinclair (1987), but 1) its format is not

JB[v.20020404] Prn:15/02/2007; 14:09

F: BCT808.tex / p.7 (525-574)

Bridge dictionaries as bridges between languages

described for all details, and 2) the description does not take into account the translation additions. All entries are not as simple as the example of the entry bridge above: they can contain phrases and phrasal verbs, whose structure is similar to the structure of the whole entry; some parts of the structure are optional (like examples, syntax-change, run-on, etc.); and it is not easy to state their mutual order, etc. Cross-references can occur both inside the lemma part of the entry as well as inside a meaning or a phrases block. Register notes are found in lemmas, in phrases, or in meaning parts after a definition, but they can also occur inside a definition. The creation of DTD thus could not be straightforward; the programs for conversion from the printing mark-up to SGML and the DTD were created simultaneously and they were adapted to each other’s needs. When the Lithuanian (or any other) dictionary is finished, we can check whether the proposed DTD also covers the new member of the club, and we can correct the DTD. I expect that it will only be necessary to add omitted parts without changing the structure dramatically. The result should become the canonical DTD that will serve as a standard for newly created Bridge dictionaries. If it were to be changed in the future, the changes should be mere additions of optional parts, so that the new DTD would be “backward compatible” with existing dictionaries.

Electronic version of the dictionary As we can see above, the way to a classical bilingual dictionary created on a base of two COBUILDs will be neither easy nor fast – it will require quite a large amount of lexicographic work. On the other hand, an electronic version of the dictionary can be published nearly immediately. In such a dictionary, the entire content of the two (or more) source dictionaries will be preserved. In Figure 1, you can see possible user’s interface. The upper window contains the English part of the dictionary with translated definitions. The lower windows contain translation equivalents in the two target languages: in the left window, the entries are sorted alphabetically; the right window contains corresponding entries in the second language. The electronic dictionary can be used in three ways: –

–

As a “classical” monolingual dictionary; only the upper window is displayed, and it contains only the English content – in fact, this is an electronic version of the original COBUILD. As a bilingual dictionary; the upper window contains the English content and translated definitions, and there is only one lower window with translation equivalents. The dictionary is searchable in both directions.



JB[v.20020404] Prn:15/02/2007; 14:09



F: BCT808.tex / p.8 (574-603)

Hana Skoumalová

–

As a multilingual dictionary; the upper window contains the English content and definitions translated to more than one language. The lower window is divided into several sub-windows; each of them contains translation equivalents to one language. Entries in these sub-windows are sorted after the leftmost subwindow. All the subwindows are searchable, so the user has access to all available multi-lingual dictionaries.

When a new Bridge dictionary is created observing the canonical DTD, it can be aligned with the existing dictionaries, and a new language module can be “plugged in”. Thus the number of languages in the multilingual dictionary can be increased without additional effort. In the current version of the dictionary, only search through headwords is possible, but, in the future, the full-text search in the definitions can be added. The full-text search can then make use of the definition parser (Barnbrook and Sinclair, 1999) and can be used for discovering correspondences not only in the translation equivalents but also in the definitions.

Figure 1. Electronic dictionary

JB[v.20020404] Prn:15/02/2007; 14:09

F: BCT808.tex / p.9 (603-635)

Bridge dictionaries as bridges between languages

Conclusion As was shown in this paper, Bridge dictionaries are a promising source for creating new bilingual and multilingual dictionaries. The electronic version of the new dictionary can be produced with a small amount of effort, and it can be helpful for lexicographers working on new printed dictionaries, whether they use material contained in a Bridge dictionary or collect their data in another way. Bridge dictionaries provide an opportunity for new bilingual dictionaries of language pairs that would probably never be created by classical methods.

References Anglicko-ˇceský výkladový slovník. 1998. Prague: Nakladatelství Lidové Noviny. Barnbrook, G. and J. Sinclair. 1999. Specialised Corpus, Local and Functional Grammars. Unpublished manuscript. ˇ Cermák, F. and R. Blatná (eds.). 1995. Manuál lexikografie, Prague: H & H. Sinclair, J. (ed.). 1987. Looking Up: An account of the COBUILD Project in lexical computing. London and Glasgow: Collins ELT. Sinclair, J. 1995. The Bridge Club. Unpublished manuscript.



JB[v.20020404] Prn:15/02/2007; 14:44

F: BCT809.tex / p.1 (45-107)

Procedures in building the Croatian-English parallel corpus Marko Tadi´c

This contribution gives a survey of procedures and formats used in building the Croatian-English parallel corpus which is being collected at the Institute of Linguistics at the Philosophical Faculty, University of Zagreb. The primary text source is the newspaper Croatia Weekly which has been published from the beginning of 1998 by HIKZ (Croatian Institute for Information and Culture). After a quick survey of existing English-Croatian parallel corpora, the article copes with procedures involved in text conversion and text encoding, particularly the alignment. There are several recent suggestions for alignment encoding, and they are listed and elaborated at the end of the article.

.

Introduction

For any kind of research involving two or more languages such as multilingual lexicography, contrastive linguistics, machine translation, etc., parallel corpora are of essential importance. Knowing the role of English today as lingua communis, it is no surprise that the most common pairing of languages in parallel bilingual corpora is English: Lx . This is the reason why we chose English as a pair to the Croatian from the beginning. Many scholars probably do not know that this very language pairing in parallel corpora started more than 30 years ago: Professor Rudolf Filipovi´c launched the Yugoslav Serbo-Croatian–English Contrastive Project 1 in 1968. The preliminary idea was brought to Zagreb by Professor Bujas in 1967 when he returned from Austin, Tx. (Bujas 1967). Until 1971, when the project ended, the Brown corpus was . The ‘Serbo-Croatian’, ‘Croato-Serbian’, or ‘Croatian or Serbian’ was the official name for the Croatian language under communist authorities which tried to unify it with the Serbian language by force and to suppress any kind of Croatian language specifics which were considered dangerous for that unification process. The same name still persists in the Serbian part of former Yugoslavia and in many Slavistic handbooks. This name for the project was the only one allowed at that time.

JB[v.20020404] Prn:15/02/2007; 14:44



F: BCT809.tex / p.2 (107-163)

Marko Tadi´c

acquired, cut in half (505 822 tokens) preserving the original 15 genre balance, and morphosyntactically marked and translated (Bujas 1969: 36). The concordance with morphosyntactic categories as keywords was produced as well as a bilingual sentence database (Bujas 1975: 53). As far as we know, this was the first implementation of computers in contrastive linguistics. Computer data tapes still exist at the Institute of Linguistics, but, unfortunately, it is impossible to find a computer system which would be able to read them – so they are of no practical use today. Nevertheless, the project resulted in a great number of publications, primarily in the field of contrastive linguistics, known as Contrastive Studies, New Contrastive Studies and Chapters in Contrastive Linguistics, all published by the Institute of Linguistics, Philosophical Faculty, University of Zagreb. The second Croatian-English parallel corpus is the translation of Plato’s Republic, published on TELRI CD-ROM (Erjavec, Lawson, Romary 1998); however, the Croatian-English language pair is not the only one, and it was certainly not of the primary interest. Since the whole work is well known to the TELRI community and wider, we will go on with our topic. The third Croatian-English parallel corpus has been collected in the scope of example based machine translation project known to us only by paper from the LREC 1998 conference (Allen & Hogan 1998). The size of that corpus is about 0.85 Mw the Croatian part and about 0.78 Mw for the English part (Allen & Hogan 1998: 749).

. Corpus The Croatian-English parallel corpus, which is now being collected at the Institute of Linguistics at the Philosophical Faculty, University of Zagreb is the fourth Croatian-English corpus pair. Its primary aim was to investigate procedures of text-conversion, corpus collection/organization, sentence alignment, and corpus encoding which would be used in later parallel corpora projects, such as the Croatian-Slovene parallel corpus, which was approved by both Ministries of Science in July 1999 and was effectively launched in October 1999. This fact must be pointed out. It is this very spot where TELRI could find affirmation of its international efforts: two members of the TELRI Association were able to launch a bilateral project approved at the formal national level of their Ministries of Science. Formally, this project exists outside the TELRI framework, but without the TELRI “stirring pot,” it would not have been possible. In corpora collection, there are several factors which should be kept under control. The representativeness of the corpus is one of them – an ideal which is hard to achieve, yet everyone is trying to come to its vicinity. The situation is even

JB[v.20020404] Prn:15/02/2007; 14:44

F: BCT809.tex / p.3 (163-206)

Procedures in building Croatian-English corpus

worse in the case of parallel corpora since the demand for parallelism narrows the already limited choice of texts. Also for languages with a small number of speakers and/or translators such as Croatian, one can be happy to obtain any valuable translations. The outcome is usually a rather unbalanced set of bitexts because you have to take whatever you can get in digital form. It would be “methodologically cleaner” to have a corpus originating from one text source, which you could call Corpus of This-and-That. Fortunately, we found ourselves in such a situation. The source of texts is the newspaper Croatia Weekly, published by HIKZ (Croatian Institute for Culture and Information) from the beginning of 1998 until May 2000. The publication is similar to USA Today in a Croatian way – it covers different domains: politics (internal and foreign), economy and finance, tourism, ecology, culture, art, sports, and events, and it is intended for the public abroad. It is contains 16 pages (including 4 pages for advertising) giving us an average of 14 400 tokens per issue for Croatian and 17 400 for English. The last issue published is number 118, and we have access to the digital form of all texts in both languages except for the first 5 issues. Thus 113 issues provides approximately 1.6 Mw for Croatian and approximately 1.9 Mw for English. The only problem which could cast a shadow on our “methodological happines” is the fact that the most popular weekly in Croatia, Nacional, which is one of the most important Croatian language sources for our Croatian National Corpus, started with English translations on its Web page. These translations cover approximately 15% of the original Croatian texts. Now, choosing the text candidates for the corpus, we are in the position to decide between “methodological purity” and the size as well as topic variation. For the time being, we will stick to only one text source – Croatia Weekly. In future versions of the corpus, texts from other sources will be included.

. Making the corpus . Platform Surprisingly, our platform is not UNIX – all software (commercial, shareware, and custom made) runs on Windows 9*/NT. A few years ago this would be peculiar, but today, when language technologies have already descended to the market level, it seems to be a mere technical exercise. . Text formats Croatian texts, delivered by the publisher to a professional translation bureau, are available in “bare ASCII” format, completely stripped of any markup. Thus, for



JB[v.20020404] Prn:15/02/2007; 14:44



F: BCT809.tex / p.4 (206-318)

Marko Tadi´c

the Croatian half of the pair, markup has to be done by macros and scripts used in commercial text-processors (MS Word 97). The English texts are supplied in typesetting format (QuarkXPress 3.32); we extract them as RTF files and process them further. . Conversion We have designed an application called 2XML and engaged an independent software company to do the programming work. The application performs conversion by applying user-defined scripts to input in the form of RTF or HTML file, resulting in output, delimited at the beginning and at the end with ..., which is “full blown” XML. Figure 1 gives the overview of the script-editing page of the 2XML application. The conversion is made in two steps: 1) the program produces the “dirty” XML with

marked only, where certain HTML and/or RTF attributes (typeface name, font size, margin alignment, style name, etc.) are preserved (Fig. 2 shows just few of them);

Figure 1.

JB[v.20020404] Prn:15/02/2007; 14:44

F: BCT809.tex / p.5 (318-318)

Procedures in building Croatian-English corpus

Figure 2.

Figure 3.



JB[v.20020404] Prn:15/02/2007; 14:44



F: BCT809.tex / p.6 (318-329)

Marko Tadi´c

2) The user-defined script is run on the “dirty” XML file, producing the final, “clean” XML file where HTML and/or RTF attributes, preserved from the first stage, are replaced by XML opening and closing tags. Usually these are different kinds of

s and s with their specific attribute values defined by script (Fig. 3). TOKEN

FILE NAME

BYTE OFFSET

FLAG

<S> Predsjednik Tud-man primio Kinkela , Vedrinea i Primakova <S> Tud-man : Hrvatska vojno , gospodarski i sigurnosno orijentirana na europske integracije . <S> Ministri Vedrine i Kinkel uputili zahtjev Hrvatskoj da izradi konkretan plan povratka izbjeglica .

CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr CW011199803260101hr

1 7 25 41 44 56 63 70 77 79 88 90 99 103 110 126 129 135 137 146 151 153 165 167 178 191 194 203 214 215 220 223 232 240 242 249 257 265 275 278 285 295 300 309 319

X X X X R R R R I R R R X X X X R I R R I R R R R R R R I X X R R R R R R R R R R R R R I

Figure 4.

JB[v.20020404] Prn:15/02/2007; 14:44

F: BCT809.tex / p.7 (329-416)

Procedures in building Croatian-English corpus <S><W type="R">Predsjednik <W type="R">Tuđman <W type="R">primio <W type="R">Kinkela<W type="I">, <W type="R">Vedrinea <W type="R">i <W type="R">Primakova <S><W type="R">Tuđman<W type="I">: <W type="R">Hrvatska <W type="R">vojno<W type="I">, <W type="R">gospodarski <W type="R">i <W type="R">sigurnosno <W type="R">orijentirana <W type="R">na <W type="R">europske <W type="R">integracije<W type="I">. <S><W type="R">Ministri <W type="R">Vedrine <W type="R">i <W type="R">Kinkel <W type="R">uputili <W type="R">zahtjev <W type="R">Hrvatskoj <W type="R">da <W type="R">izradi <W type="R">konkretan <W type="R">plan <W type="R">povratka <W type="R">izbjeglica<W type="I">.

Figure 5.

All that has to be done after the conversion is to attach the header, and then the completely formatted XML document is ready for inclusion into the corpus. At the time of writing this paper, the 2XML application is in high beta stage. . Sentence delimiting Sentence marking is accomplished by script applied by shareware Search&Replace V3.0 by Funduc Software Ltd. which allows regular expressions, scripts, etc. The <S> insertion is done in a familiar way, after punctuation followed by a capital letter. After that, output is filtered for exceptions like dr., prof., mr., ms., miss., ing., st., sv., initials, etc. Tokenizer is another tool which comes in the bundle with 2XML. It analyzes XML input, delimits tokens, and flags them as R (= word), B (= number), I (= punctuation), X (= XML tag). Output can be a tabbed file for input in database (Fig. 4) or tokenized in form like (Fig. 5), which is suitable for further processing. But prior to the word segmentation, sentence alignment has to be performed. . Aligning Alignment at the sentence level is in the test stage. We are testing two alignment programs: a translation memory database system DéjàVu 2.3.82 by Atril and Vanilla aligner by Pernilla Danielson and Daniel Ridings (Danielsson and Ridings 1997).



JB[v.20020404] Prn:15/02/2007; 14:44

F: BCT809.tex / p.8 (416-510)

 Marko Tadi´c

. Aligning with DéjàVu2 The demo version of the DéjàVu translation memory database system has a fully functional aligning module with a rather user friendly interface (Fig. 6). Export from that translation memory database to TMX by means of a built-in export filter would look like in (Fig. 7). Does it look OK? Definitely not, because all levels above <S> are incorporated in and <SEG> elements and that is not what we would expect. Besides, Figure 6 demonstrates that there is a lot of discrepancy in alignment between languages which requires a great deal of manual post-processing. . Aligning with Vanilla aligner3 Vanilla aligner (DOS version) gives better results with less alignment mistakes, even in one-to-many cases, but neither its interface is friendly nor is its output encoded the way we wanted (Fig. 8). The same problem of higher element levels incorporated in aligned segments is still present. So we have . . . . Encoding problems How to store alignments? Do we have a common way to encode them since we use XML? There is a number of ways how to do it now both in SGML and XML encoding: 1. Pointers stored in separate document: 1.1. Corpus encoding standard (Ide 1998 and CES4 ) are defined in SGML, with extensive use of ID attributes in <S> elements and pointers to them (example from CES 5.3.4.2): DOC1: <s id=p1s1>According to our survey, 1988 sales of mineral water and soft drinks were much higher than in 1987, reflecting the growing popularity of these products. <s id=p1s2>Cola drink manufacturers in particular achieved aboveaverage growth rates.

. Here I would like to express our thanks to Tomaž Erjavec and his colleagues from Ljubljana who gave us invaluable advice and tried to save us from wandering around. How much they succeeded in that respect is yet to be seen. . Here I would like to express our thanks to Milena Slavcheva and the Mannheim TELRI team, who provided us with that software. See also http://nl.ijs.si/telri/Vanilla . See http://www.cs.vassar.edu/CES/

JB[v.20020404] Prn:15/02/2007; 14:44

F: BCT809.tex / p.9 (510-510)

Procedures in building Croatian-English corpus

Figure 6.

Figure 7.



JB[v.20020404] Prn:15/02/2007; 14:44

F: BCT809.tex / p.10 (510-528)

 Marko Tadi´c

Figure 8.

DOC2:

<s id=p1s1>Quant aux eaux minérales et aux limonades, elles rencontrent toujours plus d’adeptes. <s id=p1s2>En effet, notre sondage fait ressortir des ventes nettement supérieures à celles de 1987, pour les boissons à base de cola notamment. ALIGN DOC:

1.2. TEI Lite DTD was converted to XML in May 1999 by Patrice Bonhomme.5 Since we are using XML, this is the possible candidate for our encoding system. The usage of pointers to IDs and storage to different documents remains very much the same as in CES. 2. Translation memory (TMX6 ) inspired type of alignment encoding:

. See http://www.loria.fr/∼bonhomme/XML and http://www.loria.fr/ ∼bonhomme/xteilite0_6.zip . See http://www.lisa.unige.ch/tmx/

JB[v.20020404] Prn:15/02/2007; 14:44

F: BCT809.tex / p.11 (528-599)

Procedures in building Croatian-English corpus 

2.1. Since we have chosen XML, one would expect to use the PLUG project DTD7 which groups sentences in segments like the example from Tiedemann (1998: 11): <doc.body> <seg lang=’sv’> <s> Eders Majestäter, Eders Kungliga Högheter, herr talman, ledamöter av Sveriges riksdag! <seg lang=’en’> <s> Your Majesties, Your Royal Highnesses, Mr Speaker, Members of the Swedish Parliament.

The problem with this encoding system is that all upper levels of markup are lost since the of the document is reorganized in a string of elements. These elements further contain <SEG> elements which are actually aligned and accompanied with explicit language markers. Actual <S> elements are embedded in <SEG>s. 2.2. The ELAN Slovene-English parallel corpus8 was encoded in TEI SGML. The TEI element was redefined to be a string of translation units ( elements) which are formed by pairs of aligned <SEG> elements:9 <seg lang="sl"><w>Slovenija <w>je <w>ozemeljsko <w>enotna <w>in <w>nedeljiva <w>država. <seg lang="en"><w>Slovenia <w>is <w>a <w>territorially <w>indivisible <w>state.

In this solution, it is important to notice that the <SEG> element is not composed of <S> but, unlike in the PLUG project, of <W> and elements. The proper alignment between the sentences is not marked explicitly, but they are deductible from <SEG> opening and closing tags as well as from the . In Tiedemann (1998: 8). See also http://numerus.ling.uu.se/∼corpora/plug . Erjavec (1999a: 27). See also http://nl.ijs.si/elan/ . Erjavec (1999b: 4).

JB[v.20020404] Prn:15/02/2007; 14:44

F: BCT809.tex / p.12 (599-663)

 Marko Tadi´c

elements which could serve as the sentence-boundary markers in the case of alignment which is not one-to-one.10 Similar to the PLUG DTD, to which this solution also referres, all upper-level encoding (

s, s, etc.) is lost. Is there a way to keep aligned sentences together in the same element while retaining upper levels of text encoding? Could it be possible in the same document to have aligned only those parts of document structure which show actual translation and keep the rest of structure unique for both languages? Ideally that would look like a structure with preserved higher levels and aligned <SEG> elements just above the <S> level. That kind of encoding is certainly more readable for humans and needs less text storage (Fig. 9). <SEG lang="hr"> <S>Ovdje je reč enica 1 koja uključuje i 2. <SEG lang="en"> <S>Here comes the sentence No 1. <S>This is sentence No 2. ... ...

<SEG lang="hr"> <S>Ovdje je rečenica 3. <SEG lang="en"> <S>Here comes the sentence No 3. ... ...

...

Figure 9.

. Part of the Slovene-English ELAN corpus, namely, the Orwell’s 1984 component, has <S> elements marked inside <SEG> elements.

JB[v.20020404] Prn:15/02/2007; 14:44

F: BCT809.tex / p.13 (663-726)

Procedures in building Croatian-English corpus 

Although this kind of encoding looks attractive, there are several remarks, which can be said about it. First of all, the DTD would have to be more complicated because the element should be included in virtually any element which allows

. However, it stands in conflict with the general demand, formulated in CES, for keeping the original document unchanged as much as possible. That demand is even unavoidable with read-only source documents (see Thompson and McKelvie 1997). Furthermore, the type of encoding shown in Figure 9 is actually redundant and can be generated from the documents encoded by the system mentioned in point 1.1. or 1.2. Since our Croatian-English Parallel Corpus project is at the beginning, the decision about the alignment encoding system remains to be made in the near future. However, it seems that for <S> elements alignment we would have quite a lot of checking. The amount of “handwork” can be seen from statistics that show significant discrepancy in the number of <S> and <W> elements in Croatian and English: CW010:

CW011:

CW012:

CW013:

Avg.

<S> <W>

<S> <W>

Hr 195 729 15483 178 675 14853 174 652 17317 174 679 17163 180.25 683.75 16204

En 195 796 18176 178 754 17602 174 733 20193 174 767 19902 180.25 762.50 18968.25

% increase 9.2 17.4 11.7 18.5 12.4 16.6 13.0 16.0 11.5 17.1

The first question coming to one’s mind is: Is it a regular difference or the result of inadequate translation? The ELAN Slovene-English parallel corpus shows an even stronger tendency towards EN token prevalence: SI: 510 533 and EN: 632 218 meaning a 23.8% increase. The <S> correspondence between Slovene and English is also mentioned (SI: 25 572 and EN: 24 993 meaning a 2.3% decrease), but in Vintar (1999: 64), it is not clear how those numbers were acquired. They could not have been investigated without a further sentence segmentation of the original corpus data because of the type of encoding used and described above in 2.2. Here the <S>-element Slovene-English correspondence is different from Croatian-English, and it is probabily due to the fact that the Croatian-English corpus is collected from only one source while Slovene-English is compiled from

JB[v.20020404] Prn:15/02/2007; 14:44

F: BCT809.tex / p.14 (726-782)

 Marko Tadi´c

15 different text sources. It would be interesting to see data from other Slavic languages paired with English in a parallel corpus.

. Conclusions This paper presented the starting-point of the collecting and encoding of the Croatian-English Parallel corpus. As we proceed with the development of this language resource, where lack of Croatian language was more than evident, the referring data will be made available on http://www.hnk.ffzg.hr/hr-en_pcorp. What is important at this point is the completion of the alignment along with the decision about its encoding. Further steps would be widening the corpus with texts from other sources and including the refined annotation, particularly at the <W> level. Lemmatization and MSD for English should not be a problem today, but for Croatian, we plan cooperation with our Croatian National Corpus project where the module for Croatian lemmatization and MSD annotation of corpora is being developed11 in cooperation with MulTextEast V2 initiative.

. Acknowledgements The author would like to thank Ivana Simeon and Krešimir Šojat for work done in the process of converting the original files. Thanks is due to the Croatian Institute for Culture and Information, the publisher of Croatia Weekly who provided us with source texts for this corpus.

References Ahrenberg, Lars; Merkel, Magnus; Ridings, Daniel; Sågvall Hein, Anna; and Tiedemann, Jörg. 1999. Automatic processing of parallell corpora: A swedish perspective. (http:// numerus.ling.uu.se/∼corpora/plug/) Allen, Jeffrey and Christopher Hoga. 1998. “Expanding Lexical Coverage of Parallel Corpora.” First International Conference on Language Resources and Evaluation, LREC’98. 747–754. Granada: ELRA. Bujas, Željko. 1967. “Concordancing as a Method in Contrastive Analysis.” Studia Anglica et Romanica Zagrabiensia, 23: 49–62. Bujas, Željko. 1969. “Computers in the Yugoslav Serbo-Croatian/English Contrastive Analysis Project.” ITL Review for Applied Linguistics, Spring 1969: 35–42.

. For the Croatian National Corpus visit http://www.hnk.ffzg.hr

JB[v.20020404] Prn:15/02/2007; 14:44

F: BCT809.tex / p.15 (782-835)

Procedures in building Croatian-English corpus 

Bujas, Željko. 1975. “Computers in the Yugoslav Serbo-Croatian – English Contrastive Project.” Bilten Instituta za lingvistiku Zagreb 1: 44-58. Danielsson, Pernilla and Ridings, Daniel. 1997. “Practical presentation of a “vanilla” aligner,” ed. by Reyle, U. and Rohrer, C. Presented at the TELRI Workshop on Alignment and Exploitation of Texts. Institute Jožef Stefan, Ljubljana (http://svenska.gu.se/ PEDANT/workshop/workshop.html). Erjavec, Tomaž; Lawson, Ann; Romary, Laurent. 1998. East meets West – A Compendium of Multilingual Resources. 2 CD-ROMs. Mannheim: TELRI-IDS. Erjavec, Tomaž. 1999a. “Making the ELAN Slovene/English Corpus.” Proceedings of the Workshop Language Technologies – Multilingual Aspects, ed. by Vintar, Špela. 23–30. Ljubljana: Department of Translation and Interpreting, Faculty of Arts, Univ. of Ljubljana. (http://nl.ijs.si/et) Erjavec, Tomaž. 1999b. “A TEI encoding of aligned corpora as translation memories.” Proceedings of the EACL-99 Workshop on Linguistically Interpreted Corpora (LINC-99), Bergen. ACL. Ide, Nancy. 1998. “Corpus Encoding Standard: SGML guidelines for encoding linguistic corpora.” First International Conference on Language Resources and Evaluation, LREC’98. 463-470 Granada: ELRA. (http://www.cs.vassar.edu/CES/) Thompson, Henry and David McKelvie. 1997. “Hyperlink semantics for standoff markup of read-only documents.” SGML Europe’97. (http://www.ltg.ed.ac.uk/∼ht/sgmleu97.html) Tiedemann, J. 1998. Parallel corpora in Linköping, Uppsala and Göteborg (PLUG). Work package 1. Department of Linguistics, Uppsala University. (http://numerus.ling.uu.se/ ∼corpora/plug/) Vintar, Špela. 1999. “A Lexical Analysis of the IJS-ELAN Slovene-English Parallel Corpus.” Proceedings of the Workshop Language Technologies – Multilingual Aspects, ed. by Vintar, Š. 63–70. Ljubljana: Department of Translation and Interpreting, Faculty of Arts, Univ. of Ljubljana.

JB[v.20020404] Prn:15/02/2007; 15:06

F: BCT810.tex / p.1 (45-150)

Corpus linguistics and lexicography* Wolfgang Teubert

Corpus linguistics – More than a Slogan? During the last decade, it has been common practice among the linguistic community in Europe – both on the continent and on the British Isles – to use corpus linguistics to verify the results of classical linguistics. In North America, however, the situation is different. There, the Philadelphia-based Linguistic Data Consortium, responsible for the dissemination of language resources, is addressing the commercially oriented market of language engineering rather than academic research, the latter often being more interested in universal grammar or semantic universals than in the idiosyncrasies of natural languages. American corpus linguists such as Doug Biber or Nancy Ide and general linguists who are corpus users by conviction such as Charles Fillmore are almost better known in Europe than in the United States, which is even more astonishing when we take into account that the first real corpus in the modern sense, the Brown Corpus, was compiled in Providence, R.I., during the sixties. Meanwhile, European corpus linguistics is gradually becoming a sub-discipline in its own right. Unfortunately, during the last few years, this lead to a slight bias towards those ‘self-centred’ issues such as the problems of corpus compilation, encoding, annotation and validation, the procedures needed for transforming raw corpus data into artificial intelligence applications and automatic language processing software, not to mention the problem of standardisation with regard to form and content (cf. the long-term project EAGLES [Expert Advisory Group on Language Engineering Standards] and the transatlantic TEI [Text Encoding Initiative]). Today, these issues often tend to prevail over the original gain that the analysis of corpora may contribute to our knowledge of language. But it was exactly this corpus-specific knowledge that the first generation of European cor* This contribution is a revised version of my article ‘Korpuslinguistik und Lexikographie’ in Deutsche Sprache 4/99, pp. 292–313, translated into English by Norbert Volz.

JB[v.20020404] Prn:15/02/2007; 15:06



F: BCT810.tex / p.2 (150-212)

Wolfgang Teubert

pus linguists such as Sture Allen, Vladislav Andrushenko, Stig Johannson, Ferenc Kiefer, Bernard Quemada, Helmut Schnelle, or John Sinclair had in mind. In West Germany, the Institut für Deutsche Sprache was among the first institutions that considered the collection of corpus data as one of their permanent tasks; its corpora date back as early as the late sixties, although at that time most corpus data was still only used for the verification of research results gained from traditional methods. But has today’s corpus linguistics really advanced from there? The recent textbooks claiming to provide an introduction to corpus linguistics still do not add up to more than a dozen – all of them in English. Unfortunately, except for the commendable books of Stubbs 1996 and Biber, Conrad and Reppen 1998, they do deplorably little to establish corpus linguistics as a linguistic discipline in its own right. Instead, they are focussing on the use of corpora and corpus analysis in traditional linguistics (syntax, lexicology, stylistics, diachrony, variety research) and applied linguistics (language teaching, translation, language technology). Corpus Linguistics by Tony McEnery and Andrew Wilson (McEnery and Wilson 1996) may serve as an example of this kind. Forty pages describe the aspects of encoding; 20 pages deal with quantitative analysis; 25 pages describe the usefulness of corpus data for computer linguistics with another 30 pages covering the use of corpora in speech, lexicology, grammar, semantics, pragmatics, discourse analysis, sociolinguistics, stylistics, language teaching, diachrony, dialectology, language variation studies, psycholinguistics, cultural anthropology and social psychology and the final 20 pages contain a case study on sublanguages and closure. McEnery and Wilson’s book reflects the current state of corpus linguistics. In fact, it more or less corresponds to the topics covered at the annual meetings held by the venerable IACME, an association dealing with English language corpora (cf. Renouf 1998). Semantics are mainly left aside. Surprisingly, when judged by their commercial value, it is not the written language corpora that are most successful, but rather speech corpora that can claim the highest prices. Speech corpora are special collections of some carefully selected text samples (words, phrases, sentences) spoken by numerous different speakers under various acoustic conditions. They caused the final breakthrough in automatic speech recognition that computer models based on cognitive linguistics failed to achieve for many years. The recognition of speech patterns was only made possible by a combination of categorial and probabilistic approaches towards a connectionist model trained on large speech corpora. Thus, speech analysis can thus be seen as an early impetus for the establishment of corpus linguistics as an independent discipline with its own theoretical background. Lexicography is the second major field where corpus linguistics not only introduced new methods, but also extended the entire scope of research, however, without putting too much emphasis on the theoretical aspects of corpus-based lexicography. Here again, it was John Sinclair who lead the way as initiator of

JB[v.20020404] Prn:15/02/2007; 15:06

F: BCT810.tex / p.3 (212-271)

Corpus linguistics and lexicography

the first strictly corpus-based dictionary of general language (COBUILD 1987). Britain was also the site of the first corpus-based collocation dictionaries (such as Kjellmer 1994). Bilingual lexicography may also benefit from a corpus-oriented approach: a fact that is evident when comparing the traditional Le Robert & Collins English-French Dictionary edited by B.T.S. Atkins with Valerie Grundy and Marie-Hélène Corréard’s Oxford-Hachette Dictionary which covers the same language pair. Here, the use of (monolingual) corpora lead to a remarkably greater number of multi-word translation units (collocations, set phrases) and to context profiles that had been written with the target language in mind. Wörter und Wortgebrauch in Ost und West [Words and Word Usage in East and West Germany] (1992) by Manfred W. Hellmann may serve as the only German example of that era, using the corpus for lemma selection rather than semantic description. Only recently, in 1997, did a true corpus-based dictionary appear: Schlüsselwörter der Wendezeit [Keywords during German Unification] by Dieter Herberg, Doris Steffens and Elke Tellenbach. Thus, at least in the field of written language, corpus linguistics is still in its infancy as a discipline with its own theoretical background – a statement which holds true not only for Germany but also for most other European countries. In this orientation phase, where corpus linguistics is still in the process of defining its position, most publications are in English, the language that has become interlingua of the modern world. But this does not mean that corpus linguistics is dominated mainly by English and American scholars: this can be clearly seen when browsing through any issue of the International Journal of Corpus Linguistics. Still, German linguistics appears somewhat underrepresented in this discussion. One exception is Hans Jürgen Heringer. His innovative study on ‘distributive semantics’ shows a growing reception of the programme for corpus linguistics which is outlined below. In his book Das höchste der Gefühle [The most sublime of feelings] (Heringer 1999), he describes the validation of semantic cohesion between adjacent words on the basis of larger corpora. Above all, it is this area between lexis and syntax where corpus linguistics offers new insights.

Corpus linguistics – A programme Corpus linguistics believes in structuralism as defined by John R. Firth; therefore, it insists on the notion that language as a research object can only be observed in the form of written or spoken texts. Neither language-independent cognition nor propositional logic can provide information on the nature of natural languages. For these are, as stated in an apophthegm by Mario Wandruszka, characterised by a mixture of analogy and anomaly. The quest for a universal structure of grammar and lexicon which is typical for the followers of Chomsky or Lakoff cannot meet



JB[v.20020404] Prn:15/02/2007; 15:06



F: BCT810.tex / p.4 (271-301)

Wolfgang Teubert

the demands of these two aspects.1 Instead, corpus linguistics is closer to the semantic concept inherent in the continental European structuralism of Ferdinand de Saussure, which regards the meaning as inseparable from the form, that is, the word, the phrase, the text. In this theory, the meaning does not exist per se. Corpus linguistics rejects the ubiquitous concept of the meaning being ‘pure information,’ encoded into language by the sender and decoded by the receiver. Corpus linguistics, instead, holds that content cannot be separated from form, rather they constitute the two aspects under which texts can be analysed. The word, the phrase, the text is both form and meaning. The above statement clearly outlines the programme of corpus linguistics. It is mainly interested in those phenomena on the fringe between syntax and lexicon, the two subjects of classical linguistics. It deals with the patterns and structures of semantic cohesion between text elements that are interpreted as compounds, multi-word units, collocations and set phrases. In these phenomena, the importance of the context for the meaning becomes evident. Corpus linguistics extends our knowledge of language by combining three different approaches: the (procedural) identification of language data by categorial analysis, the correlation of language data by statistical methods and finally the (intellectual) interpretation of the results. Whilst the first two steps should be done automatically as much as possible, the last step requires human intentionality, as any interpretation is an act involving consciousness and, therefore, not transmutable into an algorithmic procedure. This is the main difference between corpus linguistics and computational linguistics, which reduces language to a set of procedures. Corpus linguistics assumes that language is a social phenomenon, to be observed and described above all in accessible empirical data – as it were, communication acts. Corpora are cross-sections through a universe of discourse which incorporates virtually all communication acts of any selected language commu. The rules that those followers of a universal grammar hope to find in their quest for the language organ are not based on deductions of analogy. Whereas rules based on innateness had been the central factor in Chomskyan language theory until recently (cf. Stephen Pinker in The Language Instinct [Pinker 1994]), Pinker now sees language faculty as an interaction between ‘distinct mental mechanisms’ which is not yet fully explored, namely, the ‘symbolic computation’ [i.e., the algorithmic processing of uninterpreted symbols] as opposed to the ‘memory’ [i.e., recollection], the latter being responsible for the assignment of form and meaning of symbols (Pinker 1999). The memory is seen as partly associative – an appropriate term for its description could be ‘connectionist network’. However, Pinker still sees ‘symbolic computation’ as a strictly rule-based process. We may assume that this tentative change in attitude towards language faculty and the extent of its genetic embedding might be partly due to Terrence W. Deacon’s convincing explanation of first language acquisition which does without any language organ (Deacon 1997).

JB[v.20020404] Prn:15/02/2007; 15:06

F: BCT810.tex / p.5 (301-353)

Corpus linguistics and lexicography

nity, be it monolingual (e.g., German or English), bilingual (e.g., South Tyrolean, Welsh) or multilingual (e.g., Western European). However, the majority of texts that are preserved and made accessible through corpora in principle only have a limited life-span: most printed texts such as newspaper texts are out of public reach within a very short time. If we consider language as a social phenomenon, we do not know – and do not want to know – what is going on in the minds of the people, how the speaker or the hearer is understanding the words, sentences and texts that they speak or hear. Language as a social phenomenon manifests itself only in texts that can be observed, recorded, described and analysed. Most texts happen to be communication acts, that is, interactions between members of a language community. An ideal universe of discourse would be the sum of all communication acts ever uttered by members of a language community. Therefore, it has an inherent diachronic dimension. Of course, this ideal universe of discourse would be far too large for linguistics to explore it in its entirety. It would have to be broken down into crosssections with regard to the phenomena that we want to describe. There is no such thing as a ‘one-size-fits-all’-corpus. It is the responsibility of the linguist to limit the scope of the universe of discourse in such a way that it may be reduced to a manageable corpus, by means of parameters such as language (sociolect, terminology, jargon), time, region, situation, external and internal textual characteristics, to mention just a few. When looking towards language as a social phenomenon, we assume that meaning is expressed in texts. What a text element or text segment means is the result of negotiation among the members of a language community, and these negotiations are also part of the discourse. Thus, the language community sets the conventions on the formal correctness of sentences and on their meaning. Those conventions are both implicit and dynamic; they are not engraved in stone like commandments. Any communication act may utilise syntactic structures in a new way, create new collocations, introduce new words or redefine existing ones. If those modifications are used in a sufficient number of other communication acts or texts, they may well result in the modification or amendment of an existing convention. One basic difference between natural and formal languages is the fact that natural language not only permits but actually integrates metalinguistic statements without explicitly marking the metalinguistic level. There is no separation between object language and metalanguage. Any convention may be discussed, questioned or even rejected in a text. Above all, discourses deal with meaning, and it is corpus linguistics that is best suited to deal with this dynamic aspect of meaning. We, as linguists, have no access to the cognitive encoding of the conventions of a language community. We only know what is expressed in texts. Dictionaries, grammars, and language textbooks are also texts; therefore, they are part of the universe of discourse. As long as they represent socially accepted standards, we



JB[v.20020404] Prn:15/02/2007; 15:06



F: BCT810.tex / p.6 (353-415)

Wolfgang Teubert

have to consider their special status. Still, their contents are neither comprehensive nor always based on factual evidence. Corpus linguistics, on the other hand, aims to reveal the conventions of a certain language community on the basis of a relevant corpus. In a corpus, words are embedded in their context. Corpus linguistics is, therefore, especially suited to describe the gradual changes in meaning: it is the context which determines the concrete meaning in most areas of the vocabulary.

Cognitive linguistics, logical semantics and corpus linguistics People normally – if they are not linguists, that is – listen to or read texts because of their meaning. They are interested in the syntactic features of phrases, sentences or texts only insofar as is necessary for understanding them. Meaning is the core feature of natural language, and this is the reason why semantics is the central linguistic discipline. Still, regardless of the enormous progress that phonology, syntax and many other disciplines have made, when it comes to explaining and describing the meaning of phrases, sentences, and texts, we are far from a consensus. As said above, corpus linguistics regards language as a social phenomenon. This implies a strict division between meaning and understanding. Is it really the task of linguistics to investigate how the speaker and the listener understand the words, sentences or texts that they utter or perceive? Understanding is a psychological, a mental, or – in modern words – a cognitive phenomenon. This is why no bond exists between cognitive linguistics and corpus linguistics. Language as a social phenomenon is laid down in texts and only there. If we, as corpus linguists, wish to find out how a text is understood, we have to ask the listeners for paraphrases; these paraphrases, being texts themselves, again become part of the discourse and can become the object of linguistic analysis. The difference between cognitive linguistics and corpus linguistics lies in how each deals with the unique property of language to signify. Any text element is inevitably both form (expression) and meaning. If you delete the form, the meaning is deleted as well. There is no meaning without form, without an expression. Text elements and segments are symbols, and being symbols, linguistic signs, they can be analysed in principle under two aspects: the form aspect or the meaning aspect. The consequence of this stance is that the only way to express the meaning of a text element or a text segment is to interpret it, that is, to paraphrase it. This is the stance of hermeneutic philosophy, as opposed to analytic philosophy (cf. Keller 1995, Jäger [2000]). In cognitive linguistics, which is embedded in analytic philosophy, meaning and understanding is seen as one. Here, text elements and text segments correspond to conceptual representations on the mental level. Within this system,

JB[v.20020404] Prn:15/02/2007; 15:06

F: BCT810.tex / p.7 (415-461)

Corpus linguistics and lexicography

however, it is not clear what the term ‘representation’ means. Does it refer to content linked with a form (what we could call presentations) or does it refer to pure content disconnected from form (what we could call ideations)? This ambiguity is of vast consequence (Janik and Toulmin 1973: 133), as presentations themselves are signs, that is, symbols, and thus need to be understood, that is, interpreted. Cognitive linguistics, however, does not tell us how this is to happen. Rather, it describes the manipulation of mental representations as a process (whereas an interpretation is an act, presupposing intentionality). Processes themselves are meaningless. It is only the act of interpretation that assigns meaning to them. Both Daniell Dennet and John Searle point out this aporia of the cognitive approach. In their opinion, the mental processes would again require a central meaner (Dennet 1998: 287f.) or homunculus (Searle 1992: 212f.) on a level higher than cognition, that is, for understanding mental representations, and the same would then apply for that level, too, and so on, ad infinitum. On the other hand, if we translate ‘representation’ with ‘ideation,’ we dismiss the assumption of the symbolic character of language. The meaning of a word, a sentence or a text would then correspond to something immaterial, something without form, formulated in a so-called ‘mental language,’ whose elements would consist of either complex or atomistic concepts, depending whether one refers to Anna Wierzbicka and the early Jerry Fodor (Wierzbicka 1996, Fodor 1975) or to the later Jerry Fodor (Fodor 1998). On a large scale, these concepts of cognitive linguistics seem to correspond to words, but the difference lies in the fact that they are not material symbols which call for interpretation, but instead they are pure astral ideation, not contaminated by any form (cf. Teubert 1999). In practice, particularly in artificial intelligence and automatic translation, this cognitive approach has failed. Alan Melby gave a plausible explanation why it was due to fail no matter which formal language had been defined for encoding the conceptual representations: “The real problem could be that the languageindependent universal sememes we were looking for do not exist. . . [O]ur approach to word senses was dead wrong.” (Melby 1995: 48) It seems that the idea behind cognitive linguistics is the transduction or translation of phrases, sentences and texts in natural language, that is, of symbolic units, into an obviously language-independent ‘language of thought’ or ‘mentalese,’ which is non-symbolic and is exclusively defined by syntax. This transduction or translation is seen as a process and does not involve intentionality. Cognitive linguistics is committed to the computational model of mind. According to this theory, mental representations are seen as structures consisting of what is called uninterpreted symbols, while mental processes are caused by the manipulation of these representations according to rule-based, that is, exclusively syntactic, algorithms. But does it really make sense to use the term ‘symbols’ for these mental representation units, just as we call words ‘linguistic signs’? On a cognitive (or



JB[v.20020404] Prn:15/02/2007; 15:06



F: BCT810.tex / p.8 (461-509)

Wolfgang Teubert

computative) level, those entities are only symbols inasmuch as a content can become assigned to them from the outside of the mental (or computational) calculus. This content or meaning, however, does not affect the permissibility of manipulations with regard to their representation. The content of a text consisting of linguistic signs, on the other hand, is something inherent to the text itself (and not assigned from the outside), a feature we can and must investigate if we want to make sense of a text. As Rudi Keller has pointed out, the symbols of natural language are suitable for and in need of interpretation (Keller 1995). What appeals to many researchers of semantics is the fact that in cognitive semantics the meaning of a text is expressed through a calculation whose expressions are based exclusively on syntactic rules, or in other words, that semantics is transformed into syntax. They take it for granted that this is possible, as they claim that both natural and formal language are working with symbols. But in natural language, these symbols need to be interpreted whereas symbols in formal languages work without being assigned a certain (external) definition. Whether a formal language, a calculus, permits a certain permutation of symbols or not has nothing to do with the meaning or the definition of these symbols, it is just a question of syntax. As early as 1847, George Boole stated: “Those who are acquainted with the present state of the theory of Symbolic Algebra, are aware that the validity of the processes of analysis does not depend upon the interpretation of the symbols which are employed, but solely upon the laws of their combination.” Richard Montague also believes in the possibility of describing natural language semantics the same way as formal language semantics: “There is in my opinion no important theoretical difference between natural languages and the artificial languages of logigicians; indeed, I consider it possible to comprehend the syntax and semantics of both kinds of languages within a single natural and mathematically precise theory. On this point I differ from a number of philosophers, but agree, I believe, with Chomsky and his associates.” (Both quotes from Devlin, 1997: 73, 117.) From the point of view of corpus linguistics, the meaning of natural language symbols, of text elements or text segments is negotiated by the discourse participants and can be found in the paraphrases they offer, and it is contained in language usage, that is, in context patterns. Natural language symbols refer not so much to language-external facts, but rather they create semantic links to other language signs. The meaning of a text segment is the history of the use of its constituents. Linguistic signs always require interpretation. Whoever understands a text is able to interpret it. This interpretation can be communicated as a text in itself, a paraphrase of the original text. The act of interpretation requires intentionality, and therefore, cannot be reduced to a rule-based, algorithmic, ‘mathematically precise’ procedure. If we see language as a social phenomenon, natural language semantics can leave aside the mental or cognitive level. Everything that can be said

JB[v.20020404] Prn:15/02/2007; 15:06

F: BCT810.tex / p.9 (509-564)

Corpus linguistics and lexicography

about the meaning of words, phrases or sentences will be found in the discourse. Anything that cannot be paraphrased in natural language has nothing to do with meaning. In a nutshell, this is the core programme that distinguishes corpus linguistics from cognitive linguistics.

Collocation and meaning In traditional linguistics, it is rather difficult to pinpoint the difference between a collocation such as harte Auseinandersetzung (hefty discussion) and a free combination such as harte Matratze (hard mattress). In corpus linguistics, on the other hand, it is possible to trace this awareness among the members of a language community of a distinct semantic cohesion between the lexical elements of a collocation by statistic means, that is, by detecting a significant co-occurrence of these elements within a sufficiently large corpus. Before it was possible to procedurally and systematically process large amounts of language data, syntactic rules had been the only way to describe the complex behaviour of co-occurrence between textual elements (i.e., words). Such rules describe the relation between different classes of elements, for instance, between nouns and modifying adjectives. Still, syntactic descriptions such as ‘Adjective + Noun’ are not specific enough to detect collocations as distinct types of semantic relationships. Traditional lexicology fails to come up with a feasible definition for collocations that would allow their automatic identification in a corpus. To classify certain co-occurring textual elements as semantic units, that is, as collocations, it is necessary to recognise these text segments as recurrent phenomena, which is only possible within a sufficiently large corpus. Therefore, we must complement the intratextual perspective with its intertextual counterpart. By applying probabilistic methods, it is possible to measure recurrence within a virtual universe of discourse, or more precisely, within a real corpus. Collocation dictionaries in the strict sense are always corpus-based. Even so, the speaker’s competence is still needed to check statistically determined collocation candidates for their relevant semantic cohesion. The following case study aims to illustrate the potential of the corpus linguistic approach.

Case study 1: hart as collocator The collocation dictionary Kollokationswörterbuch Adjektive mit ihren Begleitsubstantiven (Teubert, Kervio-Berthou and Windisch [in preparation]), which is currently being compiled at the Institut für Deutsche Sprache, is based on the IDS corpora of about 320 million words. The 400 adjectives were mainly selected from basic vocabulary lists. Candidates for collocations were combinations of adjectives



JB[v.20020404] Prn:15/02/2007; 15:06



F: BCT810.tex / p.10 (564-640)

Wolfgang Teubert

and nouns showing a significantly higher frequency than the expected frequency based on the occurrence of the relevant single words. The occurrences are ranked according to significance: their overall frequency, thus, have no principal influence. The concept for the statistic procedures applied here was designed by Cyril Belica. It is up to the competent speaker to decide whether a sufficient lexical cohesion can be seen in the collocation candidates detected by the computer. Manually selected citations are provided in order to facilitate this interpretation. If a collocation candidate is translated into a foreign language as a whole instead of a word-by-word translation, this can be seen as evidence of a distinct semantic cohesion; therefore, we have added the English translation equivalents to our German examples. The example below covers rank 1–10 [for an explanation of the abbreviations see http://www.ids-mannheim.de/kt/cosmas.html]: Kern Rank: 1 Frequency: 63 WKB In der Treuhand selbst hat sich ein harter Kern aus früheren SED-Betonköpfen eingegraben. WKB Dennoch enthalte der Bericht einen “harten Kern an Wahrheit.” H68 Die “Kommandoebene,” der harte Kern der RAF, umfaßt 25 bis 30 Mitglieder. H87 Der “harte Kern” umfaßt 187 Personen. H87 [. . . ] ein sicherer Hinweis, daß sich die Betreffenden dem harten Kern der RAF angeschlossen haben. H87 140 eingeschriebene Soulmänner kamen regelmäßig, ein harter Kern von 50 Jugendlichen fast täglich. (Engl.: diehards/ hard core) Arbeit Rank: 2 Frequency: 94 WKD In harter Arbeit haben wir unseren Staat aufgebaut. (Überschrift) WKD Aber wir haben eben in dieser harten Arbeit alle noch ein bißchen zu lernen. H85 [. . . ] Risikobereitschaft und harte Arbeit sollen sich in Malaysia wieder lohnen. H86 Mangelnde persönliche Ausstrahlung machte er durch harte Arbeit, eiserne Disziplin und Willensstärke wett. WKD Ein Sommer härtester Arbeit steht bevor. H85 Die Technik macht es möglich, den Menschen von harter und übermäßiger Arbeit auch zeitlich zu entlasten. (Engl.: hard work) Währung Rank: 3 Frequency: 40 WKB Die Deutschen würden nicht nur durch eine harte Währung vereint. WKB Harte Währung soll mangelnden Geist wettmachen. WKD Doch wundersam ist die Umwandlung der Ostmark in harte Währung allemal. H87 Dann wäre es endgültig vorbei mit dem Glanz der einst härtesten Währung der Welt. (Engl.: hard currency) Schlag Rank: 4 Frequency: 24 BZK Das war für ihn ein harter Schlag. MK1 Ich habe eine junge Mannschaft, die einen harten Schlag verkraften kann, ohne zu zerbrechen. MK2 Es war ein harter, gezielter Schlag, der mich prompt von den Beinen holte. (Engl.: heavy blow) Drogen Rank: 5 Frequency: 20 H88 Außerdem sei ein immer stärker werdender Trend zu harten Drogen zu beobachten. H87 Kontakt zu harten Drogen hatte der Jugendliche bald bekommen [. . . ] (Engl.: hard drugs)

JB[v.20020404] Prn:15/02/2007; 15:06

F: BCT810.tex / p.11 (640-672)

Corpus linguistics and lexicography

Kritik Rank: 6 Frequency: 34 H86 Aber sie erfuhren schon damals von vielen Seiten harte Kritik. MK2 Harte Kritik am Biedenkopf-Plan. (Überschrift) H88 Zugleich übte er harte Kritik an der Landesregierung [. . . ] (Engl.: harsh criticism) Bandagen Rank: 7 Frequency: 12 H86 Beide Seiten schlagen derweil mit harten Bandagen zu [. . . ] WKB Der Kampf um Berlin als Hauptstadt wird mit harten Bandagen geführt. (Überschrift) (Engl.: taking one’s gloves off ) Kampf Rank: 8 Frequency: 30 MK1 Amerika müsse notfalls auf einen langen harten Kampf vorbereitet sein. H86 Die meisten sehen zu, daß sie im harten Kampf um die Zehntel für sich das Beste rausholen. BZK Verkaufsförderung gewinnt immer mehr Bedeutung im harten Kampf um die Gunst der Verbraucher. WKD Für sie geht es jetzt nicht einfach um einen harten Kampf um Arbeitsplätze. (Engl.: close fight) D-Mark Rank: 9 Frequency: 22 WKB Dann bekämen die DDR-Bürger harte DMark in die Hand und würden drübenbleiben. WKB Nichts hat Vormarsch und Endsieg der harten D-Mark aufhalten können. WKD Die harte D-Mark dient als Schmiedehammer. (Engl.: strong Deutschmark) Worte Rank: 10 Frequency: 25 H85 Harte Worte - Berliner Verhältnisse? H86 Selbst Außenminister Shultz benutzte harte Worte. H85 [. . . der] erste Vorsitzende der Gesellschaft, findet nicht minder harte Worte, um den Bruch zu begründen [. . . ] (Engl.: bitter words)

Discourse and meaning One of corpus linguistics’ most essential tenets is the assumption that the meaning of text elements and segments can be found solely in discourse. This assumption makes sense if we call to mind that in principle, every word or combination of words was once a neologism. Neologisms are introduced to the discourse by explicitly assigning certain meanings to new expressions, that is, by paraphrasing what a new word is supposed to mean. As stated above, we can determine meaning in two ways: by paraphrase and by usage. Neologisms, however, still lack the usage. They only become used once other participants of the discourse start using them either by accepting the proposed meaning or by negotiating the meaning by offering a new paraphrase. This also applies to those cases where a new meaning is assigned to an already existing word. It is obvious that we cannot go ‘back to the roots’ for all our established vocabulary; also, this is not how children learn the meaning of words. But even so, it is not simply the usage of words that leads to their meaning. In most cases, an act of explanation, very often by the parents, but sometimes also through picture-books, sets the starting point for language acquisition. Obviously, deictic references to reality (or images thereof) are of highest importance, but they are not



JB[v.20020404] Prn:15/02/2007; 15:06

F: BCT810.tex / p.12 (672-724)

 Wolfgang Teubert

understood without narrative explanations of words that describe what we have to watch out for in reality (or in images of reality). The meaning of school, for instance, cannot be explained by pictures of the building, classroom, teachers or pupils. In fact, only very few words relate to images unambiguously. Picture-book texts play a more important role with regard to the acquisition of word meanings than dictionaries. Since the times of the German lexicographers Adelung and Campe, the basic principle of German lexicography had been the assumption that the meaning of words can be found in text samples, a basic principle also for corpus linguistics. Nevertheless, corpus linguistics differs from traditional lexicography in various details. Firstly, corpus linguistics does not use corpora merely for examples: it explores them systematically. Secondly, corpus linguistic does not try to decontextualise the objects it describes. In other words, it does not abstract the meaning from the context. Thirdly, corpus linguistics tries to capture different usages in their correlation to different contexts, unlike traditional lexicography which tries to position word meanings upon a blueprint of a language-independent ontological concept (for instance, by genus proximum and differentia specifica). Fourthly, corpus linguistics is less interested in the single text element or word than in the semantic interaction between text elements and context. The following case study of Globalisierung [globalisation] aims to demonstrate that it is indeed the discourse (or in other words: our corpus) where information about the meaning of words can be found. The reason why we all seem to know the meaning of Globalisierung as it is used currently is the fact that we all have read those texts that explain Globalisierung. We cannot depict Globalisierung, any more than we can point at it. In its current use, Globalisierung is certainly a neologism. It is characteristic for the introductory phase of new words that the first citations show a large number of paraphrases, a fact that demonstrates the role of the discourse participants in negotiating meaning.

Case study 2: Globalisierung Globalisierung (Engl.: globalisation) as a non-lexicalised derivation has been, for a long time, part of our vocabulary. Its semantic vagueness is indicative of its non-lexicalised status. As nomen actionis or nomen resultativum, it has long been nothing more than the nominalisation of globalisieren. The presence of descriptive attributes is significant for its lack of semantic specification: metalingual indicators (like paraphrases), on the other hand, are almost totally absent. The following examples were found in the German daily Tageszeitung:

JB[v.20020404] Prn:15/02/2007; 15:06

F: BCT810.tex / p.13 (724-772)

Corpus linguistics and lexicography

Die Vorstellung [. . . ] der Globalisierung der Kleistschen Verzückung [. . . ] scheint mir denn doch eher märchenhaft. [14.10.89] Aber die Globalisierung von Politik, Ökonomie und Technologie dulde keinen partikularen Bezugspunkte mehr [. . . ] [05.06.92] Mit der Globalisierung der Lebensweise der modernen Zivilisation geht die Selbstaufhebung der [. . . ] Ideale und Grundüberzeugungen einher. [25.02.95]

As a neologism, Globalisierung manages to almost completely displace the original, non-lexicalised derivation only as late as in 1996. Suddenly, there is a distinct rise in frequency: whereas we have only about 160 citations from 1988 to the end of 1995, there are about 320 citations for 1996 alone. Also, most citations come without descriptive attributes: apparently, it is no longer necessary to explain what is being globalised. Finally, many citations show metalingual indicators (below printed in italics) that demonstrate how the discourse participants take part in assigning a meaning to the word, namely, the following examples: Die “Globalisierung” – ein etwas unscharfer Begriff, mit dem zugleich die Ausweitung des Handels, die Liberalisierung der Finanzmärkte, der Sieg der Freiheitsideologie, die unkontrollierte Macht der multinationalen Unternehmen, die Internationalisierung des Arbeitsmarktes und die Umstrukturierung der Volkswirtschaften gemeint ist – hat die Gewerkschaften weiter geschwächt. [12.01.96] Verbissener Konkurrenzkampf im Inneren und nach außen hin eine maximale Öffnung für Kapitel, Güter und Dienstleistungen. So lautet eine der möglichen Definitionen der Globalisierung. [12.01.96] [. . . ] die Globalisierung, das heißt die vollständige Liberalisierung aller Märkte auf der Welt [. . . ] [10.05.96] Lisa Maza [. . . ] sieht die Globalisierung völlig anders: Sie sei eine Fortsetzung der Kolonialisierung mit anderen Mitteln – zum Nachteil des Südens, der Armen und der Frauen. [08.06.96] Stichwort Globalisierung: In einer globalen Wirtschaft wird es auf Dauer kein geschütztes Umfeld für die Wirtschaft irgendeines Landes mehr geben. [27.07.96] Globalisierung bedeutet auch die Europäisierung des Globus, Kolonialismus, ökonomischer und ökologischer Imperialismus. [04.05.96] Denn in der Tat bedeutet Globalisierung Amerikanisierung, und zwar nicht nur der Weltwirtschaft, sondern auch eine normative Amerikanisierung. [11.10.96] Das Stichwort “Maastricht” und das Modewort “Globalisierung” sind zu Synonymen für sozialen Rückschritt geworden. [18.10.96] Typischerweise schweigen die Intellektuellen in Deutschland beharrlich zu Europa, Globalisierung und Zukunft der Arbeit [. . . ] [13.12.96]



JB[v.20020404] Prn:15/02/2007; 15:06

F: BCT810.tex / p.14 (772-816)

 Wolfgang Teubert

This is a brief list of comparable English citations taken from the Bank of English and shortened: What does globalisation mean? The term can happily accommodate all manner of things: expanding international trade, the growth of multinational business, the rise in international joint ventures and increasing interdependence through capital flows. Globalisation: Low wages in other countries contribute to low wages in the United States. Words like globalisation and outsourcing are now in common use. Watkins sees globalisation as a euphemism for a race to maximise profit by lowering workers’ pay and condition. As Mr. Keegan says, globalisation means that tax cuts for business are crucial. Globalisation represents an attempt to exploit South Korea’s enormous potential. But doesn’t globalisation mean world-wide sameness? Globalisation is still more a philosophy than a business reality. Globalisation comes in many flavours.

More so than other words, neologisms show that the meaning of words is to be found in the texts rather than in some discourse-external reality. The citations – be it in their virtual entirety within the universe of discourse or be it in some crosssection in a real corpus – are the meaning, and we may understand this meaning by interpreting the citation. The formulation of a dictionary entry for globalisation, however, is the responsibility of lexicography, not of corpus linguistics, whose main task – apart from finding the references – would instead be the correlation (by systematic context analysis) of the various sets of paraphrases and usage patterns to different parameters such as text type (newspaper), genre (politics/society), ideological stance and so on. Particularly in the area of ideologically controversial keywords, it seems as if a useful selection of citations can be more helpful to the user than traditional definitions.

Linguistic knowledge and encyclopaedic knowledge Corpus linguistics aims to analyse the meaning of words within texts, or rather, within their individual context. First and foremost, words are text elements, not lexicon or dictionary entries. Corpus linguistics is interested in text segments whose elements exhibit an inherent semantic cohesion which can be made visible

JB[v.20020404] Prn:15/02/2007; 15:06

F: BCT810.tex / p.15 (816-870)

Corpus linguistics and lexicography 

through quantitative analyses of discourse or corpus (Biber, Conrad and Reppen 1998). If the research focus is shifted from single words to text segments, the distinction between linguistic and encyclopaedic knowledge gradually becomes fuzzy. The word Machtergreifung (seizure of power), outside its context, may be described as an incident where a certain group, previously excluded from political influence, seizes the power by its own force and without democratic legitimation. However, we will interpret text segments such as braune Machtergreifung or die Machtergreifung im Jahre 1933 as referring to the ‘seizure of power by the Nazis’ without hesitation. Is this because these texts refer to a extralingual reality, to a language-independent knowledge? Although the majority of linguists would agree with this assumption, there may well be another, simpler, explanation: we have learned from a large number of citations, whenever braune Machtergreifung or Machtergreifung im Jahre 1933 is mentioned, this refers to the seizure of power by the Nazis and to nothing else. There is a co-occurrence between both expressions that may result, for instance, in an anaphoric situation: the expressions are paraphrases of each other. In the tradition of German lexicography, linguistic knowledge is separated from encyclopaedic knowledge by the process of decontextualisation, by the endeavour to describe the meaning of words unadulterated by the context in which they occur. If we detach all references from their relevant context, the isolated meaning remains. The different events of Machtergreifung that are dealt with in texts are viewed as references to a discourse-external reality. Corpus linguistics, on the other hand, above all is interested in the meaning of textual segments displaying a distinct semantic cohesion. Machtergreifung im Jahre 1933 is such a segment, and by projecting it upon our discourse (i.e., linguistic) knowledge, we are able to interpret it as ‘Nazi seizure of power’ without problem. If we are no longer limited to single words detached from their contexts and if we do away with decontextualisation, we can give up with the distinction between linguistic and encyclopaedic knowledge. For what we normally call encyclopaedic knowledge is in fact nothing but discourse knowledge. Everything we know and are able to know about the Nazi seizure of power is based on texts. Although some may even have witnessed one relevant incident or the other, their ability to interpret the whole course of events as Machtergreifung is also based on texts from other persons. If we reduce encyclopaedic knowledge to discourse knowledge, the distinction disappears. Let us take a look at the example klassische Rollenverteilung (traditional role allocation) (Spiegel 13, 1999: 128): Ein Zuhause wie ein Bilderbuchideal. Hier [. . . ] ist die klassische Rollenverteilung die Regel: Ein Elternteil kümmert sich um Haushalt und Kindererziehung, der an-

JB[v.20020404] Prn:15/02/2007; 15:06

F: BCT810.tex / p.16 (870-932)

 Wolfgang Teubert

dere verdient das Geld. Auch dieser traditionellen Familienvorstellung entspricht das Leben im Reihenhaus. [A home like a picture-book cliché. Here [. . . ] the traditional role allocation is still the rule: one parent takes care of the household and of bringing up the children, the other parent earns the family income. Also living in a terraced house contributes to this traditional image of family.]

Within the context of family/home, the meaning of the collocation klassische Rollenverteilung in the above example corresponds exactly to the sentence that may serve as definition: Ein Elternteil. . . [One parent. . . ]. Note the sublime subversive touch that is present here, characteristic of so many Spiegel articles: what seems to be a generally acceptable definition, actually shows an essential deviation from the traditional meaning of klassische Rollenverteilung – it does not distinguish between male and female. The above example aptly illustrates challenges and achievements of corpus linguistics. Firstly, it is not interested in the meaning of isolated words outside their relevant contexts, but instead in the meaning of semantically connected text segments, extracted from discourse or, in practice, from the corpus. In the context of home and family, klassische Rollenverteilung can be interpreted in different ways with regard to period and genre. If the above Spiegel-definition becomes the accepted thing, we may apply the term klassische Rollenverteilung even to gay or lesbian partnerships. For corpus linguistics, this implies a dynamic view of meaning. Every new reference may add to the meaning of a certain text segment; older meanings may fall into oblivion if they are not sanctioned by new evidence. The above example also shows that the ways in which meaning can be negotiated within the language community can be controversial indeed. It is not so long ago when lesbian partnership and family were two different meanings that could not be imagined, let alone used, synonymously. Corpus linguistics may thus serve as a useful instrument to detect changes of meaning that are essential to neology. Secondly, corpus linguistics is developing devices for the identification and extraction of potentially metalinguistic elements of citations, that is, of text elements co-occurring with a paraphrase, thus enabling the automatic extraction, processing and presentation of semantically relevant material from corpora. Phrases such as something is the rule; x means y; this is to say; we understand it as; it can be said etc. point to metalingual content. If the meaning of a semantically controversial textual segment is negotiated, we often find indicators such as: some time ago; in fact; strictly speaking; without doubt; wrongly etc. These indicators can give us important clues. Above all, it must be realised that just as the meaning of a text segment is a paraphrase found in earlier citations, peoples’ interpretations are also paraphrases and therefore part of the discourse. In principle, the meaning of a text element or a text segment is everything that has been said about it, in terms of a paraphrase

JB[v.20020404] Prn:15/02/2007; 15:06

F: BCT810.tex / p.17 (932-974)

Corpus linguistics and lexicography

or as a matter of usage; it is the result of the negotiation of the meaning within the discourse community. Indeed this is the difference between natural language words and technical terms. Technical terms are defined by experts, and their meaning is restricted to that definition (and thus, is discourse-external). For instance, if a tree meets the criteria for elm-trees listed in the expert’s definition, it is rightly called an elm-tree no matter what the citations say. Any terminological definition is – at least in principle – an algorithmic instruction for the usage of the relevant term. This explains why it is possible to automatically translate technical texts, provided they are monosemous and only use specialist vocabulary. Lexicographic definitions, on the other hand, are interpretations of citations, that is, results of intentional acts. They cannot automatically be processed from corpus citations, because every citation can be interpreted in various different ways. Therefore, an automatic translation of general language texts is not feasible. Thirdly, corpus linguistics uses the context to distinguish between usages. For example, the collocation klassische Rollenverteilung is not only found in the family context but also at work or in society in general. Its meaning differs according to on the context. Fourthly, corpus linguistics is interested in larger units of meaning, namely, in text segments. The traditional lexicographic practice of decontextualisation and isolation of single words impedes us from knowing the meaning of larger units such as klassische Rollenverteilung. As a rule, the meaning of text segments such as multi word units, collocations or set phrases is far more specific than that of single words. The reason why traditional linguistics is focussing on the single word, isolated from its context, can only be explained by space constraints in the past, as it is impossible to list all collocations and set phrases even in a dictionary consisting of several volumes. But is klassische Rollenverteilung really a true collocation? Is corpus linguistics really able to provide a credible validation of semantic cohesion? Is the co-occurrence klassische Rollenverteilung more than a mere addition of klassisch and Rollenverteilung? In a sufficiently large corpus, if the frequency of klassische Rollenverteilung differs significantly from the statistically expected frequency of this combination, this can be seen as one sign for possible collocation. Another sign would be the occurrence of a special meaning that can not be derived from the sum of the individual meanings of the text elements. For instance, if we find six tokens of klassische Rollenverteilung within the corpus although we would only expect three, given the frequency of the constituents, and if they all suggest that one parent is the wage-earner whereas the other is bringing up the children, then we may regard this co-occurrence as collocation. Finally, corpus linguistic considers meaning as a feature of language, of text elements, segments, and texts, and not as an external feature existing only in the human mind or in reality. The meaning of klassische Rollenverteilung in the context of family is represented in texts, and only there; it is not the reflection of a non-



JB[v.20020404] Prn:15/02/2007; 15:06

F: BCT810.tex / p.18 (974-1027)

 Wolfgang Teubert

textual external reality that we could point our fingers at. There is no meaning outside language, outside the discourse. We know what globalisation means today, because we have read the texts that explain it, but we cannot see globalisation.

Multilingual corpus linguistics When translating a text into another language, we paraphrase the source text. The translation represents the meaning of the original text just like a paraphrase within the source language. Translation requires understanding and thus intentionality. Only if we understand a text can we interpret or even paraphrase it. This implies that different translations will yield different versions of the same text, which again shows that translation or paraphrasing cannot be reduced to algorithmic procedures. The universe of discourse, containing all texts ever translated along with their translations, is the empirical base for multilingual corpus linguistics. It is a virtual universe, and it can be realised by multilingual parallel corpora (or a collection of bilingual parallel corpora). Parallel corpora consist of source texts along with their translations into other languages, whereas reciprocal parallel corpora contain the source texts in two languages along with their translations into the target languages. Just as in monolingual corpus linguistics, meaning is also seen as a strictly linguistic (or better, textual) term here. Meaning is paraphrase. The entire meaning of a text segment within a multilingual universe of discourse is enclosed in the history of all translation equivalents of the segment. The translation unit, that is, the text segment completely represented by the translation equivalent, is the base unit of multilingual corpus semantics. Translation units, consisting of a single word or of several words, are the minimal units of translation. If they consist of several words, they are translated as a whole and not word by word. Therefore, translation equivalents correspond to the text segments of monolingual corpus linguistics. Within the framework of multilingual corpus linguistics, we take that the meaning of translation units is contained in their translation equivalents in other languages. This corresponds to the base assumption of corpus linguistics, which does not regard semantic cohesion as something fixed but as belonging to a large spectrum reaching from inalterable units to text segments whose elements can be varied, expanded or omitted. Identifying these translation units (or text segments) again involves interpretation. The translation shows us whether a given co-occurrence of words is a single translation equivalent or a combination of them, that is, merely a chain of text elements. This leads to two consequences. What can be seen as an integral translation equivalent in one target language

JB[v.20020404] Prn:15/02/2007; 15:06

F: BCT810.tex / p.19 (1027-1094)

Corpus linguistics and lexicography 

may be a simple word-by-word translation in another. This may even be the case within a single target language, depending on the stylistic preferences of different translators. In fact, it is the community of translators (along with the translation critics) who in their daily practice decide what is the translation equivalent, just as the monolingual language community decides what is a text segment. The definition of a translation unit therefore depends both on the target language and the common practice of translation. A virtual text segment is a translation unit only in respect to those languages into which it is translated as a whole. Translation units and their equivalents are not metaphysical entities; they are the contingent results of translation acts. According to the analysis of parallel corpora, more than half of the translation units are larger than the single word – another example of how corpus linguistics may help to investigate the nature of text segments. The meaning of a translation unit is its paraphrase, that is, the translation equivalent in the target language. For ambiguous translation units, this implies that there are as many meanings to the unit as there are non-synonymous translation equivalents. If the phenomenon of meaning is thus operationalised, the meaning of a translation unit depends on the selected target language. A given translation unit in language A may have two non-synonymous equivalents in language B, but three non-synonymous equivalents in language C. Let us look at an example. The English word sorrow (a translation unit consisting only of a single word) will usually be translated into French by one of the three equivalents chagrin, peine or tristesse; the first two, chagrin and peine, are obviously synonymous in a variety of contexts. They both point at a cause for this emotion and, therefore, are sometimes interchangeable with deuil (‘loss,’ the term for the cause). Tristesse, on the other hand, is the variety of sorrow which is not caused by a special incident. In German, there are also three standard equivalents for sorrow, namely, Trauer (caused by loss), Kummer (caused by an adverse incident, intense and usually limited in duration) and finally Gram (caused by unhappiness resulting from an incident, not very intense, more a disposition than a feeling, but often of long duration). Those three German equivalents are neither synonymous with nor corresponding to the three French equivalents. By the way, the different senses of sorrow usually found in English monolingual dictionaries and thesauri corresponds to neither the French nor the German distinctions. The above example of sorrow shows that the concept of synonymy cannot be expressed in an algorithm. To call two expressions synonymous requires a prior understanding of their meaning, that is, an act of interpretation. For instance, if we look at how the Greek verb proséuchomai in the first sentence of Plato’s Republic is translated into English, we will find five different equivalents in eight different translations of this book: to make my prayers, to say a prayer, to offer up my prayers, to worship, to pay my devoirs and to pay my devotions. We, as human

JB[v.20020404] Prn:15/02/2007; 15:06

F: BCT810.tex / p.20 (1094-1137)

 Wolfgang Teubert

beings, must decide whether we consider the Greek verb ambiguous or just fuzzy and whether the relevant equivalents can be seen as synonyms. This is something computers cannot do. The example also shows that the concept of synonymy can only be applied locally, referring to translation equivalents or text segments within a defined context. Although we may assume that Plato’s contemporary audience considered the verb proséuchomai as unambiguous within the above context, this is not the case with native speakers of English, where there is no synonymy between to make my prayers and to pay my devotions. It can be clearly seen that meaning has a dynamic quality and also that the act of translation requires intention and thus cannot be reduced to a mere procedure. We will never find the correct German equivalent for sorrow or the correct English equivalent for proséuchmai just by defining formal instructions for a machine. Before we can translate texts and their elements, we must understand them.

Multilingual corpus linguistics in practice Neither a lexicon derived from a bilingual dictionary nor the supposedly languageneutral conceptual ontologies applied within Artificial Intelligence will solve the problem of machine translation of general language texts. Meanwhile, this fact is acknowledged by the experts. Therefore, they focus on the machine translation of texts written in a controlled documentation language, which is a more or less formal language in which all technical terms are defined unambiguously along with a syntax that rejects all ambiguous expressions as non-grammatical. General language texts written in natural languages cannot be translated without interpretation. Here, multilingual corpus linguistics steers clear of this obstacle in an elegant way. Unlike disciplines such as Artificial Intelligence and Machine Translation, which are based on cognitive linguistics, it does not try to model and emulate mental processes, but instead tries to support the translator by processing parallel corpora. They contain the practice of previous human translation. In these corpora, those translation equivalents that are proven to be reliable and accepted will outweigh equivalents that have been dismissed as inadequate in the long run. If, for instance, proséuchomai is translated as to make my prayers three times out of eight, it may well be assumed that it is an accepted – albeit not the ideal – equivalent within the given context. Parallel corpora are translation repositories. They link translation units with their equivalents. As first studies have shown (Steyer and Teubert 1998), we may assume that 90 percent of all translation units along with their relevant equivalents may be found in a carefully compiled corpus of about 20 million words per language, provided that the text to be translated is sufficiently close to the corpus with regard to text type and genre.

JB[v.20020404] Prn:15/02/2007; 15:06

F: BCT810.tex / p.21 (1137-1193)

Corpus linguistics and lexicography 

Multilingual corpus linguistics does not pretend to solve the problem of machine translation of general language. But it may help the human translator in finding a suitable equivalent for the unit to be translated more efficiently than traditional bilingual dictionaries, because it includes the context even in those cases where the translation equivalent is not a syntagmatically defined collocation but a certain textual element within a sequence. The goal is to select from among all given elements the one whose contextual profile is closest to that of the textual segment to be translated.

Case study 3: The translation into German of sorrow and grief For the two words sorrow and grief, we find three common non-synonymous German translation equivalents: Trauer, Kummer and Gram. An analysis of the contexts of all references of these German words as found in the IDS corpora, based on a method designed by Cyril Belica (see http://www.ids-mannheim.de/cgibin/idsforms/cosmas-www-client), gives us the context profiles listed below. In our example, the number of neighbouring words (i.e. span) has been restricted to 5 words on each side. The context profiles given below have been slightly edited for the sake of clarity. Context profile for Trauer: Wut, Angst, Betroffenheit, Schmerz, Tod, Bestürzung, Freude, Hoffnung, Verzweiflung, Scham; tragen, empfinden; tief, großContext profile for Kummer: Sorgen, Schmerz, Leid, Seele, Freude, Stress, Ärger, Not; bereiten, machen, gewohnt/gewöhnt sein; viel, großContext profile for Gram: Leid, Hass, Bitterkeit, Scham; sterben; gebeugt, lauter, vollIn an English-German parallel corpus we would distinguish between three translations for sorrow and grief : the first group would contain those cases where sorrow or grief is translated by Trauer; the second group where it is translated by Kummer, and finally, the third group where it is translated by Gram. For each of the above cases, we could compute a context profile similar to the ones quoted above for the German words from the IDS corpus. We may assume that the context profile for sorrow and grief, as taken from the parallel corpus, in the case of the translation equivalent Kummer, will not differ much from the context profile for Kummer extracted from the German reference corpus, apart from it being in English instead of German. Unfortunately, a sufficiently large enough English-German parallel corpus that would allow the extraction of English context profiles for German translation equivalents on the basis of recurrence is not yet available. As an alternative, I have searched the Bank of English for those instances of sorrow and grief whose contexts are similar to our context profiles for Trauer, Kummer and Gram. So far these

JB[v.20020404] Prn:15/02/2007; 15:06

F: BCT810.tex / p.22 (1193-1256)

 Wolfgang Teubert

results are not thoroughly convincing: one reason is the different composition of the IDS corpora compared to the Bank of English which results in a clear imbalance of the German and English instances with regard to text type and genre; also, the search criteria for the English contexts have been too narrow, and last but not least, sorrow and grief along with their German counterparts Trauer, Kummer, and Gram belong to an area of vocabulary which is highly culture-specific and is almost impossible to reduce to a common denominator. Still, the following instances taken from the Bank of English show, that in practice, the approach for the detection of equivalents outlined above will function to some extent. The words in square brackets are the German equivalents of the context words contained within the context profiles. (1) Trauer So on the night of the crucifixion I place Simon in the home in Bethany of Mary called Magdalene and her sister Maria. I envision a scene in which trauma, grief, anger [Wut], and despair [Bestürzung] were all present, to say nothing of fear [Angst]. (2) Kummer She enjoys her job though it is full of stress [Stress], sorrow and never-ending challenges. (3) Gram The terrible affliction [Leid] that has fallen so suddenly upon our unhapply country fills and monopolises my thoughts. My soul is full of grief and bitterness [Bitterkeit] and hate [Hass] and vengeance.

Although matching the context of the element to be translated against the context profiles of all possible equivalents may suggest a method for the automatic selection of suitable equivalents, this only works in those cases where we have clear selection-relevant contextual information at our disposal. As stated above, this is not always the case, especially if the text element to be translated is referring to earlier instances within the same text. In these cases, we may assume that, provided the intratextual continuity is sufficiently high, the text element (sorrow or grief in our example) can always be translated by the same equivalent with regard to the target language, be it Trauer, Kummer or Gram. In most cases, whenever a word with a fuzzy, strongly context-dependent meaning appears in a text for the first time, the information needed for the specification of its meaning will be found within the context. Later instances of the word within the text often tend to omit this information as redundant. Within a text, we must find one or two references where a suitable translation equivalent is indicated by the context profile and apply the result to the other instances. This shows that it is imperative to only include complete texts in the corpus.

JB[v.20020404] Prn:15/02/2007; 15:06

F: BCT810.tex / p.23 (1256-1314)

Corpus linguistics and lexicography

Future prospects Corpus linguistics sees itself not in opposition to but as a complement of traditional linguistics. Corpus linguistics helps to make us aware not only of the interaction between text element and context but also of text segments, that is, larger, flexible units whose elements are semantically linked in a certain way: multi-word-units, collocations, set phrases. It explains the repeated co-occurrence of text elements as a discourse phenomenon that can be explored by statistical means, and it makes those co-occurrence patterns visible by a combination of quantitative and categorial devices. The investigation of the context enables us to better cope with words displaying fuzzy meanings, words of the ‘Thespian vocabulary,’ as John Sinclair called them (Sinclair 1996), by generating context profiles as presented above on the basis of sufficiently large corpora. Especially when combining these context profiles with those citations containing a paraphrase of the meaning or aspects thereof (cf. our case study of globalisation), this may lead to descriptions of meaning enabling the user to participate in the discourse. Corpus linguistics distinguishes between text segments on the one hand and text elements embedded in context on the other, depending on how they can be described. Context profiles are only statistically defined. Within a context profile, there is no such thing as an obligatory element that is indispensable within the context of a citation. The lexical constituents of text segments, however, can be defined either as indispensable or as optional. But there is still another difference between the text element with its context profile and the text segment: the latter is defined not only on a lexical but also on a syntactic level. The collocation Kummer gewöhnt ceases to be a collocation as soon as the verb gewöhnt sein is replaced by gewöhnen: Er hatte sich an seinen Kummer gewöhnt is not a collocation. The same applies for collocations such as geheimer Kummer, Kummer bereiten, Kummer und Sorgen. If we change the syntagma or even just the word order (for example, into Sorgen und Kummer), the words lose their collocation character. During the last decades, we have witnessed a growing interest in semantic cohesion, in the special semantic relations between words within sentences and phrases, even in traditional linguistics. Among the relatively new concepts are lexical solidarities, collocations, set phrases, valency, case roles, semantic frames and scripts. They all try to demonstrate that language is more than just the assembling of context-free words using semantics-free rules. The co-occurrence patterns developed by corpus linguistics may help to clarify heuristically the concept of text segments defined by semantic cohesion. When it comes to the identification of text segments, multilingual corpus linguistics holds a privileged position. Within monolingual corpora, this identification is a gruesome task that can only be turned into an automatic procedure



JB[v.20020404] Prn:15/02/2007; 15:06



F: BCT810.tex / p.24 (1314-1377)

Wolfgang Teubert

by a painstaking combination of various procedures based on frequencies, lists or rules. The use of parallel corpora makes it easier to identify text segments (as translation units or equivalents), as they are the true practical results of interpretation and paraphrase. They show what usually takes place within the minds of the speakers without leaving their traces in texts. Parallel corpora, therefore, provide direct access to the translation practice of human translators. If we assume that we may find the meaning of a textual element through its paraphrase, which is also a text, then we may describe parallel corpora as repositories for such paraphrases. Obviously, dictionaries also attempt to list those paraphrases. However, since their size is limited, they need to decontextualise and isolate the lexical units, whereas the paraphrases of translators display the text elements embedded within their contexts, along with whole text segments. Parallel corpus evidence helps us to trace the phenomenon of semantic cohesion. Meanwhile, with the availability of large corpora and improved software for their exploration, corpus linguistics has become part of general lexicography. Linguistics is gradually becoming more interested in larger units of meaning and the use of context for their definition. Also, it is generally accepted that the next generation of dictionaries, both monolingual and bilingual, needs to be corpus-validated, if not entirely corpus-based. But there is more to the corpus linguistic approach. By interactive procedures, the ambitious user should be able to have direct access to corpus evidence instead of being confronted with the subjective findings provided by lexicographers. Such a corpus platform would allow the members of the language community to participate in the social activity of negotiating meanings in a committed and informed way.

References Biber, Douglas; Conrad, Susan; Reppen, Randi. 1998. Corpus Linguistics. Investigating Language Structure and Use. Cambridge University Press. Collins COBUILD. 1987. English Language Dictionary. Editor in Chief: John Sinclair. Deacon, Terrence W. 1997. The Symbolic Species. New York: Norton. Dennett, Daniel C. 1998. “Reflections on Language and Mind.” In: Peter Carruthers/ Jill Boncher (Eds.): Language and Thought. Interdisciplinary Themes. Cambridge: Cambridge University Press, 284–294. Devlin, Keith. 1997. Goodbye, Descartes. New York: Wiley. Fodor, Jerry A. 1975. The Language of Thought. New York: Crowell. Fodor, Jerry A. 1998. Concepts. Where Cognitive Science Went Wrong. Oxford: Clarendon Press. Hellmann, Manfred W. 1992. Wörter und Wortgebrauch in Ost und West. Vol. 1–3. Tübingen: Narr. Herberg, Dieter; Steffens, Doris; Tellenbach, Elke. 1997. Schlüsselwörter der Wendezeit. WörterBuch zum öffentlichen Sprachgebrauch 1989/90. Berlin: Walter de Gruyter.

JB[v.20020404] Prn:15/02/2007; 15:06

F: BCT810.tex / p.25 (1377-1474)

Corpus linguistics and lexicography

Heringer, Hans Jürgen. 1999. Das höchste der Gefühle. Empirische Studien zur distributiven Semantik. Tübingen: Stauffenberg Verlag. Jäger, Ludwig. 2000. “Die Sprachvergessenheit der Medientheorie. Ein Plädoyer für das Medium Sprache.” In: Werner Kallmeyer (Ed.): Sprache und neue Medien. Jahrbuch 1999 des Instituts für Deutsche Sprache. Berlin/New York: de Gruyter, 9–30. Janik, Allen; Toulmin, Stephen. 1973. Wittgenstein’s Vienna. New York: Schuster & Schuster. Keller, Rudi. 1995. Zeichentheorie. Tübingen: Francke. Kjellmer, Göran. 1994. A Dictionary of English Collocations. Based on the Brown Corpus. Oxford: Clarendon Press. Lenz, Susanne. 2000. Studienbibliographie Korpuslinguistik. Heidelberg: Groos. McEnery, Tony; Wilson, Andrew. 1996. Corpus Linguistics. Edinburgh: Edinburgh University Press. Melby, Allen K. 1995. The Possibility of Language. A Discussion of the Nature of Language with Implications for Human and Machine Translation. Amsterdam: John Benjamins. The Oxford-Hachette French Dictionary. 1994. French-English/ English-French. Marie-Hélène Corréard, Valerie Grundy (Eds.). Oxford: Oxford University Press. Pinker, Stephen. 1994. The Language Instinct. New York: William Morrow. Pinker, Stephen. 1999. “Regular habits. How we learn language by mixing memory and rules.” In: Times Literary Supplement, October 29, 1999, 11–13. Renouf, Antoinette (Ed.). 1998. Working with Corpora. Selected Papers from the 18th ICAME Conference. Amsterdam: Rodope. Le Robert & Collins. 1993. Dictionnaire Français–Anglais/Anglais–Français. 4th Edition. Editor in Chief: Beryl S. Atkins. Searle, John R. 1992. The Rediscovery of the Mind. Cambridge, Mass.: The MIT Press. Sinclair, John M. 1996. “The Empty Lexicon.” In: International Journal of Corpus Linguistics I(1): 99–120. Steyer, Kathrin; Teubert, Wolfgang. 1998. “Deutsch-Französische Übersetzungsplattform. Ansätze, Methoden, empirische Möglichkeiten.” In: Deutsche Sprache 4(97): 343–359. Stubbs, Michael. 1996. Text and Corpus Analysis. Oxford: Blackwell. Teubert, Wolfgang. 1999. In: Modelle der Übersetzung – Grundlagen der Methodik. Frankfurt/M.: Lang, 118–135. Teubert, Wolfgang; Kervio-Berthou, Valérie; Windisch, Eric. To be published. Kollokationswörterbuch Adjektive und ihre Begleitsubstantive. Wierzbicka, Anna. 1996. Semantics. Primes and Universals. Oxford: Oxford University Press.



JB[v.20020404] Prn:15/02/2007; 15:09

F: BCT811.tex / p.1 (47-121)

Analysing the fluency of translators ´ Rafał Uzar and Jacek Walinski

The paper discusses problems involved in analysing the quality of student translation and the type of errors made by students in translation. The authors have developed a TEI-lite conformant corpus of student translations which also includes error category mark-up. This project has allowed the authors to objectively analyse student translation work and has also allowed the students themselves to gain valuable insights into translation problems.

. . . there is an intrinsic reciprocal relationship between research in language acquisition and developments in language teaching on the one hand, and language testing on the other. (Bachman 1991: 2)

.

Introduction

The PELCRA project (a joint venture funded by the British Council between the University of Łód´z, Poland and the University of Lancaster, UK) has targeted several interest areas which it has been slowly working on from the inception of the project one and a half years ago. One of our primary goals is to collect and process a ‘large’ corpus of natural Polish in the hope of constructing a Polish National Corpus for use mainly in contrastive studies together with the BNC and other national corpora (CNC, etc.). The second – more fruitful – corpus project to stem from PELCRA is the ‘learner English’ corpus which is used primarily for language teaching and learning. The last piece in the PELCRA jigsaw is the translation element which is to be used for translation studies and the translator training programme that we run at the Department of English in Łód´z. We have combined work in the analysis of fluency in learner English with the problems that our students of translation have together with an analysis of the ‘fluency’, appropriacy, and quality of student translations. Thus, our point of departure was the translation part of the PELCRA project and the possibility of objectively evaluating translations produced by our students. Figure 1 illustrates PELCRA’s three elements and the focus of this paper.

JB[v.20020404] Prn:15/02/2007; 15:09

F: BCT811.tex / p.2 (121-159)

´  Rafał Uzar and Jacek Walinski

Figure 1. The PELCRA project corpora breakdown

The learner and translation corpus components of PELCRA are of maximum interest in our discussion of the fluency of translators. Here we examine this term and how we have managed to combine the learner and translation corpora elements in an analysis of fluency.

. Fluency and accuracy . . . part of the fascination of an investigation of fluency, lies in the fact that the word itself appears to mean many things to many people and is bandied about with an ease and confidence which seem wholly unjustified when individuals are invited to define their terms even a little. (Leeson, R. 1975: 2)

Our aim has been to discover a method whereby we can ascertain how ‘fluent’ a given translation might be, a system by which we can determine the quality of a translation in relation to other similar translations and the native language which in this case was English. The great obstacle here is in establishing a working definition for the terms fluency and accuracy as these are the focus of our evaluations. What is fluency? How can we quantify it? How can we test it? How can we be sure this is a good gauge? The second problem was how to illustrate this term to our translator trainees and justify the claim that one translation can have a higher level of apparent fluency than another. Generally speaking, it is easier to pinpoint a learner’s grammar or appropriacy; but once we talk of fluency, heads begin to shake, and students of English tend to get lost. Everybody knows what fluency means, but defining this precisely and explaining to a learner what is required of them is more difficult. All native speakers know what fluency feels like, but very few can explain it. Fluency has much in common with the elusive native speaker competence which is more often talked about or around rather than described.

JB[v.20020404] Prn:15/02/2007; 15:09

F: BCT811.tex / p.3 (159-200)

Analysing the fluency of translators 

Figure 2. Fluency/Accuracy vis à vis writing questionnaire results

In order to help us towards pinning down these terms and perhaps defining them, a short pilot questionnaire was given to subjects directly involved with EFL, for example, professional translators, English school teachers, departmental lecturers, English language students from both the department and private language schools. In total, 25 questionnaires were collected. The results are (Figure 2). A variety of questions were asked; however, for the purposes of our study, we focused on what subjects thought fluency and accuracy to be only in the context of written language, for example, “is fluency the ability to write quickly?”. It seemed that writing ‘with style’ and ‘with no mistakes’ characterised people’s concept of fluency; therefore, these points were incorporated into our methodology. As with all corpus work, we attempted to map the performance strategies of the individuals involved. This was undertaken through an analysis of translation performance which in turn would lead further in the construction of an idea of learner fluency and competence.

. The methodology The translation students of the English department at Łód´z were asked to produce as professional a translation as possible of two texts from Polish (their mother tongue) to English (FL) and then to store them electronically. The students had exactly one week to complete the task using all and any resources available to them. Each text was different in genre so that we would also have a slightly wider linguistic range and perhaps a more reliable test of the student’s ability in translation. This meant that we were then left with a large number of translations of the same text. This was the first ‘experimental’ step in creating a new element in PELCRA, a corpus of parallel learner translations. This would then allow us to compare individual translations with other similar translations and also with the corpus average and see how far each individual text strayed from the corpus

JB[v.20020404] Prn:15/02/2007; 15:09

F: BCT811.tex / p.4 (200-258)

´  Rafał Uzar and Jacek Walinski

Figure 3. The learner translation corpus

median. This mini-corpus was a sum representation of the performance strategies of our translation students. . The aims Before beginning, we defined four aims that we wanted to fulfil through the compilation and application of the ‘learner translation’ corpus: Objectivity: Objective techniques for translation evaluation are needed. By having a corpus of similar texts, we were able to evaluate students’ relative performance when translating into the FL. This allowed us to undertake the following: a) assess the general level of our translation students b) assess and compare between particular groups c) assess individual translators Equivalence multiplicity: With the ‘learner translation’ corpus resource, we were able to show prospective translators how to render the same idea from the SL into the TL (FL) in many different ways. This in turn gave us the opportunity to compare individual items across the whole corpus and grade different ways of translating a given item. Corpus linguistics: By producing such a corpus and making it available to the very same students (without any knowledge of corpus linguistics) who had given us their translations, we could make them aware that they have another resource available to them as translators, the corpus in its many guises. Error analysis: Finally, we could highlight errors typical for Polish students writing in English so that these students could avoid them in future translation work. The students were presented with two short texts. The first was an extract about EU integration taken from LOT’s in-flight magazine, “Kaleidoscope.” The second was an introductory extract about the EU and the Phare project taken from the European Commission’s document on the Phare programme. These particular

JB[v.20020404] Prn:15/02/2007; 15:09

F: BCT811.tex / p.5 (258-313)

Analysing the fluency of translators 

texts were used because both English and Polish versions have been made publicly available and translated by ‘allegedly’ professional translators. The subject matter therefore was similar (i.e., the EU); however, the genres were very different. The LOT text target audience was undoubtedly less formal whereas the Phare text was for more discerned readers in the business of trying to acquire funding from the European Union and becoming acquainted with the workings of Phare. . Problem areas Translating into a FL is more difficult; therefore, it is no surprise that our translation trainees have problems translating into English. In the proceedings of TALC 96, Granger and Williams produced some interesting research in error classes in their work with learner corpora error analysis presented in the pie-chart below. Although these results are not fully adequate for our purposes, they do tell us the huge importance of lexis and grammar for the FL learner along with register, form, and style. Lexis and grammar can be easily dealt with in the initial stages of learning (let us say beginners to FCE level). However, style and register remain problematic well into the higher levels of language learning (advanced to proficiency). It is also generally assumed that students of different levels make

Figure 4. Error category breakdown (adapted from Granger and Williams, TALC 1996) Error Class Lexical/grammatical Stylistic L1-motivated Miscellaneous/unclass

FCE Sample 58.33% 31.25% 8.33% 2.08%

Prof. Sample 52.63% 36.84% 0% 10.52%

Figure 5. Frequency of different classes of error (adapted from Botley, S. and R. Uzar 1998b)

JB[v.20020404] Prn:15/02/2007; 15:09

F: BCT811.tex / p.6 (313-369)

´  Rafał Uzar and Jacek Walinski

different kinds of errors. A comparison of errors in FCE level students and proficiency students follows in Figure 5. What is important to us here is the fact that advanced students make proportionally fewer lexical/grammatical errors. Advanced and proficiency level students do not have problems with the meanings of individual words or basic English grammatical structures; however, these students do have problems characteristic for their level. These we have targeted as being: – – –

style and register idiom grammar (elaborate constructions).

. Annotation In order to later retrieve information valuable to our qualitative analysis of the translations, the data needed to be enriched using tags. This was included so that we could then use the tags as target points in later frequency lists and concordancing work. We introduced three levels of annotation to the corpus. .. The first level We incorporated the encoded name and the group of the translator into the corpus header so that we could later evaluate and compare between individuals and groups. The names of the students were erased from the translation and substituted with letter-number mnemonics to avoid any personal prejudice when evaluating them. The respective names could be retrieved, once the evaluation had been finished. .. The second level All texts were marked with sentences and paragraphs to make our corpus TEI-Lite conformant. .. The third level As native speakers of Polish and English, we identified thirty potentially problematic elements in the texts, called ‘hot spots’ which we marked (tagged using angle brackets) sequentially from 1 to 29 adding the specific problem area, that is, idiom, grammar, style to the tag. These hotspots ranged from single words through phrases to whole sentences. Thus we used: idiom related tags – <1id> WORD grammar related tags – <2gr> PHRASE style and register related tags – <3st> SENTENCE

JB[v.20020404] Prn:15/02/2007; 15:09

F: BCT811.tex / p.7 (369-556)

Analysing the fluency of translators

By applying these tags, we were able to retrieve two kinds of important information: a) all 76 translations of a particular item, for example, the 13th hotspot; b) all translations of items belonging to the same linguistic category, for example, all idiom-related hotspots. .. The fourth level Our next level of annotation was to mark the assumed difficulty level of a particular hotspot on the basis of it being either high or low. .. The fifth level The fifth element in the mark-up of our corpus was to tag the register of a particular text, that is, whether it was formal or informal. This particular tag has been added for future use so that when the corpus is expanded to contain a large variety of texts, one will be able to concordance for specific types of hotspots, for example, all idioms used in informal language. . Evaluation on the basis of comparisons Having annotated the corpus, we then moved onto grading the students’ solutions. Using WordSmith (Scott, M. 1998), we were able to gain access to all the translations of a given element at the same time and obtain a wide spectrum of solutions for a given hot spot and could then base our evaluation on the basis of direct comparisons between what different students came up with. It should be made clear at this point that we were looking at the relative quality/fluency of translations and evaluating them in the contexts of the whole corpus, that is, best to worse. We cannot exclude the possibility that better solutions/translations were conceivable; however, we were measuring the solutions given in our corpus only. As Coulthard (1983: 3) rightly says: It became evident [with the course of time and development in linguistics] that there was not in fact a uniform native speaker competence; it became necessary to talk of degrees of grammaticality or acceptability . . .

In the same way, we looked at the degrees of fluency available to the entire sample, that is, the 76 students. In order to grade our students’ output, we used a simple two-point scale for each hotspot: + a good and appropriate translation of the item – a vague or inappropriate translation of the item Following the scale guidelines, we then applied our grading system to all individual hotspots. All the grades have been incorporated into the corpus in



JB[v.20020404] Prn:15/02/2007; 15:09

F: BCT811.tex / p.8 (556-591)

´  Rafał Uzar and Jacek Walinski

N Concordance 1 <19#Id#Hi#For>benevolent organisations, either public or private ones 2 <19#Id#Hi#For>charity organisations, whether public or private 4 <19#Id#Hi#For>non profitable organizations; <s>public and private 5 <19#Id#Hi#For>non-financial public and private organisations 7 <19#Id#Hi#For>non-paid public and private organizations 8 <19#Id#Hi#For>non-earning public and private organizations 9 <19#Id#Hi#For>non-paid public as well as private organizations 13 <19#Id#Hi#For>non-profit making organisations, public and private 24 <19#Id#Hi#For>non-profit organisations, public and private 25 <19#Id#Hi#For>non-profitmaking organisations, public as well as private 26 <19#Id#Hi#For>non-profit organisations, public, or private 30 <19#Id#Hi#For>non-profitable private and public organisations 31 <19#Id#Hi#For>non-profit, public and private organisations 41 <19#Id#Hi#For>nonprofit organizations, both public and private 42 <19#Id#Hi#For>private and public charity organisations 46 <19#Id#Hi#For>public and private non-profit organizations 49 <19#Id#Hi#For>public and private unpaid organizations 50 <19#Id#Hi#For>public or private non-profitable organisations 51 <19#Id#Hi#For>public or private non-profitable organisations 52 <19#Id#Hi#For>the non profit, public and private organisations 53 <19#Id#Hi#For>the private and public charity organisations 54 <19#Id#Hi#For>the unpaid, both public and private, organisations 55 <19#Id#Hi#For>uncommercial, public and private organisations 56 <19#Id#Hi#For>unpaid organisations, public and private,«/19> 64 <19#Id#Hi#For>unpaid work organizations, public and private 65 <19#Id#Hi#For>unpaid, public and private organisations 68 <19#Id#Hi#For>unpaid-work organisations, public and private 69 <19#Id#Hi#For>unprofitable organisations - both public and private 72 <19#Id#Hi#For>voluntary, public and private organisations 73 <19#Id#Hi#For>unprofitable, public and private organisations 74 <19#Id#Hi#For>voluntary, public or private organisations 76 <19#Id#Hi#For>welfare, public and private organisations

Figure 6. Sample concordances of hotspot No. 19

the form of a special <Score=??> tag next to each hotspot (the sixth level of annotation) and a ‘total score’ tag in the header. . The practical application of the corpus – the results Our initial goal of evaluating students objectively gave us the following results. The assumed ideal level of translation would be 29 points (29 hot spots correctly translated with no errors). The individual results range from 13 points (the ‘best’) to –15 points (the ‘worst’ translation). The results were summed and averaged

JB[v.20020404] Prn:15/02/2007; 15:09

F: BCT811.tex / p.9 (591-628)

Analysing the fluency of translators 

Figure 7. An example of an annotated ‘hotspot’ Table 1. Group scores Group Top Score Lowest Score Average Score

A

B

C

D

E

9 –13 –2.6

13 –15 –0.2

12 –10 0

11 –15 –3

11 –15 –3.53

between particular groups so we could evaluate their quality and apparent fluency levels. As can be seen from the table, the three third year groups (A, B, C) were on average decidedly better than the two fourth year groups (D, E). Not only did this test inform us about individual levels but it can also be applied to years as well as groups. Our initial ‘hunch’ that the third year group was better was shown to be true. This information can be fed back to the students or used to isolate teachers teaching various courses. Our second goal of making students aware of the many different ways of rendering one idea into a foreign language was fulfilled by the students’ own examples of various translations of the same item. Students when presented with a list of possible translations (Figure 6) and asked to pick an appropriate translation generally chose the best possibility and were aware why the others were not as appropriate even though the same group of students had produced weaker translations. Our effort to acquaint students with the corpus and corpus tools as a resource for translators proved successful. The reusability of the corpus has since proved valuable to other students in the department and outside of it as an aid for the analysis of translation. .. Future work a) All errors that occurred in the translations including those contained within the hot spots and those we did not anticipate will be tagged in order to fully evaluate our students’ performance with <Er> tags. A special

JB[v.20020404] Prn:15/02/2007; 15:09

F: BCT811.tex / p.10 (628-696)

´  Rafał Uzar and Jacek Walinski

penalty, –2 points for each error will be added for errors too blatant for this level, that is, spelling mistakes. At the moment, we focus ONLY on hotspots, but later work will also focus on the text as a whole including errors outside hotspots and the general cohesiveness of a given translation. b) A five-point grading scheme to be graded by two native speakers rather than one is to be added in the near future in order to have a more reliable grading system and also as an interesting look at how different native speakers view the same text and the extent to which a given term can be termed appropriate, fluent, or accurate. The scale breaks down as follows: 2 an excellent translation ideally expressing the original meaning; 1 a very good translation of the item although it could be expressed better; 0 appropriate; –1 vague translation, the original idea was not fully conveyed; –2 inappropriate translation, original idea was not expressed. With these suggestions in mind and with the work already completed, we feel this is an excellent method for analysing the quality of translation together with the effectiveness of a particular teacher. With a fast growing corpus (we are already processing a further 80 texts from a different set of students from a different institution) and the possibility of tightening up our grading system, the analysis of translation quality and language processing has reached a new level at the University of Łód´z.1

References Bachman, L. F. 1991. Fundamental Considerations in Language Testing. Oxford: Oxford University Press. Proceedings of Teaching and Language Corpora 1996. UCREL Technical Papers. Lancaster University Press. Botley, S and R. Uzar. 1998a. “Higher quality data-driven learning through the testing of definite and indefinite articles.” Conference Proceedings of Teaching and Language Corpora 24–27 July 1998. Oxford: Oxford University Press. Botley, S. and R. Uzar. 1998b. “Investigating learner English anaphora the PELCRA way.” Proceedings of Discourse and Anaphora Resolution Conference 1–4 August 1998. Lancaster: Lancaster University Press. Coulthard, M. 1983. An Introduction to Discourse Analysis. London: Longman. Leeson, R. 1975. Fluency and Language Teaching. London: Longman.

. Should anyone be interested in more detailed results or the corpus itself, please feel free to ´ ([email protected]). contact Rafał Uzar ([email protected]) or Jacek Walinski

JB[v.20020404] Prn:15/02/2007; 15:09

F: BCT811.tex / p.11 (696-730)

Analysing the fluency of translators 

Scott, M. 1998. WordSmith: Software Language Tools for Windows. Oxford: University Press. Uzar, R. 1997. “Was PELE a linguist?” In Lewandowska-Tomaszczyk, B. and P. J. Melia, Practical Applications in Language Corpora, Łód´z, Poland 10–14 April 1997. ´ Uzar, R and J. Walinski. 1999. “A Comparability Toolkit.” In Lewandowska-Tomaszczyk, B. and P. J. Melia (eds.), Conference Proceedings of Practical Applications of Language Corpora 1999 (forthcoming). Peter Lang.

Original Texts Kaleidoscope (1998): Polish LOT airlines in-flight magazine. December 1998 edition. What is Phare? (1994): European Commission, Phare Information Office: Brussels. Co to jest Phare? (1994): Komisja Europejska, Biuro Informacyjne Phare: Bruksela.

JB[v.20020404] Prn:15/02/2007; 16:03

F: BCT812.tex / p.1 (45-143)

Equivalence and non-equivalence in parallel corpora* Tamás Váradi and Gábor Kiss

The present paper shows how an aligned parallel corpus can be used to investigate the consistency of translation equivalence across the two languages in a parallel corpus. The particular issues addressed are the bidirectionality of translation equivalence, the coverage of multiword units, and the amount of implicit knowledge presupposed on the part of the user in interpreting the data. Three lexical items belonging to different word classes were chosen for analysis: the noun head, the verb give, and the preposition with. George Orwell’s novel 1984 was used as source material as it available in English-Hungarian sentencealigned form. It is argued that the analysis of translation equivalents displayed in sets of concordances with aligned sentences in the target language holds important implications for bilingual lexicography and automatic word alignment methodology.

.

Introduction

One of the stumbling blocks to machine translation is that words rarely stand in stable one-to-one correspondence with each other across languages. Instead, they typically have a ramified set of senses that would be rendered with a set of different lexemes in the other language. Bilingual dictionaries can give only limited help in finding the appropriate lexical correspondences because they provide very little information as to which alternative would be suitable in the given context. In fact, as has been recently pointed out by Wolfgang Teubert (1999), bilingual equivalence between dictionary entries is very often not bi-directional. As a useful complement to bilingual dictionaries and conceptual ontologies, Teubert recommends the * The research reported in the paper was supported by Országos Tudományos Kutatási Alapprogramok (grant number T026091)

JB[v.20020404] Prn:15/02/2007; 16:03

F: BCT812.tex / p.2 (143-184)

 Tamás Váradi and Gábor Kiss

corpus linguistic approach which he illustrates through data from monolingual corpora. In this paper, we explore this line of research by adducing evidence from parallel corpora.

. The problem with bilingual dictionaries Even if available in electronic format, ordinary dictionaries created for human users present certain problems for natural language processing (Boguraev and Briscoe 1989). Some limitations derive from the fact that lexicographers inevitably rely on the co-operation of the readers to exploit the information compiled in the body of dictionary entries. One source of relatively low level difficulties is that all dictionaries make shortcuts in presenting the data in an effort to compress the maximal amount of information in the available space. Users are expected to decode and apply the formatting conventions that are usually set out in an introductory section. This is a task that is not always trivial to automate as was reported by the CONCEDE project for several dictionaries (Erjavec et al. 1999). More serious than this procedural difficulty are the general deficiencies in content. As was demonstrated by Teubert (op.cit.), bilingual dictionaries tend to give a list of equivalents with very little help (apart from usage notes) as to which one is to be used in the particular context the dictionary user needs. There is, furthermore, an undue preponderance of single-word equivalents at the expense of multi-word units. When one tries to look up the equivalents in the other side of a bilingual dictionary, one finds surprisingly few bidirectional equivalents. Teubert presents the situation graphically in Figure 1 through the analysis of the semantic field Trauer Kummer Gram Leid Betrübnis Schmerz Pein [Niedergeschlagenheit] [Klage] Jammer Sorge [Trübsal]

grief sorrow mourning affliction distress pain smart [suffering] [ruefulness]

Figure 1. German-English translation equivalents related to Trauer (based on Teubert 1999)

JB[v.20020404] Prn:15/02/2007; 16:03

F: BCT812.tex / p.3 (184-230)

Equivalence and non-equivalence in parallel corpora 

associated with the German word Trauer. The figure was arrived at by first looking up the equivalents of Trauer in Langenscheidt Enzyklopädisches Wörterbuch and successively looking up the equivalents found in one language in the other side of the dictionary until the senses started to become remote from the semantic field of the original word. (Bidirectional equivalents are printed in grey, unidirectional in black.)

. The rationale for the present work In the present paper, we intend to investigate to what extent use of parallel corpora can help to eliminate some of the difficulties noted with bilingual dictionaries. It was assumed that parallel corpora are amenable to the same procedure of traversal of translation equivalents. At the same time, the data are sufficiently different in key aspects to warrant the replication of the methodology. In particular, we set out to investigate what picture emerges from parallel corpora with regard to a) consistency of coverage (bidirectional vs. unidirectional equivalences), b) coverage of multiword units, collocations, c) the amount of user knowledge presupposed, and d) what implications are there for bilingual lexicography and NLP, for example, automatic word alignment.

. Methodological issues . Data and encoding scheme As source data, we used the sentence aligned Hungarian-English parallel corpus of Orwell’s 1984 developed in the MULTEXT-EAST project (Erjavec and Ide 1998). The corpus was processed with the IMS Corpus Workbench system (Christ 1994) developed by the Institut für maschinelle Sprachverarbeitung of Stuttgart University. We used a slightly simplified version of the corpus encoding scheme developed by Erjavec (1999) for the ELAN Slovene/English. Figure 2 shows a sample of the corpus annotation. There were only two tags used to mark up the structure of the texts: (translation unit) and <s> (sentence) with their respective id attributes. As against the relatively simple structural mark-up, the annotation attached to each token was substantially richer in content. Each line of text included the word form, lemma, corpus tag, and the morphosyntactic description, arranged in a tabular format. While the IMS Workbench Tool is somewhat limited in handling SGML tagged corpora, it offers remarkable facility in handling linguistic mark-up associated to each token. Technically, the two languages of the parallel corpus are stored in separate files, and the alignment

JB[v.20020404] Prn:15/02/2007; 16:03

F: BCT812.tex / p.4 (230-254)

 Tamás Váradi and Gábor Kiss

translation unit (domain of alignment)

<s id="Oen.1.1> It it PPER3 Pp3ns was be AUX1 Vais1s a a DINT Di bright bright ADJE cold cold ADJE Afp day day NN Ncns in in PREP Sp April April NN Ncns , , COMMA and and CCOO Cc-n the the DETR Dd clocks clock NNS Ncnp were be AUX Vacs striking strike PPRE thirteen thirteen CD . . PERIOD

Af

Vmpp Mc

Figure 2. The encoding of the data

Figure 3. A sample query output

is made with reference to the respective offset figures of the corresponding translation units. Figure 3 shows a sample output of a query. The query string [hH]ead.*:OHU [lemma="fej"] is a regular expression designed to retrieve all occurrences of any inflected forms of the word head (whether it begins in lowercase or uppercase) where the corresponding alignment unit in the Hungarian corpus OHU includes the lemma fej.

JB[v.20020404] Prn:15/02/2007; 16:03

F: BCT812.tex / p.5 (254-308)

Equivalence and non-equivalence in parallel corpora

It should be noted that because the two parts of the parallel corpus were aligned at the sentence level, it was not possible to establish automatically whether the two words in the search expressions were actually translation equivalents. All that can be stated with certainty is that the aligned sentences contained the two words in question. The corresponding parts of the aligned sentence pair had to be related manually. . The analysis We have selected three English lexemes for analysis that is, head, give, and with. By focussing on three words of so radically different parts of speech, we intended to examine whether our findings were sensitive to word class membership. In schematic form, we carried out our analysis through the following steps. (1) Find prototypical equivalents for the English words. For each of the three words, it was easy to find single uncontested candidates for this status: head = fej, give = ad, with = -val/-vel. (-val/-vel are variant forms of the instrumental suffix governed by vowel harmony.) Below we will refer to members of prototypical equivalent pairs simply as L1 word and L2 word (2) Generate three concordance sets where (1) L1 word is translated with L2 word (2) L1 word is translated with non-L2 word (3) L2 word is translated with non-L1 word To automate this step, a perl script was developed that produced the three sets from two words specified on the command line. (3) Repeat step 2) with other L1 and L2 words from 2b) and 2c) until the semantic field of the original English lexeme seems to be saturated. . Limitations of the approach Our analysis inevitably faced certain limitations owing to the scope and nature of the source data used. One immediately obvious constraint was the size of the data, which was about a hundred thousand words. While monolingual corpora do exist for Hungarian, the Hungarian National corpus currently number more than 80 million words (Váradi 1999), parallel corpora are much harder to come by even for other language pairs let alone for Hungarian-English. Another such corpus is Plato’s Republic developed as a TELRI joint research effort, but the Hungarian English alignment was not available at the time the work reported here was undertaken.



JB[v.20020404] Prn:15/02/2007; 16:03



F: BCT812.tex / p.6 (308-364)

Tamás Váradi and Gábor Kiss

Apart from size, another practical limitation was the rather limited language variety used for source data. Again, this problem could be remedied with the extension of the data not just in terms of size but also in register. A third peculiarity of the data that one must bear in mind is its inherently unidirectional nature. Although it is tempting to look at the aligned sentences from either direction, it still remains true that for any pair of sentences, one is the source and the other is the target of the translation. Hence, we simply cannot speak of ‘the translation equivalents’ of any Hungarian word in our data. This deficiency could be compensated by involving data that are translations of Hungarian source texts though this could raise all sorts of issues about how close a match there is between the two source texts. In this sense, there is hardly any genuine bidirectional parallel corpus.

. Findings . Prototypical vs. other equivalents Table 1 presents the statistical summary of our findings. The figures afford several interesting conclusions. It appears that the ‘fit’ between the actual translation equivalence and the presumed prototypical equivalents, as measured in the ratio of the prototypical cases within the total, varies with the language as well as the word class if not the individual lexeme. For example, while close to 70% of the instances of head were rendered with the expected prototypical equivalent fej in Hungarian, the same ratio for with is 54% and give is translated with ad in less than 25% of the cases. If we look at all occurrences of the Hungarian equivalents, we find the same ranking of the items in terms of the ratio of the prototypical equivalent to all other translation equivalents (fej, -val/-vel, ad) but at a higher level (80%, 44%, 38%). Recall that given the unidirectional nature of our text data, one should interpret the Hungarian figures for fej, for example, as the number of times fej was used as the translation of head vs. other English words. . A close up profile Figure 4 shows the distribution of all the translation equivalents of the word head found in our data. The figure on the left shows a complete listing of the equivalents summarized in Table 1. It traces the corresponding items in both directions at one level of depth. The graphic display of the ‘other’ variants make it immediately clear that the spread of the Hungarian translation equivalents of head is much broader than the range of words rendered as fej. Except for the last two cases, all uses of the word head made reference to the body part in a non-metaphorical sense. It is

JB[v.20020404] Prn:15/02/2007; 16:03

F: BCT812.tex / p.7 (364-441)

Equivalence and non-equivalence in parallel corpora

3

45 4 3

2

3 3

head mind skull face pop up 4

0

fej

’head’

bólint ’nod’ tarkó ’back of the head’ agy ’mind, brain’ eszébe jut ’come to mind’ koponya ’skull’ haj ’hair’ száj ’mouth’ arckép ’portrait’ felfigyel ’listen up’ saját maga’by one’s own effort’ osztályvezetõ ’department head’ létra vége ’head of the ladder’

head

3

mind 2

skull

45 4 3 33 3 3 6 11 11 6 5 6 7

fej bólint tarkó agy eszébe jut lélek tudat gondolat elme

’head’ ’nod’ ’back of the head’ ’mind, brain’ ’come to mind’ ’mind, soul’ ’consciousness’ ’thought’ ’mind’

koponya ’skull’ csontváz ’skeleton’

Figure 4. Tracing the translation equivalents of ‘head’ Table 1. The actual distribution of the assumed prototypical translation equivalents Head other total fej 45 11 56 other 21 total 66

give other total ad 26 43 68 other 79 total 105

with other total -val/-vel 337 412 759 other 285 total 622

interesting to note that some uses were rendered with a verb or verb phrase (eszébe jut – come to one’s head, felfigyel ‘listen up’ – raised his head). Also note the number of cases (4) where there was no English source for fej at all. The diagram on the right in Figure 4 traverses the links between translation equivalents one step further, eliminating, for clarity, all the cases with a single occurrence. Figure 5 displays some instances where the Hungarian equivalents of head (i.e., száj ‘mouth’, saját maga ‘by one’s own effort’) may seem odd when viewed purely at the lexical level. On the strength of the English examples alone, it is easy to see that there is nothing contrived about the uses involved here; yet, any lexicographer would probably feel reluctant to introduce such equivalence as head – száj ‘mouth’ in a dictionary. The richness of variety of data brought to the fore with this method is further illustrated by Figure 6 displaying the different expressions rendered as eszébe jut ‘come to mind’. It is important to note that none of the corresponding items is a single-word unit. When we turn to the verb give, we find that it is practically impossible to even describe the bare items standing in correspondence without making recourse to multi-word expressions. The data in Figure 6 clearly argues for the importance of the context in defining the units that enter into a bilingual



JB[v.20020404] Prn:15/02/2007; 16:03

F: BCT812.tex / p.8 (441-451)

 Tamás Váradi and Gábor Kiss

head - haj ‘hair’ : He plucked at Winston’s and brought away a tuft of hair. -->ohu: Belemarkolt Winston hajába, és kihúzott egy csomót. head -- száj ‘mouth’ : And the few you have left are dropping out of your . -->ohu: S az a néhány is, ami még megvan, kiesik a szádból. head -- saját maga : A great deal of the time you were expected to make the up out of your -->ohu: Az ember gyakran saját maga volt kénytelen kitalélni. head -- felfigyel : Winston raised his to listen -->ohu: Winston felfigyelt.

Figure 5. A sample of the translation equivalents in context give was given as give sb. the impression give a glance give his name give way to give off (smell) give one away give up trying at any given moment by a given date with no reason given don’t give a damn

kap ‘get, receive’ jelölték meg ‘was marked’ az volt az érzése ‘had the feeling’ körülnézett ‘looked around’ megmond ‘say’ következett ‘followed’ áraszt ‘ooze’ elárulhat ‘could betray’ felhagynak ‘abandon/cease to do’ mindig ‘always’ záros határidõn belül minden indoklás nélkül senkinek sem akarunk ártani

Figure 6. Translation equivalents of give

equivalence relationship. Note that without considering the context in which the translation equivalents occur, one may get paradoxical correspondences like give = kap ‘get’. Such cases can only be interpreted if the different organisation of the sentences in which they occur are also considered. In other cases, it is enough to make reference to the typical objects with which the item co-occurs. One feature in Hungarian that provides for a proliferation of equivalents in English is the presence of the coverb particle that creates new meanings of the verb stem that are often rendered in English with a separate lexeme. Examples include kiad – publish/issue, átad – hand over, elad – sell, etc.

JB[v.20020404] Prn:15/02/2007; 16:03

F: BCT812.tex / p.9 (451-501)

Equivalence and non-equivalence in parallel corpora

. Conclusions We do not have the space here to discuss the data uncovered by the analysis in the detail that it clearly merits. However, we are positive that the evidence presented above is sufficient to draw the following conclusions. The corpus linguistic approach advocated by Teubert has received ample corroboration from the bilingual corpus evidence presented. On the issue of the bi-directionality of equivalents, we also did not find that the set of bilingual equivalents formed a closed set. However, this may well be due to the sparseness of data used in this pilot experiment. Our findings have important implications for bilingual lexicography. The most important point to note is the vital need to integrate corpus evidence. Enriching the dictionary with contextual evidence serves to eliminate several shortcomings noted earlier. As the data is embedded in context, it will almost inevitably bring with it guidance as to how the particular item is to be used. Showing usage through examples will also obviate the need for terse and highly abstract formulations. True, the intuitions of the dictionary users are required here too, but developing intuitions through actual language data is a task that the average human user is better equipped to handle than dealing with an arid list of bilingual equivalents. More extensive direct integration of the context should also narrow the current gap between lexical and textual equivalence. We have presented numerous examples for translation equivalents that make perfect sense in the particular context they are used which, however, may seem puzzling if not downright false when viewed out of context. Any attempt to base definitions on real translation equivalence will result in more numerous use of multi-word expressions simply because most of the time it is just not feasible and certainly not practicable to tease out some single word and equate it with another one in the target language at a relatively abstract level. The data presented in the paper clearly suggests that one can only do justice to the intricate and rich texture of context if the two languages are related not at the lexical level of the word but rather at the contextual level embodied in phrases. The difficulties of pinning down bilingual equivalence on the individual words also has implications for automatic word alignment methodology. No matter how wide a window one establishes within which to scan for equivalents, as long as the search is centred on individual words, the procedure is faced with serious limitations.



JB[v.20020404] Prn:15/02/2007; 16:03

F: BCT812.tex / p.10 (501-545)

 Tamás Váradi and Gábor Kiss

References Boguraev, B. and E. J. Briscoe. 1989. Computational Lexicography for Natural Language Processing. London and New York: Longman. Christ, Oliver. 1994. “A Modular and Flexible Architecture for an Integrated Corpus Query System.” Papers in Computational Lexicography COMPLEX’94, ed. by Kiefer et al., 23–32. Budapest: Linguistics Institute, Hungarian Academy of Sciences. Erjavec, Tomaž and Nancy Ide. 1998. “The MULTEXT-East corpus.” First International Conference on Language Resources and Evaluation, LREC ’98, ed. by Rubio et al., 971–974. Granada: ELRA. Erjavec, Tomaž. 1999. “Making the ELAN Slovene/English corpus.” Proceedings of the Workshop Language Technologies – Multilingual Aspects, ed. by Špela Vintar., 23–30. Ljubljana: Department of Translation and Interpreting. Erjavec, Tomaž, Dan Tufiš, and Tamás Váradi. 1999. “Developing TEI–Conformant Lexical Databases for CEE Languages.” Papers in Computational Lexicography COMPLEX’99, ed. by Kiefer et al., 205–209. Budapest: Linguistics Institute, Hungarian Academy of Sciences. Teubert, Wolfgang. 1999. “Starting with Trauer. Approaches to Multilingual Lexical Semantics.” Papers in Computational Lexicography COMPLEX’99, ed. by Kiefer et al., 153–169. Budapest: Linguistics Institute, Hungarian Academy of Sciences. Váradi, Tamás. 1999. “On Developing the Hungarian National Corpus” in Vintar Špela (ed.): Proceedings of the Workshop Language Technologies – Multilingual Aspects, 32nd Annual Meeting of the Societas Linguistica Europea, Ljubljana, 57–63.

JB[v.20020404] Prn:14/05/2007; 10:06

F: BCT8IND.tex / p.1 (47-153)

Index

A adjective , , , ,  algorithmic procedure ,  alignment , –, , , , –, , , , , , , , –, , –,  software ,  annotation –, , , , , , –,  automatic extraction –, , , , 

B basic form ,  basic vocabulary  bilingual dictionary , , , , , , , , –, ,  bilingual lexicography , , , , , , , ,  bridge dictionary , , , , , 

C character , , , –, , , , ,  string , , ,  Chinese character ,  Cobuild , , , , , , , ,  cognitive approach  cognitive linguistics , , ,  cognitive science  cohesion , , , , , , , , ,  collocation , , , , , , , 

community , , , , , , , , , , , , , ,  comparable corpus  computational lexicography  computational linguistics , , ,  concept , , , , , , , , ,  concordance , , , , , , , , , , , ,  content , –, , , , , , ,  context , , , –, , , , , , , , , , –, –, , , , – pattern ,  profile , – contrastive , ,  contrastive analysis  contrastive information  contrastive linguistics ,  core vocabulary  corpus , , –, , , , –, , –, , , , –, , , –, –, –, , , , , –, , , –, , , –, –, , , , , –, –, –, –,  encoding  scheme  standard ,  evidence , , , , , ,  linguistics , , , , , , , , – retrieval  software 

tool  typology  corpus-based approach ,  corpus-based lexicography  Croatian , , , , , –, ,  Czech –, , , , , , – D definition , , , , , , , , , , , , , ,  determiner , , ,  dictionary , , , –, , , , –, , , , –, , , –, , , –, , , , , , –, ,  discourse , , , –, , , –,  analysis ,  community  DTD , –, , ,  E electronic dictionary , ,  empty lexicon , ,  encyclopaedic knowledge ,  English , , –, , –, –, –, –, , , , , , , , , , , , , –, , , –, –, , , –, , , , –, –, , – entropy ,  entry , , , , , , –, 

JB[v.20020404] Prn:14/05/2007; 10:06



F: BCT8IND.tex / p.2 (153-261)

Index

equivalence , , , , , –, , , , , – error , , , ,  analysis ,  category ,  European languages , , , , , , , ,  expression , , , ,  F fiction , – Finnish , , , –, – form , , –, , , , , , , , , , , –, , , , , , , , , , , , , , , ,  frame ,  French , , , , , , , ,  frequency , –, , , , , , , ,  dictionary  list  fuzzy concept  fuzzy meaning  G genre , , , , , ,  German , , , , , , , , , , , , , , –,  grammar , , , , , , , –, , ,  rule  word  grammatical information ,  grammatical pattern  H headword , , –, – human translation  Hungarian , , , , , – hybrid approach , , ,  I information theory 

intentionality , , ,  interpretation , , , , , –,  Italian –, , –, –, , , ,  J Java ,  K keyword – Kollokationswörterbuch  L language –, –, , , , , –, –, , –, , –, , , –, –, , , –, –, –, , , , , –, , ,  pair , , , , ,  structure , ,  lemma , , ,  lexeme , ,  lexical cohesion  lexical coverage  lexical database , , , , ,  lexical information , , ,  lexical item , ,  lexical level , ,  lexical meaning  lexical semantics , ,  lexical syntax  lexical word  lexicographic practice  lexicography –, , , , , , , –, , , , , , ,  lexicology ,  lexicon , , , , , , , , , , , , ,  linguistic community  linguistic computing  linguistic knowledge ,  linguistic sign , ,  linguistics , , , , , , , , , –, , , –, , , ,  Lithuanian , , , , , , , , 

M machine translation , , , , , , ,  markup , ,  meaning , , , , , , , , –, –, , , –, –, –,  metalingual content  method , , , , , , , , , , , , ,  model , , , , , , , , –, , , , , ,  modifier , , ,  monolingual corpus , ,  monolingual dictionary , ,  morphological information  monolingual lexicon – morphosyntactic ,  MULTEXT-EAST  multilingual corpus , , , ,  multilingual database ,  multilingual dictionary  multilingual document  multilingual information ,  multilingual lexical database  multilingual lexicography , , , ,  multilingual resource , , ,  multiword unit , 

N n-gram , – neologism – network , , –, ,  noun , –, , , , , –, , , , , , ,  group , , , , ,  phrase , , , , , ,  NP ,  numerus 

JB[v.20020404] Prn:14/05/2007; 10:06

F: BCT8IND.tex / p.3 (261-366)

Index 

P parallel corpus , , , , , , , , , , , , , , , –, , , , ,  paraphrase , , , , , , ,  parsing , , , , ,  pattern , , , , , , , , ,  phrase , , , , , , , , , , , , ,  grammar  structure  Plato’s Republic , , , , –, , ,  Polish , , , , , , , – post-modification , ,  pre-modification , , ,  precision rate – preposition , , , ,  prepositional phrase ,  preprocessing  probabilistic approach  probabilistic model ,  probability , , , ,  procedure , , , , , , , , , , , , , , , , , , ,  profile , –,  R record ,  reference grammar  regular expression ,  Reppen, Randi , ,  retrieval , , ,  rule , , , ,  rule-based approach  Russian , , , , , , –, ,  S science ,  semantic cohesion , , , , , , , , ,  semantic field , ,  semantic frame  semantic network 

semantic profile  semantics , , , , , ,  sentence , –, , , , , , , , , , , –, , , , , , , , , ,  alignment –, , ,  level , , ,  structure  Serbocroatian , ,  sign ,  Slovak , , ,  Slovenian , , , , ,  statistical alignment ,  statistical information  statistical machine translation  statistical method ,  statistics-based approach  structure , , , , , , , , , , , , , , , , ,  syntactic rule ,  syntactic pattern ,  syntactic structure ,  syntagmatic information  syntagmatic meaning  syntagmatic relation  T tag , , , –,  tagging , –, , ,  TELRI –, , , , , , –, , ,  term , –, , –, , , , , , , , –, ,  terminological dictionary  terminological recod , , , ,  terminology , , , , ,  text corpus , , , , , , ,  textual element , ,  textual level ,  textual resouce  textual segment ,  translation –, –, , , , , , , , , , , , , , , –,

, , , , , , , , –, –, , , –, , , –, , , , , –, , –, – equivalence , , , , , , –, , , , , ,  equivalent , , , , , , , , , , , , , , –, , , , –, –, – lexicon , –, ,  memory , ,  unit , , , , –, , ,  translator , , , , , , , , , ,  U uninterpreted symbol ,  universal grammar ,  V vanilla aligner ,  verb , , , , , , , , , , , , , ,  phrase ,  vocabulary , , , , , , , , , ,  Wierzbicka, Anna ,  word , , , , , , , , , , , , –, , –, , –, , , –, –, , , , , , –, , –, , , , , , –,  alignment , , ,  class , , ,  form  frequency  meaning  sense ,  Wordsmith software  Z Zettersten, Arne , 

In the series Benjamins Current Topics (BCT) the following titles have been published thus far or are scheduled for publication: 12 Dror, Itiel E. (ed.): Cognitive Technologies and the Pragmatics of Cognition. ca. 200 pp. Expected Forthcoming 11 Payne, Thomas E. and David J. Weber (eds.): Perspectives on Grammar Writing. ca. 230 pp. Expected September 2007 10 Liebal, Katja, Cornelia Müller and Simone Pika (eds.): Gestural Communication in Nonhuman and Human Primates. xiv, 275 + index. Expected July 2007 9 Pöchhacker, Franz and Miriam Shlesinger (eds.): Healthcare Interpreting. Discourse and Interaction. 2007. viii, 155 pp. 8 Teubert, Wolfgang (ed.): Text Corpora and Multilingual Lexicography. 2007. ix, 159 pp. 7 Penke, Martina and Anette Rosenbach (eds.): What Counts as Evidence in Linguistics. The case of innateness. 2007. x, 297 pp. 6 Bamberg, Michael (ed.): Narrative – State of the Art. 2007. vi, 271 pp. 5 Anthonissen, Christine and Jan Blommaert (eds.): Discourse and Human Rights Violations. 2007. x, 142 pp. 4 Hauf, Petra and Friedrich Försterling (eds.): Making Minds. The shaping of human minds through social context. 2007. ix, 275 pp. 3 Chouliaraki, Lilie (ed.): The Soft Power of War. 2007. x, 148 pp. 2 Ibekwe-SanJuan, Fidelia, Anne Condamines and M. Teresa Cabré Castellví (eds.): Application-Driven Terminology Engineering. 2007. vii, 203 pp. 1 Nevalainen, Terttu and Sanna-Kaisa Tanskanen (eds.): Letter Writing. 2007. viii, 160 pp.