Word Frequency Studies
Quantitative Linguistics 64
Editors
Reinhard Köhler · Gabriel Altmann · Peter Grzybek
Advisory Editor
Relja Vulanović
Mouton de Gruyter Berlin · New York
Word Frequency Studies
by
Ioan-Iovitz Popescu in co-operation with Gabriel Altmann, Peter Grzybek, Bijapur D. Jayaram, Reinhard Köhler, Viktor Krupa, Ján Mačutek, Regina Pustet, Ludmilla Uhlířová, Matummal N. Vidya
Mouton de Gruyter Berlin · New York
Mouton de Gruyter (formerly Mouton, The Hague) is a Division of Walter de Gruyter GmbH & Co. KG, Berlin.
Printed on acid-free paper which falls within the guidelines of the ANSI to ensure permanence and durability.
Library of Congress Cataloging-in-Publication Data
Popescu, Ioan-Iovit.
Word frequency studies / by Ioan-Iovitz Popescu in cooperation with Gabriel Altmann ... [et al.].
p. cm. – (Quantitative linguistics ; 64)
Includes bibliographical references and index.
ISBN 978-3-11-021852-7 (hardcover : alk. paper)
1. Language and languages – Word frequency. I. Altmann, Gabriel. II. Title.
P138.6.P67 2009
410.1151–dc22
2009014109
Bibliographic information published by the Deutsche Nationalbibliothek The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available in the Internet at http://dnb.d-nb.de.
ISBN 978-3-11-021852-7
ISSN 0179-3616
© Copyright 2009 by Walter de Gruyter GmbH & Co. KG, D-10785 Berlin.
All rights reserved, including those of translation into foreign languages. No part of this book may be reproduced in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher.
Cover design: Martin Zech, Bremen. Printing and binding: Hubert & Co. GmbH & Co. KG, Göttingen. Printed in Germany.
Preface
The study of word frequencies, i.e. the question of how often particular words, or word forms, occur in a given text or corpus of texts, is one of the favorite and most traditional issues in the history of quantifying approaches to language. Word counts, which are a necessary pre-condition for any theoretical study of lexical frequency behaviour, became popular quite early, and in any case long before quantitative linguistics became established as a discipline in its own right. Still today, the statistical description of lexical units and structures is a very popular issue in the broad field of empirical linguistics, such as corpus linguistics and the like. Historically speaking, such 'simple' word counts are known to us from the early 17th century, represented in the works of William Bathe or Jan Amos Komenský. Studies in this direction have always been motivated mainly by some kind of practical reason rather than by an interest in theoretical analyses. Thus, the desire to improve foreign language instruction, which was the impetus for the early work, also represented a major interest in many 20th-century studies aiming at a minimal lexicon, etc. More often than not, practical applications have motivated or accompanied frequency studies, whether in stenography, information science, psychology, pedagogy, or other disciplines, for all of which an insight into the frequency behaviour of lexical structures is of utmost importance. In fact, a quick search of modern databases like MLA or LLBA yields hundreds of relevant studies; it is beyond the scope of this preface to discuss them in detail. In linguistics, too, word counts have always been considered an important question. The relevance to lexicology is most obvious, but studies in the field of graphemics or phonology, aiming at a frequency-oriented description of letters, sounds, or phonemes and their combinatorial behaviour, have also profited from word counts. Yet, in the history of linguistics, it took quite some time until the study of lexical frequency behaviour became a topic of theoretical research, apart from any direct practical or applied interest. Only in the middle of the 20th century, starting with G.K. Zipf's influential works, did the statistical regularity of lexical items become a linguistic research object in its own right. As opposed to earlier approaches, Zipf also related frequency behaviour to other lexical characteristics, such as word length.
This kind of dynamic interrelation between various linguistic units and levels was later integrated within the theoretical context of synergetic linguistics, which provides a broad theoretical framework for quantitative linguistics. The present book is a continuation of this line of research. The basic idea is to present selected approaches to word frequency as currently developed and tested in our research group. The empirical tests of these new models were conducted on material from 20 languages, which enables some first general conclusions. However, a large number of obvious questions (such as author-specific studies, potentially genre-dependent phenomena, historical and dynamic perspectives, etc.) have been left to the reader, as linguistic material for such empirical investigations is easily available. The empirical part of the work was done without using special software. Most of the data can be processed with common PC programs or with small applications programmed without expert knowledge in software development. The first seven chapters present some classical problems from a new perspective, various indicators interpreted textologically or linguistically, and some new vistas on research possibilities. They concentrate on rank frequency distributions and frequency spectra, study their properties and interpret them. Chapter 8 discusses the representation of the text as a graph and shows some of its properties. Chapter 9 treats uniformly the problem of frequency distribution and its characterization. Chapter 10 shows a detail from language synergetics, viz. the relation between frequency and length. In Chapter 11, the frequency of words with respect to their position in sentences is scrutinized, and, tentatively, a new unit, the F-segment, is introduced. The classical problem of the type-token relation is presented in four variants in Chapter 12. A survey of results is presented in Chapter 13. Some further problems are listed in Chapter 14. The list of texts used and a short description of some non-European languages are added for orientation. Many earlier famous approaches, which have turned out to be infertile or have not been pursued after their publication, have not been considered in our study. New problems and new methods always replace many potentially productive approaches which were not even available earlier because of language barriers. We hope that the book gives a new impetus to the study of texts from different points of view, on the basis of word frequency – a popular kind of study, easy to conduct and nevertheless promising. It was not our aim to perform text-critical analyses or to make statements about individual languages. Such an aim can be achieved only on the basis of
thousands of texts in many languages and with respect to genre and text sort. We have merely tried to show some methodological aspects and vistas of word frequency, some differences, dependencies, tests and possible ways of research. This is merely a beginning – we hope this book will inspire new investigations. The authors are very grateful for Relja Vulanović's most careful and competent editing of this text, for Christoph Eyrich's helpful support in sophisticated LaTeX issues, and for Silke Wagner's help in preparing the layout.
The authors
Contents

Preface                                                          v
1   Introduction                                                 1
2   Problems and presentations                                   5
    2.1   Problems                                               5
    2.2   Presentations                                          9
3   The h- and related points                                   17
    3.1   The h-point                                           17
          3.1.1   A first look at vocabulary richness           29
    3.2   The k-point                                           35
          3.2.1   Definition                                    35
          3.2.2   A second look at vocabulary richness          36
          3.2.3   Indicator b                                   44
    3.3   The m-point                                           48
          3.3.1   The m-coverage of text and vocabulary richness  52
    3.4   Gini's coefficient and the n-point                    54
          3.4.1   The n-point                                   63
    3.5   The role of N and V                                   70
4   The geometry of word frequencies                            73
    4.1   Introduction                                          73
    4.2   The rank frequency distribution                       75
    4.3   The spectrum                                          81
5   The dynamics of word classes                                87
6   Thematic concentration of the text                          95
7   Crowding, pace filling and compactness                     101
    7.1   Crowding                                             101
    7.2   Pace filling                                         103
    7.3   Compactness                                          107
8   Autosemantic text structure                                111
    8.1   Introduction                                         111
    8.2   The probability of co-occurrence                     113
    8.3   The construction of a graph                          121
    8.4   Degrees                                              124
9   Distribution models                                        127
    9.1   General theory                                       127
    9.2   Special cases                                        130
    9.3   Applications                                         133
    9.4   The spectrum                                         143
    9.5   Evaluations                                          152
    9.6   Ord's criterion                                      154
    9.7   Repeat rate and entropy                              165
    9.8   Word classes                                         185
10  The relation of frequency to other word properties         195
11  Word frequency and position in sentence                    203
    11.1  Introduction                                         203
    11.2  Runs of binary data                                  206
    11.3  Runs of multiple data                                209
    11.4  Absolute positions                                   210
          11.4.1  Word frequency in the final position         211
          11.4.2  Word frequency in the initial position       213
    11.5  Relative position                                    214
    11.6  Frequency motifs                                     218
    11.7  Distances between hapax legomena                     227
12  The type-token relation                                    231
    12.1  Introduction                                         231
    12.2  Standard measurement                                 234
    12.3  Köhler-Galle method                                  239
    12.4  The ratio method                                     240
    12.5  Stratified measurement (the window method)           241
    12.6  The TTR of F-motifs                                  244
13  Conclusions                                                249
14  Appendix: Texts                                            253
References                                                     265
Subject index                                                  271
Author index                                                   275
1 Introduction
Frequency is one of the innumerable properties of the word. It is not an intrinsic property, because it cannot be measured directly on the word using some operational definition. It can be determined by counting occurrences of the word in a finite specimen of text. Its relative value changes from text to text and its population value cannot be stated at all because there are no true populations in language (cf. Orlov, Boroda, and Nadarejšvili 1982). Hence the concept of word probability in language – as propagated by G. Herdan (1969) – has no empirical correlate unless one admits it in a statistical sense as the relative frequency in an enormous corpus – a term which is very suspicious. Word frequency is, however, a very simple property lying, so to say, on the surface of the text and standing at the disposal not only of linguists but also of non-linguists. The counting can be performed mechanically and the results of counting can be used in typography, stenography, psychology, psychiatry, language teaching, cryptography, software production, etc. In linguistics, the uses of frequency are manifold, but only a part of them can be considered in this book. We shall restrict ourselves to different presentations and characterizations of frequency and the use of these characterizations for different purposes like vocabulary richness, text coverage or thematic concentration and autosemantic compactness. Using the concept of primary autosemantics we shall try to set up some concepts concerning the density of text. The theory of word frequency distributions will be treated from the viewpoint of a unique model. Using many texts in several languages we shall try to find out whether languages tend to special models or whether there is a general tendency as proposed by Zipf. The relations of frequency to other word properties will be studied in view of self-regulation and Köhler's control cycle. We restrict ourselves to the study of length, which can be ascertained directly from the text, leaving all other properties aside. Only two kinds of conditional frequency will be studied, namely the common occurrence of words and the position in the sentence, though conditional frequency is a source of manifold problems. The classical problem of the type-token relation will be studied in four different variants: standard, Köhler-Galle variant, stratified measurement, and ratio measurement; then, their relation to other properties will be scrutinized.
Though word frequency continuously acquires more scientific importance, especially in historical linguistics and structure building (cf. e.g. Bybee & Hopper 2001) – i.e., with regard to langue in the Saussurian understanding of this term – it is still a neglected domain with grammarians, many of whom still believe in the firmly wired hegemony of rules. Fortunately, this attitude is beginning to die out. The following facts show that frequency is a basic property in language: the fact that language is stored differently in each of us – a matter of fact for native multilinguals – and can be approached only by texts or utterances, in which the frequency or the number of cases of a phenomenon decides whether a rule should be established or not, and the fact that grammaticality has been acknowledged as a matter of degree (since the fifties of the last century). Therefore, different aspects of frequency deserve to be studied more thoroughly. The greatest problem in this type of research is the fuzziness of the units under study. However, the study of their frequency can perhaps at the same time help in the identification and family/class attribution of the units. The book should give a compact view of problems concerning word frequency in its different presentation forms, leaving aside structure-building problems, the influence on grammar, the historical influence of frequency, etc. The greatest problems connected with this research are of two kinds: what is a text and what is a word? Since in science one should never ask "what is what" in order to avoid essentialism (Plato, Kant or Marx), we put the question differently today: what can we consider text and word? Both concepts are very fuzzy; here we restrict ourselves to text. The definition "any utterance is text" is widespread, indeed, covering simple utterances like "wow" or "yes" as well as complex written texts in a given language. But texts of a given language do not establish a homogeneous population. A text written today in London is not homogeneous with a text written tomorrow in Washington or Perth. Hence a corpus as a whole is not a text for our purposes. We must restrict ourselves to individual texts in which homogeneity is better warranted. But what is homogeneity? As a matter of fact, only a spontaneously created text uttered in one go can be considered homogeneous. However, texts of this sort can be found only with Oriental storytellers. Neither Dante's Divine Comedy, nor Tolstoy's War and Peace, nor the Bible, nor any longer novel is homogeneous in our sense. They are, so to say, text mixtures. We must accept the fact that every pause in writing – used either for thinking, drinking coffee or sleeping – changes the cerebral rhythms, and in turn at least one text property changes its course. A break
arises. Not even the text author knows what changed. In the remote future, when brain researchers cooperate with linguists and linguists know the properties of texts better, it will be possible to localize the breaks in a text and analyze their quality. Today we know at least that at the beginning of a new chapter in a novel the type-token curve signals a break. Hence long texts cannot be homogeneous – but what does "long" mean? No statistician has ever said what a large or a small sample is. One speaks about "sufficiently large". In textology one can speak rather of texts that are "too large", which are necessarily mixtures. In order to find a way out of this problem, we consider a text as an uninterrupted sequence of sentences written/said in one go and having a certain coherence. In practice we must restrict the text length to about 10,000 words, to chapters of novels, to acts in dramas, to the speech of individual persons in drama, avoiding any kind of mixture as far as possible, and hoping that the investigated part is homogeneous. The requirement of homogeneity should be ascertained by an independent test, but this is almost impossible both on personal (writers die) and on theoretical grounds. Thus it is more probable that the requirement is at least partially fulfilled by "not too long texts", by texts not partitioned into chapters, by spontaneous speech or by letters. Usually we use texts of "sufficient length", but a theoretical investigation must also take into account texts composed of one word or one sentence, representing some limits of properties. Though there is a lower limit to text length, there is no upper limit. In any case we know that there are no infinite texts, and we rather arbitrarily set the upper limit of a homogeneous text at 10,000 words. In order to compare languages, one frequently uses a text and its translations. Perhaps some of the properties remain unchanged, e.g. the story or the meaning of individual sentences, but a translated text can be considered neither genuine nor as having even a trace of spontaneity. It may be useful for certain purposes but not for counting words, because the translator tries to follow a foreign regime by force. Further, we do not use pathological texts, which can have an entirely different dynamics. We use "regular", mostly literary texts, whether published or written as private letters. For our empirical studies and analyses we decided to study texts from different languages and, within each of them, texts representing different text types. In our attempt to cover the specifics of various languages from different language families, we were lucky to get support from specialists and native speakers, and we are happy to express our gratitude for their competent help in providing, checking and/or pre-processing the respective texts.
In detail, we have received material for the following language groups (the "responsible" persons in charge are given in brackets): American (R. Pustet), Germanic (I.-I. Popescu), Indian (B.D. Jayaram and M.N. Vidya), Indonesian (G. Altmann), Polynesian (V. Krupa), Romanic (I.-I. Popescu), Slavic (P. Grzybek and L. Uhlířová), Ugro-finnic (G. Altmann). As can be seen, the languages analyzed were chosen from different parts of the world in order to show the interlinguistic validity of the results; throughout this book, the individual texts are referred to by acronyms which are listed and deciphered in the appendix at the end of this book. The abbreviations used to denote the individual languages, from which the texts were taken rather randomly, can be taken from the following table.

American      Lakota [Lk]
Germanic      German [G], English [E]
Indian        Kannada [Kn], Marathi [Mr]
Indonesian    Tagalog [T], Indonesian [In]
Polynesian    Hawaiian [Hw], Rarotongan [Rt], Samoan [Sm], Maori [M], Marquesan [Mq]
Romanic       Latin [Lt], Italian [I], Romanian [R]
Slavic        Russian [Ru], Slovenian [Sl], Czech [Cz], Bulgarian [B]
Ugro-finnic   Hungarian [H]
At least three texts were taken from each language; the text types range from letters, poetry, fairy tales, newspaper articles, Nobel lectures, scientific texts, biographies and short stories to parts of novels. It was not our aim to study word frequency in individual text sorts but rather to explore the existence of regularities in texts. We are of course fully aware of possible shortcomings as to the desirable systematics of analysis, which would have to take into account systematically sample size, text type, individual style, and many other factors. As a consequence we do not, of course, raise the claim that our results are equally representative across all languages studied in this book; rather, many more detailed and systematic analyses are likely to be necessary in order to gain further insight into the theoretical matters brought forth here. Yet, we are convinced that some relevant tendencies can be observed which may be further elaborated upon and, first and foremost, tested in future – may these observations serve as starting points for further research, which can already make use of the ideas developed in this book in terms of grounded hypotheses.
2 Problems and presentations

2.1 Problems
The problems associated with processing word frequencies appear at the very start of the investigation. Usually one distinguishes word forms and lemmas, but neither of these quite familiar concepts can be unambiguously determined. In some languages the interference of written and spoken language makes the situation still more difficult. We must be aware of the fact that our definitions are not reflections of reality. They are conceptually constructed criteria which are not contained in the observed entities but serve as means for constructing data. Consequently, data are not found but constructed. Alphabetic or syllabic scripts seem to simplify the data problem: a word form can be defined as a sequence of letters between word separators if the given language is written with such separators, which is not always the case. But nevertheless, orthographic conventions do not necessarily follow linguistic concepts; therefore, affixes cannot automatically be distinguished from, e.g., prepositions and postpositions. Other conventions concern short (colloquial or not) forms such as we're or German Hast du's gesehen?. Another problem is caused by discontinuous units (cf. the French negation ne ... pas) and portmanteau morphemes (cf. French au for à le). Some languages, e.g. German and Hungarian, display an even more severe source of uncertainty in identifying the units to be counted. In these languages, separable affixes are common, where stem and affix form a continuous orthographic sequence in some grammatical forms while they appear far from each other, and even in a different order, in others (cf. German Sie will etwas abschreiben vs. Sie schreibt etwas ab (she wants to copy something vs. she copies something) or Hungarian kiszámol vs. számold ki (he computes vs. compute it!)). Other examples of such complications from diverse languages are: In Japanese, the question as to whether the postpositions no, de, ni, wa, ... are parts of word forms or separate words forms a descriptive uncertainty, while Hungarians are quite sure that morphemes like -ban, -ra, -on, -hoz, ... are suffixes to the preceding word. Language does not care for crisp boundaries; there are always transitional phenomena resisting clear-cut definitions. In Malay the prepositions di and ke are written separately, but in Indonesian they can be written together with the next noun unless it begins with a capital. In the case of Marathi
and Kannada the case markers occur as a morph within the word, whereas in Hindi they occur independently as postpositions. According to standard Latin grammar, there should be no cases or persons allowing the affix -que – but in real texts there are a lot of words that do have it. Hence the analyst must make descriptive decisions: one word form or two? Both decisions influence the frequencies in the text. Similar cases can be found in all languages. Lemmatisation is another procedure which cannot be performed without a number of decisions. Analytic (isolating) and agglutinative languages and languages with only little inflection, such as English, cause fewer problems than highly inflecting ones. Unfortunately, it is precisely the inflectional languages that call for lemmatisation; otherwise all the different word forms would be counted as different words, which is, in many cases, disadvantageous. In linguistics, the notion lexeme has been introduced for an abstract entity which represents an expression with a certain lexical meaning and which is realised, depending on the grammatical context, by one of the word forms. For these abstract entities, representatives are determined by convention. Theoretically, numbers or any other kind of labels could be used. For practical reasons, however, one of the word forms is selected. These word forms are called lemmas; they are used as labels of the lexemes in dictionaries and for other forms of metacommunication. Lemmatisation is relatively unproblematic, at least in principle, as long as word forms are constructed by affixation or allomorphs including suppletion. Problems arise mainly in cases of fusions, e.g. of prepositions with determiners (including portmanteau morphemes such as French au for à le), with clitics and with all kinds of discontinuous morphemes, etc. Here, decisions must be made on the basis of some criteria, which then should be followed consistently. The same is true of a number of function words such as pronouns. The word forms I, me, my, he, him, his can be counted separately as different words or lemmatised into 4, 2, or even 1 lemma, depending on how narrow the individual concept of a lexeme is. From the most systematic point of view, we, us, our and all the other forms for different persons, cases, and numbers should also be included in the lemma, in the same way as the different forms of nouns and verbs are counted as a single lemma, ignoring differences in case, number, gender, tense etc. (cf. go, goes, going, went, gone), even if they are formed by suppletion. One has to decide which of the categories that a word form expresses are considered lexical ones and which grammatical. Another problem is the parts-of-speech attribution. Different approaches distinguish from ten to hundreds of classes based again on different criteria. A
tagger uses syntactic information to state the affiliation of the word (form) to a class, but again, in some sentences only contextual information can help in the process of disambiguation. If a tagger distinguishes (the) hand and (to) hand, or in German gehen and Gehen (go), a word counter does not, if one does not lemmatise or tag before counting. Thus a part of the analysis is evidently "incorrect", but we may use it nevertheless, hoping that the deviations are not so serious as to falsify our hypotheses. These simple cases aim at showing the relativity and uncertainty of our knowledge. Our concept formation does not provide us with truth but with an approximation to something that lies behind our concepts. Of course, this is no reason for giving up; on the contrary, it is a reason to operationalise our concepts. Computational linguists are confronted with this problem daily when they are forced to tell the computer exactly what it has to do. In spite of this, a percentage of mechanical analyses is "incorrect", but a correction using pencil and paper does not mean greater accuracy; rather, it is mostly the use of other or more criteria. This boils down to the fact that the frequency analysis of a text is never finished: it shows not only the regularities and the random fluctuations present in every text but also the deviations caused by conceptual doctrines, schools, opinions and by the fuzziness of any classification in language. In this uncomfortable situation – which does not differ from that in other sciences – we search for something that grants more validity to our analyses, a kind of independent criterion whose application can help us choose one of the alternative analyses. Such criteria are linguistic laws. Their establishment is, of course, a matter of long development, study of boundary conditions present in all languages, testing in as many languages as possible, etc. But once we have a law, we can say that the more adequate analysis is the one whose results conform more closely to the law. Laws are hypothetical statements aimed at describing a real mechanism, an objective (natural) law. Law-like statements must fulfill the following conditions (cf. Bunge 1961, 1967: 361):
1. They must be general in some respect and in some domain, i.e. they do not relate to individual objects. In our case they must hold for all languages; there are no "laws of English".
2. They must be sufficiently corroborated empirically in some domain. This is a very tough criterion because every linguist is competent in only a few languages. There are no definitive corroborations, and we only intuitively know what "sufficiently" means. But the more diversified the sample of languages and texts that corroborates a law, the more we trust it.
3. They must be elements of a theory, of a scientific system. In practice this means that they must be derived from a theory or from other laws and/or they must have some derivable consequences. This criterion is even harder because it eliminates as law candidates all statements which are merely empirically based. Nevertheless, an empirical statement can give the first hint at the direction in which a theory could be constructed. But a theory is no automatic result of inductive generalizations; it does not follow from "data", it is a conceptual construct. The difference between laws and generalisations can be seen in Figure 2.1 (cf. Wimmer et al. 2003: 31).
[Diagram relating scientific laws (which reconstruct aspects of mechanisms and invariants), inductive hypotheses, and observed regularities.]
Figure 2.1: The status of laws and inductive hypotheses
Thus, the greatest challenge of any frequency investigation is the establishment of laws, or even of a theory. Although we are still very far from this ideal, we shall try to show some possible ways and directions. Unfortunately, word frequency has so many aspects that we can touch on only some of them here. We shall show some more advanced aspects and introduce some new ones, trying to hint at some possibilities. We shall have to leave out a number of aspects in which frequency is a cause of changes not only in speech but in the language system, too, because this domain is still in its initial, empirical phase (cf. e.g. Bybee & Hopper 2001), focusing on empirical phenomena in individual languages. The above list of conditions represents only the general problems which we encounter when we try to process word frequencies. In individual languages there are many other problems which must be solved using appropriate criteria either before or after text processing. We shall show them in the following chapters and invite the reader to test different approaches and to compare the results obtained.
2.2 Presentations
Word counting is a simple process, especially if one uses a computer, but the presentation of frequencies has several aspects that must be clearly distinguished.
1. The first distinction is between word forms and lemmas. Both can in turn be shaped using different criteria; e.g. one can consider clitics as parts of word forms or not, i.e. do we consider grammatical or phonetic word forms? In Slovak, a syllabic preposition takes the accent of the subsequent word and can be considered a component of the phonetic word. In Indonesian, the interrogative particle kah can be added to almost any word; in Japanese writing one cannot always see whether the same particle ka is part of the word or not. In some Slavic languages one writes the reflexive pronoun together with the word, in others separately. Lemmas can be set up in a narrow sense, e.g. you, they and your, their are four different lemmas, but in a broad sense there are only two in this example, or even only one, since person is a grammatical category.
2. The second distinction is that between a rank frequency distribution and the corresponding frequency spectrum. In the former, words are ordered according to their decreasing frequency, i.e. the variable x is the rank, and f(x) is the frequency. In the latter, g(x) is the number of words occurring exactly x times. We shall distinguish between f(x) and g(x) on practical grounds. These two representations can be transformed into each other, both mathematically and empirically. Empirically, we simply add all f(x) at subsequent ranks with the same value, their sum yielding g(x) with x equal to f(x).
3. The third distinction is between plain counting, which is the most common kind (yielding f(x) and g(x)); cumulative presentation, in which the frequencies are added stepwise, yielding

F(x) = \sum_{j=1}^{x} f(j),

which is usually performed using relative frequencies; and the reverse cumulative presentation, in which both the variable and the frequencies are relativized, but the ranking is reversed (the smallest frequency gets the smallest rank) and the summation begins at the last rank. We present examples below.
Thus, we have 24 possibilities of presenting and evaluating word frequencies, of which 20 are shown in Table 2.1 (cf. also Popescu & Altmann 2006).
Table 2.1: Possible presentations of word frequencies

                            Rank-frequency                    Frequency spectrum
                            plain   cum.   reverse cum.       plain   cum.
word form   grammatical       +       +         +               +       +
            phonetic          +       +         +               +       +
Lemma       narrow            +       +         +               +       +
            broad             +       +         +               +       +
It is not clear whether a reverse cumulative frequency spectrum makes any sense, hence we do not consider it here. There are 20 different aspects of word frequency presentations left, each providing different insights into the dynamics of frequency in texts. For the sake of illustration consider the following frequencies found in a text: (I) 1, 5, 3, 1, 2, 8, 1, 13, 1, 2, 2, 1, 4, 3, 6, 15, 5, 7, 1, 2
We order them according to decreasing frequency to obtain the following plain ranking (see Figure 2.2).
(II)
rank x:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
f(x):   15 13  8  7  6  5  5  4  3  3  2  2  2  2  1  1  1  1  1  1
Figure 2.2: Plain ranking of frequencies
The cumulative ranking yields
(III)
 x   F(x)   Frel(x)        x   F(x)   Frel(x)
 1    15    0.1807        11    71    0.8554
 2    28    0.3373        12    73    0.8795
 3    36    0.4337        13    75    0.9036
 4    43    0.5181        14    77    0.9277
 5    49    0.5904        15    78    0.9398
 6    54    0.6506        16    79    0.9518
 7    59    0.7108        17    80    0.9639
 8    63    0.7590        18    81    0.9759
 9    66    0.7952        19    82    0.9880
10    69    0.8313        20    83    1.0
Here one usually uses the relative cumulative frequencies, i.e. all numbers divided by their sum, e.g. F(1) = 15/83 = 0.1807; F(2) = 28/83 = 0.3373 etc. The highest rank rmax (here 20) is the vocabulary V of the text, the greatest F(rmax ) (here 83) is text length N. In Figure 2.3 one can see the cumulative ranking of relative frequencies (II) which is slightly smoother than the plain ranking.
Figure 2.3: Cumulative ranking of relative frequencies (F(x))
The reverse cumulative ranking can be constructed starting from (II) but reverting the frequency row, i.e.
(IV)
rank:           1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
reversed f(x):  1  1  1  1  1  1  2  2  2  2  3  3  4  5  5  6  7  8 13 15
then the frequencies will be cumulated
(V)
reversed rank:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
cumulative f(x):  1  2  3  4  5  6  8 10 12 14 17 20 24 29 34 40 47 55 68 83
In a last step all ranks are divided by the highest rank (here 20) and the cumulative frequencies are divided by the sum of frequencies (practically by the last frequency, here F(20) = 83). Thus we obtain finally (VI)
relative rank:        0.05  0.10  0.15  0.20  0.25  0.30  0.35  0.40  0.45  0.50
relative cum. f(x):  0.011 0.024 0.036 0.048 0.060 0.072 0.096 0.120 0.145 0.169
relative rank:        0.55  0.60  0.65  0.70  0.75  0.80  0.85  0.90  0.95  1.00
relative cum. f(x):  0.205 0.241 0.289 0.349 0.401 0.482 0.566 0.663 0.819 1.000
The graphical presentation of (VI) can be seen in Figure 2.4.
Figure 2.4: Reversed cumulative relative frequency with relative ranks
The frequency spectrum can be evaluated using (II). There are 6 ranks having frequency 1, i.e. x = 1, g(1) = 6; there are 4 ranks with frequency 2, i.e. x = 2, g(2) = 4 etc., yielding finally (VII)
x = frequency:                       1  2  3  4  5  6  7  8 13 15
g(x) = number of words with f(x):    6  4  2  1  2  1  1  1  1  1
The graphical presentation is in Figure 2.5. As can be seen, it differs drastically from the rank-frequency presentation.
Figure 2.5: Frequency spectrum
The cumulative frequency spectrum G(x) is obtained by simply adding up the frequencies in (VII) without skipping the missing values of the variable, yielding (VIII)
x:      1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
G(x):   6  10  12  13  15  16  17  18  18  18  18  18  19  19  20
The cumulative frequencies can be transformed to relative values – dividing by the G(xmax ), here 20 – yielding the cumulative relative frequency spectrum for which we do not reserve a special symbol, call it G(x) but the presentation clearly shows whether the absolute or relative frequencies are meant. In our case the cumulative relative frequencies are
(IX)
x:     1    2    3    4     5     6    7     8    9    10   11   12   13    14    15
G(x):  0.3  0.5  0.6  0.65  0.75  0.8  0.85  0.9  0.9  0.9  0.9  0.9  0.95  0.95  1.0
The graphical presentation is shown in Figure 2.6. This representation is used for capturing the text coverage. As can be seen, this presentation is not so “smooth” as that in Figure 2.3.
Figure 2.6: Cumulative relative frequencies of the spectrum (G(x))
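For readers who want to reproduce these presentations on their own material, the following minimal Python sketch (ours, not part of the book) derives the presentations (II)-(IX) directly from the raw frequencies given in (I); it uses only the standard library.

```python
# A minimal sketch (ours, not part of the book) that derives the presentations
# (II)-(IX) from the raw frequencies given in (I).
from collections import Counter

freqs = [1, 5, 3, 1, 2, 8, 1, 13, 1, 2, 2, 1, 4, 3, 6, 15, 5, 7, 1, 2]   # (I)

f = sorted(freqs, reverse=True)   # (II) plain rank-frequency: rank 1 = highest frequency
N = sum(f)                        # text length, here 83
V = len(f)                        # vocabulary size, here 20

# (III) cumulative ranking F(r), relativized by N
F, total = [], 0
for x in f:
    total += x
    F.append(total / N)

# (IV)-(VI) reverse cumulative ranking with relativized ranks and frequencies
rev_cum, total = [], 0
for x in reversed(f):
    total += x
    rev_cum.append(total)
reverse_presentation = [(r / V, c / N) for r, c in enumerate(rev_cum, start=1)]

# (VII) frequency spectrum g(x): number of words occurring exactly x times
g = Counter(f)

# (VIII)-(IX) cumulative spectrum G(x) for x = 1 .. max frequency, relativized by V
G, seen = [], 0
for x in range(1, max(f) + 1):
    seen += g.get(x, 0)
    G.append(seen / V)

print(f, F[:3], g[1], G[:3])   # reproduces (II), the start of (III), g(1) = 6, the start of (IX)
```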
In addition one can present both variables in relativized form, namely as <x/xmax, f(x)/f(1)> for ranks or <x/xmax, g(x)/g(1)> for the spectrum, yielding an easy optical comparison between texts. In Figure 2.7 (p. 15) the rank frequency distribution is presented in this form; hence one could extend Table 2.1 (p. 10) into a third dimension in which all relativized presentations would be placed.
Figure 2.7: Both variables in relativized form (rank-frequency)
Mathematically, one can devise indices with sampling properties, or models of these presentations, as will be shown in the following chapters. Each of the representations has its own task and characterizes an aspect of the word frequency distribution. Not all of them are used in linguistics with the same intensity; the reversed cumulative frequency has been used only once up to now (cf. Popescu & Altmann 2006). There are, of course, other problems which cannot be presented as pure frequencies; three of them will be shown in this book:
1. the type-token relation, a problem inherited from the first serious steps in quantitative linguistics. It does not have a unique form and we shall show four of them. Nevertheless, they all can be treated uniformly; 2. words coded as frequencies yielding time series, frequency motifs, Fourier series, and 3. words occurring together in a frame unit, e.g. sentence, giving rise to associations and networks whose properties can be studied and interpreted textologically.
3 The h- and related points

3.1 The h-point
Consider first the plain rank frequency distribution as presented in (II) in Chapter 2. Such distributions always have a monotonically decreasing hyperbolic form, but they are not always "smooth" in the sense that the frequencies are not positioned exactly on a theoretical hyperbolic curve. In empirical cases they are usually dispersed around the theoretical curve, and in some cases one even observes an almost regular oscillation. This is caused by a special character of the text or by a special manner of self-regulation. The fitting of theoretical curves (distributions) will be treated in Chapter 9. Here we can state that with increasing rank r and decreasing frequency f(r) there is a point at which r = f(r). This point is called the h-point and its distance to the origin [0, 0] is given by d = h√2, as can be seen in Figure 3.1.
Figure 3.1: The h-point definition: the “bisector” point of the rank frequency distribution at which rank = frequency
It was conceived by J.E. Hirsch for scientometrics (2005), introduced into linguistics by Popescu (2006) and further developed by Popescu & Altmann (2006, 2007) and Popescu, Best, & Altmann (2007). Obviously, the h-point
of actual discrete distributions is closely related to the mathematically important fixed point of continuous functions, and is defined by the same rule r = f(r) – cf. Sandefur (1990). Generally, if the rank frequency distribution f(r) were a continuous function, the h-point would be a fixed-point solution of the equation r = f(r), thus taking advantage of the full mathematical support of "fixed point theory and its applications" (just enter these key words into any Internet search engine). For over 50 years this theory has proved to be a powerful tool in the study of nonlinear phenomena and has been applied in various fields such as biology, chemistry, economics, engineering, game theory, and physics. In the present book, we aim to present empirical arguments for the application of the fixed point concept to discrete linguistic distributions as well. It is very simple to find the h-point in practice. Either one finds a class in which rank really equals frequency or one can use the formula

C = \frac{1}{f(r) - r} = \frac{1}{\text{frequency} - \text{rank}} .     (3.1)
C initially increases, attaining positive values; then a discontinuity occurs and C becomes negative and increases again. Joining the two extreme points, (r_1, C_1) and (r_2, C_2), around the discontinuity by a straight line, one finds the h-point at the intersection of the straight line and the x-axis. The procedure is shown in Figure 3.2 using two English texts (Bellow E-05 and Banting E-11).
Figure 3.2: h-point determination: (a) Banting Nobel lecture, (b) Bellow Nobel lecture
Alternatively, the h-point can be defined as the point at which the straight line between two (usually) neighbouring ranked frequencies intersects the y = x line. Solving two simultaneous equations we obtain the definition

h = \frac{f(r_1) r_2 - f(r_2) r_1}{r_2 - r_1 - [f(r_2) - f(r_1)]} = \frac{f(r_1)(r_2 - r_1) - [f(r_2) - f(r_1)] r_1}{r_2 - r_1 + f(r_1) - f(r_2)}     (3.2)
which can be further simplified for r2 − r1 = 1. In most applications of the present book the h-point will be rounded to the closest integer. The h-point has some properties which make it useful for text analysis. As is well known, the auxiliaries, the synsemantics etc. are usually more frequent than autosemantics; consequently they occupy the low ranks lying before the h-point in the graph (see Figure 3.1). Thus, the h-point divides the vocabulary in two parts, namely in a class of magnitude h of frequent synsemantics or auxiliaries (prepositions, conjunctions, pronouns, articles, particles, etc.) and
a much greater class (V − h) of autosemantics which are not so frequent but build the very vocabulary of the text. Since the h-point separates this "rapid" branch of synsemantics from the "slow" branch of autosemantics, Popescu and Altmann (2006) mentioned the analogy to "Maxwell's demon" in physics, which separates gas molecules of high velocity from those of low velocity. Of course, this separation is not clear-cut; sometimes there are autosemantics in the rapid branch and synsemantics in the slow branch. But this fact can be used to estimate the thematic concentration of the text, as shown in Chapter 6. Thus the fuzziness of h as a separating point has its textological relevance. J.E. Hirsch (2005) has argued that there is a relationship between the h-point and text length N, represented by the total area below the rank-frequency curve, namely

N = ah^2 .     (3.3)

Usually the dependence of a textological index on text length is very detrimental to any further discussion, but in this case the parameter a,

a = \frac{N}{h^2} ,     (3.4)

is an index showing the partitioning of the text into parts whose size is adapted to the text length, i.e. the partitioning of the text using its own pace h.¹

1. Actually, as recently argued (Popescu & Altmann 2009), the natural pace is (h − 1), inasmuch as the rank frequency origin is (1, 1) and not (0, 0).
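As a small illustration (our own sketch, not the authors' software), the h-point of formula (3.2) and the index a of formula (3.4) can be computed directly from a ranked frequency list:

```python
# A minimal sketch (ours, not the authors' software) of the h-point of
# formula (3.2) and of the index a = N / h^2 of formula (3.4).
def h_point(f):
    """f: ranked frequencies, f[0] = f(1) >= f[1] = f(2) >= ..."""
    for r, freq in enumerate(f, start=1):
        if freq == r:                      # a class where rank equals frequency
            return float(r)
        if freq < r:                       # crossing between ranks r-1 and r: formula (3.2)
            r1, r2 = r - 1, r
            f1, f2 = f[r1 - 1], f[r2 - 1]
            return (f1 * r2 - f2 * r1) / (r2 - r1 + f1 - f2)
    return float(len(f))                   # degenerate case: f(r) > r everywhere

def a_index(f):
    # note that the book usually rounds h to the closest integer before computing a
    return sum(f) / h_point(f) ** 2

# illustrative distribution (II) of Chapter 2: the crossing lies between ranks 5 and 6,
# hence h = 5.5 and a = 83 / 5.5**2 = 2.74
example = [15, 13, 8, 7, 6, 5, 5, 4, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1]
print(h_point(example), round(a_index(example), 2))
```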
In Table 3.1 we show texts of different lengths in different languages. As has been pointed out in the Introduction (see p. 4), the acronyms used to denote the individual texts are listed in the Appendix; the abbreviations of the languages can be found in the Introduction.

Table 3.1: Text length (N), h-point (h) and index a

Text ID      N     h      a      Text ID      N     h      a
B-01       761    10   7.61      Kn-080     4829    18  14.90
B-02       352     8   5.50      Kn-081     3247    13  19.21
B-03       515     9   6.36      Kn-082     1435    10  14.35
B-04       483     8   7.55      Kn-100     4828    16  18.86
B-05       406     7   8.29      Kn-101     2930    13  17.34
B-06       687     9   8.48      Kn-102     3801    14  19.39
B-07       557     8   8.70      Kn-104      578     6  16.06
B-08       268     6   7.44      Kn-105     3043    15  13.52
B-09       550     9   6.79      Kn-143     3812    17  13.19
B-10       556     7  11.35      Kn-186     3382    15  15.03
Cz-01     1044     9  12.89      Lk-01       345     8   5.39
Cz-02      984    11   8.13      Lk-02      1633    17   5.65
Cz-03     2858    19   7.92      Lk-03       809    12   5.62
Cz-04      522     7  10.65      Lk-04       219     6   6.08
Cz-05      999     9  12.33      Lt-01      3311    12  22.99
Cz-06     1612    13   9.54      Lt-02      4010    18  12.38
Cz-07     2014    15   8.95      Lt-03      4931    19  13.66
Cz-08      677     8  10.58      Lt-04      4285    20  10.71
Cz-09      460     6  12.78      Lt-05      1354     8  21.16
Cz-10     1156    11   9.55      Lt-06       829     7  16.92
E-01      2330    16   9.10      M-01       2062    18   6.36
E-02      2971    22   6.14      M-02       1175    15   5.22
E-03      3247    19   8.99      M-03       1434    17   4.96
E-04      4622    23   8.74      M-04       1289    15   5.73
E-05      4760    26   7.04      M-05       3620    26   5.36
E-06      4862    24   8.44      Mr-001     2998    14  15.30
E-07      5004    25   8.01      Mr-002     2922    18   9.02
E-08      5083    26   7.52      Mr-003     4140    20  10.35
E-09      5701    29   6.78      Mr-004     6304    24  10.94
E-10      6246    28   7.97      Mr-005     4957    19  13.73
E-11      8193    32   8.00      Mr-006     3735    19  10.35
E-12      9088    39   5.98      Mr-007     3162    16  12.35
E-13     11265    41   6.70      Mr-008     5477    27   7.51
G-01      1095    12   7.60      Mr-009     6206    26   9.18
G-02       845     9  10.43      Mr-010     5394    27   7.40
G-03       500     8   7.81      Mr-011     3975    22   8.21
G-04       545     8   8.52      Mr-015     4693    20  11.73
G-05       559     8   8.73      Mr-016     3642    18  11.24
G-06       545     8   8.52      Mr-017     4170    19  11.55
G-07       263     5  10.52      Mr-018     4062    20  10.16
G-08       965    11   7.98      Mr-020     3943    19  10.92
G-09       653     9   8.06      Mr-021     3846    19  10.65
G-10       480     7   9.80      Mr-022     4099    21   9.29
G-11       468     7   9.55      Mr-023     4142    20  10.36
G-12       251     6   6.97      Mr-024     4255    20  10.64
G-13       460     8   7.19      Mr-026     4146    19  11.48
G-14       184     5   7.36      Mr-027     4128    21   9.36
G-15       593     8   9.27      Mr-028     5191    23   9.81
G-16       518     8   8.09      Mr-029     3424    17  11.85
G-17       225     6   6.25      Mr-030     5504    20  13.76
Hw-01      282     7   5.76      Mr-031     5105    21  11.58
Hw-02     1829    21   4.15      Mr-032     5195    23   9.82
Hw-03     3507    26   5.19      Mr-033     4339    19  12.02
Hw-04     7892    38   5.47      Mr-034     3489    17  12.07
Hw-05     7620    38   5.28      Mr-035     1862    11  15.39
Hw-06    12356    44   6.38      Mr-036     4205    19  11.65
H-01      2044    12  14.19      Mr-038     4078    20  10.20
H-02      1288     8  20.13      Mr-040     5218    21  11.83
H-03       403     4  25.19      Mr-043     3356    16  13.11
H-04       936     7  19.10      Mr-046     4186    20  10.47
H-05       413     6  11.47      Mr-052     3549    17  12.28
In-01      376     6  10.44      Mr-149     2946    12  20.46
In-02      373     7   7.61      Mr-150     3372    16  13.17
In-03      347     6   9.64      Mr-151     4843    23   9.16
In-04      343     5  13.72      Mr-154     3601    17  12.46
In-05      414     8   6.47      Mr-288     4060    17  14.05
I-01     11760    37   8.59      Mr-289     4831    19  13.38
I-02      6064    25   9.70      Mr-290     4025    17  13.93
I-03       854    10   8.54      Mr-291     3954    18  12.20
I-04      3258    21   7.39      Mr-292     4765    19  13.20
I-05      1129    12   7.84      Mr-293     3337    13  19.75
Kn-001    3713    17  12.85      Mr-294     3825    17  13.24
Kn-002    4508    22   9.31      Mr-295     4895    20  12.24
Kn-003    3188    13  18.86      Mr-296     3836    18  11.84
Kn-004    1050     7  21.43      Mr-297     4605    18  14.21
Kn-005    4869    16  19.02      Mq-01      2330    22   4.81
Kn-006    5231    19  14.49      Mq-02       457    10   4.57
Kn-007    4434    16  17.32      Mq-03      1509    14   7.70
Kn-008    4393    15  19.52      Rt-01       968    14   4.94
Kn-009    3733    14  19.05      Rt-02       845    13   5.00
Kn-010    4483    18  13.84      Rt-03       892    13   5.28
Kn-011    4541    17  15.71      Rt-04       625    11   5.17
Kn-012    4141    19  11.47      Rt-05      1059    15   4.71
Kn-013    1302    10  13.02      R-01       1738    14   8.87
Kn-015    4456    17  15.42      R-02       2279    16   8.90
Kn-016    4735    18  14.61      R-03       1264    12   8.78
Kn-017    4316    18  13.32      R-04       1284    10  12.84
Kn-019    1787    14   9.12      R-05       1032    11   8.53
Kn-020    4556    22   9.41      R-06        695    10   6.95
Kn-021    1455    11  12.02      Ru-01       753     8  11.77
Kn-022    4554    21  10.33      Ru-02      2595    16  10.14
Kn-023    4685    23   8.86      Ru-03      3853    21   8.74
Kn-024    4588    15  20.39      Ru-04      6025    25   9.64
Kn-025    4559    17  15.78      Ru-05     17205    41  10.23
Kn-026    3716    14  18.96      Sm-01      1487    17   5.15
Kn-030    4499    18  13.89      Sm-02      1171    15   5.20
Kn-031    4672    21  10.59      Sm-03       617    13   3.65
Kn-044    2000    11  16.53      Sm-04       736    12   5.11
Kn-045    4304    14  21.96      Sm-05       447    11   3.69
Kn-046    4723    15  20.99      Sl-01       756     9   9.33
Kn-047    4084    12  28.36      Sl-02      1371    13   8.11
Kn-048    2219     8  34.67      Sl-03      1966    13  11.63
Kn-068    3530    15  15.69      Sl-04      3491    21   7.92
Kn-069    4567    18  14.10      Sl-05      5588    25   8.94
Kn-070    3184    15  14.15      T-01       1551    14   7.91
Kn-071    5258    10  52.58      T-02       1827    15   8.12
Kn-075    4485    22   9.27      T-03       2054    19   5.69
Kn-079    4610    21  10.45
As can be seen in Figure 3.3, N plays no role any more but the individual languages are separated in a fuzzy way.
Figure 3.3: The relationship between N and a
Greater h (smaller a) is a sign of analytism, i.e. the number of word forms is smaller, the synthetic elements are replaced by synsemantics. Thus h and a are at the same time both characteristics of a text (within the given language) and signs of analytism/synthetism in cross-linguistic comparison. Using the index a we get rid of the dependence on N. From the statistical point of view, the cross-linguistic comparison could be performed using the mean values of a in two languages without any recourse to N or V . But in cross-linguistic comparison of two texts we must try to proceed in a different way.
Let us first consider the quantity a and its mean values for 20 languages, as shown in Table 3.2.

Table 3.2: Mean values of quantity a in 20 languages

Language       Mean a      Language       Mean a
Samoan          4.56       Italian         8.41
Rarotongan      5.02       Romanian        9.15
Hawaiian        5.37       Slovenian       9.19
Maori           5.53       Indonesian      9.58
Lakota          5.69       Russian        10.10
Marquesan       5.69       Czech          10.33
Tagalog         7.24       Marathi        11.82
English         7.65       Kannada        16.58
Bulgarian       7.81       Hungarian      18.02
German          8.39       Latin          19.56
Though some languages are only sparsely represented, the picture is quite persuasive. The smallest mean a's belong to the most analytic (isolating) languages, situated in Polynesia. The greater a, the greater the synthetism, which culminates, as a matter of fact, in Hungarian and Latin. Thus, without recourse to the morphology of languages, the h-point and the derived quantity a can help a linguist find the position of a language on the analytism-synthetism scale. A confrontation with the Greenberg-Krupa indices (cf. Greenberg 1960; Krupa 1965) would be very interesting, but here we treat only texts. Of course, for a better estimation more texts would be necessary. The cross-linguistic comparison is simple. One can compare any two numbers from Table 3.2. The variances for individual languages can be computed from the data in Table 3.1 and inserted in the formula

t = \frac{|\bar{a}_1 - \bar{a}_2|}{s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}     (3.5)

where

s^2 = \frac{\sum_{i=1}^{n_1}(a_{i1} - \bar{a}_1)^2 + \sum_{i=1}^{n_2}(a_{i2} - \bar{a}_2)^2}{n_1 + n_2 - 2}

and t has n_1 + n_2 − 2 degrees of freedom.
Let us, for example, compare Tagalog and Indonesian, which belong to the same language group and have a-values of 7.24 and 9.27, respectively. From Table 3.1 we get for the numerator

Tagalog: (7.91 − 7.24)² + (8.12 − 7.24)² + (5.69 − 7.24)² = 3.6258;  n = 3
Indonesian: (10.44 − 9.27)² + (7.61 − 9.27)² + (9.64 − 9.27)² + (13.72 − 9.27)² + (6.47 − 9.27)² = 31.9039;  n = 5

hence s² = (3.6258 + 31.9039)/(5 + 3 − 2) = 5.9216, and s = √5.9216 = 2.4334. Inserting in (3.5) we obtain

t = \frac{|7.24 - 9.27|}{2.4334\sqrt{\dfrac{1}{3} + \dfrac{1}{5}}} = 1.14 .     (3.6)

Since for a two-sided test t_{0.05}(6) = 2.45, the difference is not significant.
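This comparison can be re-computed with a few lines of Python; the sketch below is ours and simply reproduces the book's Tagalog vs. Indonesian example with the a-values of Table 3.1.

```python
# A small re-computation (ours) of the Tagalog vs. Indonesian comparison based on
# formula (3.5); the a-values are taken from Table 3.1, and the group means 7.24
# and 9.27 are the ones used in the book's example.
from math import sqrt

tagalog    = [7.91, 8.12, 5.69]                  # T-01, T-02, T-03
indonesian = [10.44, 7.61, 9.64, 13.72, 6.47]    # In-01 ... In-05

def pooled_t(sample1, sample2, mean1=None, mean2=None):
    n1, n2 = len(sample1), len(sample2)
    m1 = sum(sample1) / n1 if mean1 is None else mean1
    m2 = sum(sample2) / n2 if mean2 is None else mean2
    ss = sum((x - m1) ** 2 for x in sample1) + sum((x - m2) ** 2 for x in sample2)
    s = sqrt(ss / (n1 + n2 - 2))                 # pooled standard deviation
    t = abs(m1 - m2) / (s * sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2                        # t statistic, degrees of freedom

t, df = pooled_t(tagalog, indonesian, mean1=7.24, mean2=9.27)
print(round(t, 2), df)   # 1.14, 6 -- not significant, since t_0.05(6) = 2.45
```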
We present two other methods to compare the above-mentioned text characteristics (cf. Mačutek, Popescu, & Altmann 2007): differences between cumulative probabilities corresponding to h-points and differences between a-indices will be tested. To apply the first approach, take two texts and denote by h_1 and h_2 their h-points, by cp_{h_1} and cp_{h_2} the cumulative probabilities at h_1 and h_2, and by N_1 and N_2 the numbers of word forms or lemmas in the texts, respectively. For sufficiently large N_1 and N_2 the statistic

U = \frac{\hat{cp}_{h_1} - \hat{cp}_{h_2}}{\sqrt{\dfrac{\hat{cp}_{h_1}(1 - \hat{cp}_{h_1})}{N_1} + \dfrac{\hat{cp}_{h_2}(1 - \hat{cp}_{h_2})}{N_2}}}     (3.7)

has approximately the standard normal distribution (almost all texts, with the exception of pathologically short ones, are long enough to satisfy the conditions under which the formula can be used). Note that we do not compare the h-points directly.
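The test is easy to implement; the following sketch (ours, not the authors' code) evaluates formula (3.7) for two texts, each characterized by its cumulative probability at the h-point and its length, and anticipates the Erlkönig vs. Peregrina comparison worked out below.

```python
# A minimal implementation (ours) of the U-test of formula (3.7). Each text is
# summarized by the pair (cp_h, N): the cumulative relative frequency at its
# h-point and its length in running words.
from math import sqrt

def u_test(cp1, n1, cp2, n2):
    """Asymptotic test for equality of the cumulative probabilities at the h-points."""
    se = sqrt(cp1 * (1 - cp1) / n1 + cp2 * (1 - cp2) / n2)
    return (cp1 - cp2) / se

# Erlkoenig (Table 3.3) against Peregrina (Table 3.4)
u = u_test(0.2133, 225, 0.1568, 593)
print(round(u, 4))   # 1.8153 < 1.96, hence no significant difference at alpha = 0.05
```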
The formula can be used to test whether the ratios

(number of word forms (lemmas) with frequencies higher than h) / (number of all word forms (lemmas))

are significantly different. Let us demonstrate the method on two texts. Table 3.3 shows the results for Goethe's Erlkönig.

Table 3.3: Cumulative distribution of word frequencies in Goethe's Erlkönig

 r   f(r)  Σf(r)   F(r)         r        f(r)  Σf(r)   F(r)
 1    11     11   0.0489       21         3    104   0.4622
 2     9     20   0.0889       22         2    106   0.4711
 3     9     29   0.1289       23         2    108   0.4800
 4     7     36   0.1600       24         2    110   0.4889
 5     6     42   0.1867       25         2    112   0.4978
 6     6     48   0.2133       26         2    114   0.5067
 7     5     53   0.2356       27         2    116   0.5156
 8     5     58   0.2578       28         2    118   0.5244
 9     4     62   0.2756       29         2    120   0.5333
10     4     66   0.2933       30         2    122   0.5422
11     4     70   0.3111       31         2    124   0.5511
12     4     74   0.3289       32         2    126   0.5600
13     4     78   0.3467       33         2    128   0.5689
14     4     82   0.3644       34         2    130   0.5778
15     4     86   0.3822       35         2    132   0.5867
16     3     89   0.3956       36         2    134   0.5956
17     3     92   0.4089       37         2    136   0.6044
18     3     95   0.4222       38         2    138   0.6133
19     3     98   0.4356       39         2    140   0.6222
20     3    101   0.4489       40-124*    1    225   1
* The ranks 40 to 124 have frequency 1
By way of comparison, Table 3.4 shows the results for Peregrina by Mörike.

Table 3.4: Cumulative distribution of word frequencies in Mörike's Peregrina

 r   f(r)  Σf(r)   F(r)         r        f(r)  Σf(r)   F(r)
 1    16     16   0.0270       38         2    218   0.3676
 2    16     32   0.0540       39         2    220   0.3710
 3    12     44   0.0742       40         2    222   0.3744
 4    12     56   0.0944       41         2    224   0.3777
 5    11     67   0.1130       42         2    226   0.3811
 6    10     77   0.1298       43         2    228   0.3845
 7     8     85   0.1433       44         2    230   0.3879
 8     8     93   0.1568       45         2    232   0.3912
 9     7    100   0.1686       46         2    234   0.3946
10     7    107   0.1804       47         2    236   0.3980
11     6    113   0.1906       48         2    238   0.4013
12     6    119   0.2007       49         2    240   0.4047
13     6    125   0.2108       50         2    242   0.4081
14     6    131   0.2209       51         2    244   0.4115
15     5    136   0.2293       52         2    246   0.4148
16     5    141   0.2378       53         2    248   0.4182
17     5    146   0.2462       54         2    250   0.4216
18     5    151   0.2546       55         2    252   0.4250
19     5    156   0.2631       56         2    254   0.4283
20     5    161   0.2715       57         2    256   0.4317
21     4    165   0.2782       58         2    258   0.4351
22     4    169   0.2850       59         2    260   0.4384
23     4    173   0.2917       60         2    262   0.4418
24     4    177   0.2985       61         2    264   0.4452
25     4    181   0.3052       62         2    266   0.4486
26     3    184   0.3103       63         2    268   0.4519
27     3    187   0.3153       64         2    270   0.4553
28     3    190   0.3204       65         2    272   0.4587
29     3    193   0.3255       66         2    274   0.4621
30     3    196   0.3305       67         2    276   0.4654
31     3    199   0.3356       68         2    278   0.4688
32     3    202   0.3406       69         2    280   0.4722
33     3    205   0.3457       70         2    282   0.4755
34     3    208   0.3508       71         2    284   0.4789
35     3    211   0.3558       72         2    286   0.4823
36     3    214   0.3609       73         2    288   0.4857
37     2    216   0.3642       74-378*    1    593   1
* The ranks 74 to 378 have frequency 1
28 The h- and related points Inserting the values from Tables 3.3 and 3.4 into (3.7) we have U=r
0.2133 − 0.1568
0.2133(1 − 0.2133) 0.1568(1 − 0.1568) + 225 593
= 1.8153
which means that for α = 0.05 we do not reject the hypothesis that the cumulative probabilities corresponding to the h-points in these poems are equal (u0.975 = 1.96). We applied the test for differences between cumulative probabilities corresponding to the h-points to 13 texts by Goethe and Eminescu and present the results in Table 3.5 (the bold font indicates significant differences, i.e. |U| > 1.96).

Table 3.5: U-test for the difference between cumulative probabilities corresponding to h-points

        G-05   G-09   G-10   G-11   G-12   G-14   G-17   R-01   R-02   R-03   R-04   R-05   R-06
G-05    –      -0.60  -1.84  -2.64  -0.91  -1.09  -0.43  -0.02   0.38  -0.33  -0.86  -0.20  -0.28
G-09    0.60   –      -1.33  -2.16  -0.46  -0.69   0.01   0.73   2.12  -0.37  -2.17   0.06  -0.19
G-10    1.84   1.33   –      -0.77   0.58   0.26   0.99   2.24   3.57   1.17  -0.49   1.51   1.17
G-11    2.64   2.16   0.77   –       1.21   0.83   1.58   3.25   4.63   2.11   0.44   2.43   2.01
G-12    0.91   0.46  -0.58  -1.21   –      -0.23   0.38   1.01   1.97   0.24  -1.02   0.53   0.32
G-14    1.09   0.69  -0.26  -0.83   0.23   –       0.58   1.19   2.03   0.50  -0.61   0.76   0.56
G-17    0.43  -0.01  -0.99  -1.58  -0.38  -0.58   –       0.46   1.34  -0.26  -1.44   0.03  -0.15
R-01    0.02  -0.73  -2.24  -3.25  -1.01  -1.19  -0.46   –       1.86  -1.38  -3.81  -0.78  -0.98
R-02   -0.38  -2.12  -3.57  -4.63  -1.97  -2.03  -1.34  -1.86   –      -3.17  -5.80  -2.41  -2.42
R-03    0.33   0.37  -1.17  -2.11  -0.24  -0.50   0.26   1.38   3.17   –      -2.22   0.49   0.15
R-04    0.86   2.17   0.49  -0.44   1.02   0.61   1.44   3.81   5.80   2.22   –       2.59   2.00
R-05    0.20  -0.06  -1.51  -2.43  -0.53  -0.76  -0.03   0.78   2.41  -0.49  -2.59   –      -0.27
R-06    0.28   0.19  -1.17  -2.01  -0.32  -0.56   0.15   0.98   2.42  -0.15  -2.00   0.27   –
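As an illustration of how such a comparison can be automated, the following Python sketch reproduces the U statistic used above from the coverages F(h) and the text lengths N of two texts (the function name and the example call are ours, not part of the original study).

from math import sqrt

def u_coverage(F1, N1, F2, N2):
    # U statistic for the difference of two h-point coverages F(h),
    # exactly as in the worked example above
    return (F1 - F2) / sqrt(F1 * (1 - F1) / N1 + F2 * (1 - F2) / N2)

# Erlkoenig (Table 3.3) vs. Peregrina (Table 3.4)
u = u_coverage(0.2133, 225, 0.1568, 593)
print(round(u, 4))   # ~1.8153; |U| < 1.96, no significant difference at alpha = 0.05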
While the previous approach is distribution-free (i.e., the method makes no assumption about the word rank-frequency distribution), in the next test, for the difference between a-indices, we use the right truncated zeta distribution as a model (cf. Chapter 9). Denote by a1 and a2 the a-indices in two texts. The statistic

Ua = (a1 − a2) / sqrt[Var(a1) + Var(a2)]

has, again, approximately the standard normal distribution. As the variances of the a-indices (cf. the denominator in the formula above) are not known and
attempts to approximate them were not successful, we ran some computer simulations. Erlkönig is used as an example again. We generate 225 random numbers from the right truncated zeta distribution (there are 225 words in Erlkönig) with the parameter 0.6007 (the parameter value yielding the best fit). In the next step the h-point for these random numbers is found. The procedure is repeated 100 times, resulting in 100 h-points from samples with the same size and the same distribution as the word frequencies in Erlkönig. Then we obtain the a-indices corresponding to the h-points and compute their variance. Finally, the process is repeated 10 times, i.e. we have 10 variance values, each of them being the variance of 100 a-indices. We take their mean as the variance of the a-index. Usually (much) higher numbers of randomly generated samples are used, but the simulation requires quite a lot of time and our aim is to present the method, not to find a definitive solution. In our texts we have

Erlkönig:  a1 = N1/h1² = 225/6² = 6.25
Peregrina: a2 = N2/h2² = 593/8² = 9.2656

We obtain Var(a1) = 48.82 for Erlkönig and Var(a2) = 99.05 for Peregrina, hence we have

Ua = (6.25 − 9.2656) / sqrt(48.82 + 99.05) = −0.248

which means that for α = 0.05 the a-indices of Erlkönig and Peregrina are not significantly different.
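The simulation just described can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: the truncation point of the right truncated zeta distribution is set to the vocabulary size V = 124, the h-point is computed in its simple integer form (the last rank r with f(r) ≥ r), and all function names are ours.

import random
from collections import Counter
from statistics import mean, pvariance

def rtzeta_sample(n, a, R, rng):
    # n draws from a right truncated zeta distribution P(x) ~ x**(-a), x = 1..R
    weights = [x ** (-a) for x in range(1, R + 1)]
    return rng.choices(range(1, R + 1), weights=weights, k=n)

def h_point(freqs):
    # simple integer h-point: the last rank r with f(r) >= r
    ranked = sorted(freqs, reverse=True)
    h = 1
    for r, f in enumerate(ranked, start=1):
        if f >= r:
            h = r
    return h

rng = random.Random(1)
N, R, a_param = 225, 124, 0.6007   # Erlkoenig: N tokens, assumed truncation at V, fitted parameter
variances = []
for _ in range(10):                # 10 repetitions ...
    a_indices = []
    for _ in range(100):           # ... of 100 simulated texts each
        tokens = rtzeta_sample(N, a_param, R, rng)
        freqs = list(Counter(tokens).values())
        a_indices.append(N / h_point(freqs) ** 2)   # a = N / h^2
    variances.append(pvariance(a_indices))
print(mean(variances))             # simulated Var(a) for Erlkoenig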
3.1.1 A first look at vocabulary richness

Let us now consider the cumulative function of the ranked frequencies, F(r) (in statistical terms, the empirical distribution function). It can be obtained by adding up the frequencies step by step, beginning from rank one. Consider the rank frequency distribution of words in Goethe's Erlkönig and the cumulative distribution in Table 3.3. As mentioned
before, the h-point separates the auxiliaries from the autosemantics in a fuzzy way. The cumulative relative frequency up to the h-point, i.e. F(h), represents the h-coverage of the text. In Table 3.3, where the h-point is h = 6, we have F(6) = 0.2133, i.e. the h-part of the text (the first 6 words) fills 21.33% of the text. One can say that this is the coverage by auxiliaries. But, as already mentioned, there can be some autosemantics among them and, vice versa, some auxiliaries are located after the h-point. In order to make a slight correction we consider the fact that the square built by h has the area h², as can be seen in Figure 3.1 (p. 17). Half of this area, relativized by N – which is the full area under the rank-frequency curve – is subtracted from F(h), yielding a new index

F̄(h) = F(h) − h²/(2N).    (3.8)
In our example we obtain F̄(6) = 0.2133 − 6²/(2·225) = 0.2133 − 0.08 = 0.1333, which represents the corrected h-coverage of Goethe's text (G-17). This number is simply a proportion to which a constant is added, hence the construction of an asymptotic test presents no difficulties. Now, the full area of the distribution from which F̄(h) is subtracted, i.e. 1 − F̄(h), represents an aspect of the vocabulary richness of the text. This area contains not only hapax legomena but also important autosemantics repeated several times. For Goethe (G-17) we obtain 1 − F̄(h) = 0.8667. If we consider one language, a certain dependence of 1 − F̄(h) on N may appear, but if we use texts from different languages, with all text lengths sufficiently represented, we observe instead that 1 − F̄(h) builds a cloud-like area that cannot be captured by any function. As a matter of fact, the determination coefficient for different functions is so small that one cannot speak of a dependence. Hence 1 − F̄(h) is an acceptable coefficient of vocabulary richness. Table 3.6 presents texts of different languages.
Table 3.6: Vocabulary richness R1 = 1 − F̄(h) for 176 texts in 20 languages
Text ID B-01 B-02 B-03 B-04 B-05 B-06 B-07 B-08 B-09 B-10 Cz-01 Cz-02 Cz-03 Cz-04 Cz-05 Cz-06 Cz-07 Cz-08 Cz-09 Cz-10 E-01 E-02 E-03 E-04 E-05 E-06 E-07 E-08 E-09 E-10 E-11 E-12
N
h
F(h)
R1
Text ID
N
h
F(h)
761 352 515 483 406 687 557 268 550 556 1044 984 2858 522 999 1612 2014 677 460 1156 2330 2971 3247 4622 4760 4862 5004 5083 5701 6246 8193 9088
10 8 9 8 7 9 8 6 9 7 9 11 19 7 9 13 15 8 6 11 16 22 19 23 26 24 25 26 29 28 32 39
0.2286 0.2358 0.2039 0.1988 0.2118 0.2344 0.1759 0.1716 0.2273 0.2284 0.1753 0.2124 0.2988 0.1513 0.2262 0.2593 0.2895 0.1876 0.1957 0.2137 0.3004 0.3618 0.3745 0.3674 0.3683 0.4093 0.3675 0.4552 0.3898 0.4092 0.4129 0.4547
0.8371 0.8551 0.8747 0.8675 0.8485 0.8246 0.8816 0.8956 0.8463 0.8157 0.8635 0.8491 0.7644 0.8956 0.8143 0.7931 0.7664 0.8597 0.8434 0.8386 0.7545 0.7197 0.6811 0.6898 0.7027 0.6499 0.6950 0.6113 0.684 0.6536 0.6496 0.6290
Lt-05 Lt-06 M-01 M-02 M-03 M-04 M-05 Mq-01 Mq-02 Mq-03 Mr-001 Mr-002 Mr-003 Mr-004 Mr-005 Mr-006 Mr-007 Mr-008 Mr-009 Mr-010 Mr-015 Mr-016 Mr-017 Mr-018 Mr-020 Mr-021 Mr-022 Mr-023 Mr-024 Mr-026 Mr-027 Mr-028
1354 829 2062 1175 1434 1289 3620 2330 457 1509 2998 2922 4140 6304 4957 3735 3162 5477 6206 5394 4693 3642 4170 4062 3943 3846 4099 4142 4255 4146 4128 5191
8 7 18 14 17 15 26 22 10 14 14 18 20 24 19 19 16 27 26 27 21 18 19 20 19 20 21 20 20 19 21 23
0.1041 0.0953 0.4617 0.4562 0.4888 0.4523 0.5221 0.5571 0.4289 0.4884 0.1301 0.1886 0.1452 0.1950 0.1434 0.1590 0.1556 0.2657 0.1640 0.2469 0.1862 0.1557 0.1530 0.1839 0.1643 0.1508 0.2140 0.1743 0.1753 0.1643 0.1999 0.1761
R1 0.9195 0.9343 0.6169 0.6272 0.6120 0.6350 0.5713 0.5468 0.6805 0.5765 0.9026 0.8668 0.9031 0.8507 0.8930 0.8893 0.8849 0.8009 0.8905 0.8207 0.8608 0.8888 0.8903 0.8653 0.8815 0.9012 0.8398 0.8740 0.8717 0.8792 0.8535 0.8749
(continued on next page)
Table 3.6 (continued from previous page) Text ID
N
h
F(h)
R1
Text ID
N
h
F(h)
E-13 G-01 G-02 G-03 G-04 G-05 G-06 G-07 G-08 G-09 G-10 G-11 G-12 G-13 G-14 G-15 G-16 G-17 H-01 H-02 H-03 H-04 H-05 Hw-01 Hw-02 Hw-03 Hw-04 Hw-05 Hw-06 I-01 I-02 I-03 I-04 I-05 In-01 In-02
11265 1095 845 500 545 559 545 263 965 653 480 468 251 460 184 593 518 225 2044 1288 403 936 413 282 1829 3507 7892 7620 12356 11760 6064 854 3258 1129 376 373
41 12 9 8 8 8 8 5 11 9 7 7 6 8 5 8 8 6 12 8 4 7 6 7 21 26 38 38 44 37 25 10 21 12 6 7
0.4663 0.2749 0.2249 0.2400 0.2367 0.2272 0.1890 0.1711 0.2238 0.2129 0.1813 0.1624 0.1992 0.2000 0.1902 0.1568 0.1564 0.2133 0.2495 0.2073 0.1886 0.2041 0.2034 0.3298 0.5489 0.5441 0.6405 0.6382 0.6475 0.3423 0.3130 0.2365 0.3011 0.2524 0.1729 0.2038
0.6083 0.7909 0.8230 0.8240 0.8220 0.8300 0.8697 0.8764 0.8389 0.8491 0.8697 0.8900 0.8725 0.8696 0.8777 0.8972 0.9054 0.8667 0.7857 0.8175 0.8313 0.8221 0.8402 0.7571 0.5717 0.5523 0.4510 0.4566 0.4308 0.7159 0.7385 0.822 0.7666 0.8114 0.8750 0.8619
Mr-029 Mr-030 Mr-031 Mr-032 Mr-033 Mr-034 Mr-035 Mr-036 Mr-038 Mr-040 Mr-043 Mr-046 Mr-052 Mr-149 Mr-150 Mr-151 Mr-154 Mr-288 Mr-289 Mr-290 Mr-291 Mr-292 Mr-293 Mr-294 Mr-295 Mr-296 Mr-297 R-01 R-02 R-03 R-04 R-05 R-06 Rt-01 Rt-02 Rt-03
3424 5504 5105 5195 4339 3489 1862 4205 4078 5218 3356 4186 3549 2946 3372 4843 3601 4060 4831 4025 3954 4765 3337 3825 4895 3836 4605 1738 2279 1264 1284 1032 695 968 845 892
17 20 21 23 19 17 11 19 20 21 16 20 17 12 16 23 17 17 19 17 18 19 13 17 20 18 18 14 16 12 10 11 10 14 13 13
0.1618 0.1428 0.1589 0.1883 0.1348 0.1253 0.0956 0.1620 0.1721 0.1479 0.1159 0.1732 0.1677 0.0995 0.1260 0.2125 0.1402 0.1490 0.1519 0.1150 0.1629 0.1624 0.0869 0.1545 0.1544 0.1689 0.1581 0.2267 0.2519 0.2057 0.1713 0.2141 0.2086 0.5072 0.4769 0.4294
R1 0.8804 0.8935 0.8843 0.8626 0.9068 0.9161 0.9369 0.8809 0.8769 0.8944 0.9222 0.8746 0.8730 0.9249 0.9120 0.8421 0.8999 0.8866 0.8855 0.9209 0.8781 0.8755 0.9384 0.8833 0.8865 0.8733 0.8771 0.8297 0.8043 0.8513 0.8676 0.8445 0.8633 0.5940 0.6231 0.6653
(continued on next page)
R1
Table 3.6 (continued from previous page) Text ID
N
h
F(h)
R1
Text ID
N
h
F(h)
In-03 In-04 In-05 Kn-003 Kn-004 Kn-005 Kn-006 Kn-011 Kn-012 Kn-013 Kn-016 Kn-017 Lk-01 Lk-02 Lk-03 Lk-04 Lt-01 Lt-02 Lt-03 Lt-04
347 343 414 3188 1050 4869 5231 4541 4141 1302 4735 4316 345 1633 809 219 3311 4010 4931 4285
6 5 8 13 7 16 20 17 19 10 18 18 8 17 12 6 12 18 19 20
0.1643 0.1137 0.2029 0.0994 0.0819 0.1327 0.1357 0.1132 0.1384 0.1421 0.1271 0.1383 0.2870 0.4489 0.4054 0.2785 0.1130 0.1686 0.1373 0.1998
0.8876 0.9227 0.8744 0.9271 0.9414 0.8936 0.9025 0.9186 0.9052 0.8963 0.9071 0.8992 0.8058 0.6396 0.6836 0.8037 0.9087 0.8718 0.8993 0.8469
Rt-04 Rt-05 Ru-01 Ru-02 Ru-03 Ru-04 Ru-05 Sl-01 Sl-02 Sl-03 Sl-04 Sl-05 Sm-01 Sm-02 Sm-03 Sm-04 Sm-05 T-01 T-02 T-03
625 1059 753 2595 3853 6025 17205 756 1371 1966 3491 5588 1487 1171 617 736 447 1551 1827 2054
11 15 8 16 21 25 41 9 13 13 21 25 17 15 13 12 11 14 15 19
0.4224 0.4448 0.1939 0.2405 0.2577 0.2574 0.2992 0.2037 0.2392 0.2477 0.3861 0.2770 0.5629 0.5115 0.5413 0.5041 0.4832 0.3972 0.3969 0.4747
0.6744 0.6614 0.8486 0.8088 0.7995 0.7945 0.7497 0.8499 0.8224 0.7953 0.6771 0.7789 0.5343 0.5846 0.5957 0.5937 0.6521 0.6660 0.6647 0.6132
It would be possible to consider the languages separately but in that case one would need a large number of texts for each and, as far as we could ascertain, the result would be the same. Let us call this index R1. Its final form is

R1 = 1 − F̄(h) = 1 − F(h) + h²/(2N).    (3.9)

Only F(h) is a variable representing a proportion (the rest are constants), so it is easy to set up an asymptotic test. Since Var(R1) = F(h)[1 − F(h)]/N, the difference between two texts can be tested using the familiar normal variable

z = (R1 − R2) / sqrt[Var(R1) + Var(R2)],    (3.10)

where R1 and R2 here denote the values of the index in the two texts.
For the sake of ease, the proportion F(h) was included in the table. Consider e.g. the difference between Latin Lt-04 and Tagalog T-03 in the last row
of Table 3.6. Using the numbers in the table we can directly insert the values into formula (3.10) and obtain

z = (0.8469 − 0.6132) / sqrt[0.1998(1 − 0.1998)/4285 + 0.4747(1 − 0.4747)/2054] = 18.55,
a value which is highly significant. Of course, these asymptotic tests strongly depend on N but at least a preliminary classification of texts is possible. As can be seen in Figure 3.4, this index does not depend on text length N.
Figure 3.4: The relation of R1 to N
However, one could perhaps find language-dependent or even author-dependent differences. It is also possible that the richness expressed by this index bears traces of the analytism/synthetism of the given language. Here, too, Latin texts have the greatest richness and Polynesian texts the smallest. But Hungarian is no longer the neighbour of Latin, hence genre can play a decisive role (for Latin we have literary texts, for Hungarian journalistic ones). Many particular investigations within one language are necessary in order to find all factors contributing to this kind of richness measurement.
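A minimal sketch of the computation of R1 and of the test (3.10), using the values of Lt-04 and T-03 from Table 3.6 (function names are ours):

from math import sqrt

def r1(Fh, h, N):
    # R1 = 1 - F(h) + h^2/(2N), cf. (3.9)
    return 1 - Fh + h ** 2 / (2 * N)

def z_r1(Fh1, h1, N1, Fh2, h2, N2):
    # asymptotic test (3.10); the variances use the proportion F(h)
    num = r1(Fh1, h1, N1) - r1(Fh2, h2, N2)
    den = sqrt(Fh1 * (1 - Fh1) / N1 + Fh2 * (1 - Fh2) / N2)
    return num / den

# Latin Lt-04 vs. Tagalog T-03 (values from Table 3.6)
print(round(r1(0.1998, 20, 4285), 4))                        # ~0.8469
print(round(r1(0.4747, 19, 2054), 4))                        # ~0.6132
print(round(z_r1(0.1998, 20, 4285, 0.4747, 19, 2054), 2))    # ~18.55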
3.2 The k-point
3.2.1 Definition

The k-point is ascertained for the frequency spectrum in analogy to the h-point for ranks. It is the point at which the frequency x is equal to the number of cases (or occurrences) that have exactly frequency x; this number will be called g(x) in the following chapters. The frequency spectrum is based on the rank frequency distribution, showing the number g(1) of words that occur exactly once, the number g(2) of words that occur exactly twice, etc. This means that one goes through the rank-frequency distribution beginning from below. In Table 3.7 we show the frequency spectrum of Goethe's Erlkönig, obtained from the second column of Table 3.3.

Table 3.7: The frequency spectrum of word forms in Goethe's Erlkönig

Frequency x    Number of cases with frequency x [g(x)]
1               85
2               18
3                6
4                7
5                2
6                2
7                1
9                2
11               1
V              124
In order to determine the k-point, one can use the same procedure as described in Section 3.1 above. If there is no identity x = g(x), one can interpolate between the two values x < g(x) and x + 1 > g(x + 1). In the given example it is exactly the mean of 4 and 5, but this does not have to be the case in general. The two critical points are given by x = 4 and x = 5. From formula (3.2) we compute

C1 = [g(x) − x]⁻¹  and  C2 = [g(x + 1) − (x + 1)]⁻¹,

which yields C1 = (7 − 4)⁻¹ = 0.3333 and C2 = (2 − 5)⁻¹ = −0.3333. Our two points are now

A1 = ⟨x, C1⟩ = ⟨4, 0.3333⟩  and  A2 = ⟨x + 1, C2⟩ = ⟨5, −0.3333⟩.

We set up the usual equation

y − y1 = [(y2 − y1)/(x2 − x1)] (x − x1)    (3.11)

and, inserting the stated values, we obtain

y − 0.3333 = [(−0.3333 − 0.3333)/(5 − 4)] (x − 4),

from which y = 3 − 0.6667x. Our k-point is the point of intersection with the x-axis, i.e. the point where y = 0. Inserting this value in the above formula, we obtain x = 4.5 = k, where k denotes both coordinates of the k-point. Again, the k-point is the point closest to the origin ⟨0, 0⟩, as can be seen in Figure 3.5. It is a specific text characteristic, but it depends strongly on the vocabulary V, which in turn depends on text length N. Nevertheless, we know that the words from x = 1 to x = k are just the autosemantics making up the vocabulary richness. But before we define a new index, some notes on vocabulary richness are in order.

Figure 3.5: The determination of the k-point in the frequency spectrum
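The interpolation just described can be written down compactly. The following sketch (function name ours) takes a frequency spectrum as a dictionary {x: g(x)} and returns the k-point, reproducing k = 4.5 for Erlkönig:

def k_point(spectrum):
    # k-point of a frequency spectrum {x: g(x)}; if no x with g(x) = x exists,
    # interpolate as in Section 3.2.1 via C1 = 1/(g(x)-x), C2 = 1/(g(x+1)-(x+1))
    xs = sorted(spectrum)
    for x in xs:
        if spectrum[x] == x:
            return float(x)
    for x in xs:
        gx, gx1 = spectrum.get(x, 0), spectrum.get(x + 1, 0)
        if x < gx and x + 1 > gx1:
            c1 = 1.0 / (gx - x)
            c2 = 1.0 / (gx1 - (x + 1))
            slope = c2 - c1                 # x2 - x1 = 1 in (3.11)
            return x - c1 / slope           # intersection with the x-axis (y = 0)
    raise ValueError("no k-point found")

# frequency spectrum of Erlkoenig (Table 3.7)
spectrum = {1: 85, 2: 18, 3: 6, 4: 7, 5: 2, 6: 2, 7: 1, 9: 2, 11: 1}
print(k_point(spectrum))   # 4.5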
3.2.2 A second look at vocabulary richness

The vocabulary richness of a given text is a problematic concept which suffers from the dependence of the majority of procedures on text length N. The longer the text, the (relatively) smaller the increase of different words (V) in it. Hence, if a text is "sufficiently" long, the rate of change of the majority of type-token indices, dV/dN, must converge to zero. If they converge to infinity, they are in principle wrong, though measurement in finite texts is possible. In previous research one has more often than not tried to avoid this circumstance by building various functions of logarithms of V and N, but the confidence intervals of these indices are so enormous that texts of different length are hardly comparable (for a survey see e.g. Wimmer 2005). The number of different indices and type-token curves is rather great and there is a large amount of literature on this topic. Another problem is rather of a qualitative kind. Do synsemantics and auxiliaries really contribute to vocabulary richness? They are present in all texts, do not contribute much to the content of the text and, if they are inflectional or suppletive (e.g. I, my, we, our, etc.), they furnish a lot of different redundant forms. Of course, if one counts lemmas, their number can be restricted. Still, they occur at the initial ranks of the rank frequency distribution. In order to avoid this problem one can eliminate such words – but again, every researcher would solve the problem differently. Another possibility consists in counting only the hapax legomena – which is a subjective decision – but it is not clear why words occurring twice or more times do not contribute to vocabulary richness. Besides, the number of hapax legomena strongly depends on N. Searching for other possibilities, Popescu and Altmann (2006) have considered the fact that the h-point and the k-point are objective, though fuzzy, boundary points separating auxiliaries (in general) from autosemantics. Of course, some autosemantics occur sometimes in the domain above h or below k, and auxiliaries occur also in the domain below h or above k. But these points take into account the complete distribution and can be computed without any
subjective decision. Above we have shown that the index R1 = 1 − F̄(h) is a quite reliable index of what we call vocabulary richness. Here we can do the same with the k-point and derive a new index. First we define the cumulative relative frequency of the spectrum up to the k-point as G(k). The individual values for Goethe's Erlkönig can be found in the last column of Table 3.8.

Table 3.8: Cumulative relative frequency of the spectrum in Goethe's Erlkönig

Frequency x    g(x)    Cumulative absolute    Cumulative relative G(x)
1               85      85                     0.6855
2               18     103                     0.8306
3                6     109                     0.8790
4                7     116                     0.9355
5                2     118                     0.9516
6                2     120                     0.9677
7                1     121                     0.9758
9                2     123                     0.9919
11               1     124                     1.0000
V              124
In order to clean this domain of synsemantics we subtract from it half of the square built between the k-point and the two axes, relativized by V, namely k²/(2V) – in the same way as we did in Section 3.1.1. Thus we obtain a new richness index

R2 = G(k) − k²/(2V) = Ḡ(k).    (3.12)

Applied to the Goethe text (G-17) in Table 3.8 we obtain (taking simply the mean) k = 4.5, G(4.5) = 0.9436 and V = 124, hence

R2(Erlkönig) = 0.9436 − 4.5²/[2(124)] = 0.9436 − 0.0817 = 0.8619 = Ḡ(k).
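A small sketch of the computation of R2 from a frequency spectrum, reproducing the value for Erlkönig; taking the mean of the two neighbouring G values at the half-integer k is our reading of the procedure above, and the function name is ours:

def r2(spectrum, k):
    # R2 = G(k) - k^2/(2V), cf. (3.12); at a half-integer k the mean of the
    # two neighbouring G values is taken, as in the worked example above
    V = sum(spectrum.values())
    def G(x):
        return sum(g for f, g in spectrum.items() if f <= x) / V
    lo = int(k)
    Gk = G(lo) if k == lo else (G(lo) + G(lo + 1)) / 2
    return Gk - k ** 2 / (2 * V)

spectrum = {1: 85, 2: 18, 3: 6, 4: 7, 5: 2, 6: 2, 7: 1, 9: 2, 11: 1}
print(round(r2(spectrum, 4.5), 4))   # ~0.8619 for Erlkoenig (G-17)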
In Table 3.9 we present the results of computing R2 for several texts and sort them according to V .
Table 3.9: Vocabulary richness R2 for 176 texts in 20 languages (ranked according to V ) Text ID
N
V
k
G(k)
Hw-01 Lk-04 G-17 Sm-05 G-14 Sm-03 Mq-02 Sm-04 G-07 G-12 Lk-01 B-08 Rt-04 In-05 In-03 Rt-05 B-02 Rt-03 In-02 In-04 Rt-02 In-01 Sm-02 Rt-01 B-05 G-13 Hw-02 Cz-09 Sm-01 G-04 Lk-03 M-02 M-03 G-03 B-03
282 219 225 447 184 617 457 736 263 251 345 268 625 414 347 1059 352 892 373 343 845 376 1171 968 406 460 1829 460 1487 545 809 1175 1434 500 515
104 116 124 124 129 140 150 153 169 169 174 179 181 188 194 197 201 207 209 213 214 221 222 223 238 253 257 259 267 269 272 277 277 281 285
5 4 4 3 3 4 5 4 4 4 4 3 7 5 5 6 4 8 4 4 5 3 6 6 4 4 8 5 6 4 6 6 8 5 4
0.8942 0.9310 0.9355 0.7823 0.9380 0.7714 0.8933 0.7647 0.9586 0.9527 0.9138 0.9274 0.9006 0.8989 0.9639 0.797 0.9353 0.9034 0.9474 0.9671 0.8505 0.9005 0.8423 0.8789 0.9412 0.9288 0.8288 0.9653 0.8689 0.9219 0.9191 0.8592 0.8628 0.9573 0.9333
R2 = G(k) Text ID 0.774 0.862 0.871 0.746 0.9031 0.7143 0.8100 0.7124 0.9113 0.9054 0.8678 0.9022 0.7652 0.8324 0.8995 0.7056 0.8955 0.7488 0.9091 0.9295 0.7921 0.8801 0.7612 0.7982 0.9076 0.8972 0.7043 0.9170 0.8015 0.8922 0.8529 0.7942 0.7473 0.9128 0.9052
N
V
k
G(k)
Lt-05 1354 909 6 0.9802 E-01 2330 939 6 0.9414 E-08 5083 985 12 0.9442 E-03 3247 1001 8 0.952 E-02 2971 1017 7 0.9351 Hw-06 12356 1039 10 0.8479 H-01 2044 1079 7 0.9815 Sl-04 3491 1102 8 0.9528 Mr-035 1862 1115 7 0.9812 E-06 4862 1176 9 0.9379 R-02 2279 1179 8 0.9796 Mr-002 2922 1186 7 0.9401 E-04 4622 1232 7 0.9221 I-04 3258 1237 6 0.9418 Ru-02 2595 1240 7 0.9645 Mr-007 3162 1262 8 0.9493 Cz-03 2858 1274 7 0.9631 E-10 6246 1333 10 0.9347 Mr-027 4128 1400 9 0.9429 Mr-029 3424 1412 9 0.9653 Mr-046 4186 1458 10 0.9575 E-05 4760 1495 7 0.9472 Mr 006 3735 1503 10 0.9694 Mr-150 3372 1523 8 0.9626 Mr-149 2946 1547 8 0.9767 Mr-001 2998 1555 9 0.9788 E-09 5701 1574 8 0.9428 E-07 5004 1597 7 0.9474 Mr-038 4078 1607 9 0.9614 Mr-052 3549 1628 9 0.9705 Mr-010 5394 1650 10 0.9473 E-13 11265 1659 11 0.9066 E-11 8193 1669 11 0.9401 Mr-151 4843 1702 10 0.9618 Mr-022 4099 1703 9 0.9659
R2 = G(k) 0.9604 0.9222 0.8711 0.9200 0.9110 0.7998 0.9588 0.9238 0.9592 0.9035 0.9525 0.9194 0.9022 0.9272 0.9447 0.9239 0.9439 0.8972 0.9140 0.9366 0.9232 0.9308 0.9361 0.9416 0.9560 0.9528 0.9225 0.9321 0.9362 0.9456 0.9170 0.8701 0.9039 0.9324 0.9421
(continued on next page)
Table 3.9 (continued from previous page) Text ID
N
V
k
G(k)
B-04 Mq-01 H-05 H-03 G-16 G-11 G-10 Mq-03 B-09 B-10 Cz-04 B-07 G-06 M-04 G-05 G-02 G-15 G-09 B-06 Cz-08 M-01 B-01 Ru-01 R-06 Sl-01 Lk-02 I-03 G-08 I-05 M-05 Hw-03 G-01 Cz-02 Cz-05 R-05 Sl-02 H-04 Lt-06 T-01 Cz-01
483 2330 413 403 518 468 480 1509 550 556 522 557 545 1289 559 845 593 653 687 677 2062 761 753 695 756 1633 854 965 1129 3620 3507 1095 984 999 1032 1371 936 829 1551 1044
286 289 290 291 292 297 301 301 313 317 323 324 326 326 332 361 378 379 388 389 398 400 422 432 457 479 483 509 512 514 521 530 543 556 567 603 609 609 611 638
4 9 3 4 5 7 4 7 6 5 5 5 6 8 7 6 4 5 4 4 6 5 5 4 5 6 5 5 5 7 7 8 5 5 5 7 5 4 6 6
0.9406 0.8443 0.969 0.9897 0.9418 0.9798 0.9568 0.887 0.9617 0.9653 0.9721 0.9506 0.9724 0.9202 0.9729 0.9584 0.9471 0.9314 0.9459 0.9460 0.8417 0.9475 0.9621 0.9537 0.9716 0.9207 0.9648 0.9497 0.9395 0.8268 0.8541 0.9679 0.9613 0.9622 0.9612 0.9436 0.9858 0.9753 0.9509 0.9749
R2 = G(k) Text ID 0.9126 0.7042 0.9535 0.9622 0.8990 0.8973 0.9302 0.8056 0.9042 0.9259 0.9334 0.9120 0.9172 0.8220 0.8991 0.9085 0.9259 0.8984 0.9253 0.9254 0.7965 0.9163 0.9325 0.9352 0.9442 0.8831 0.9389 0.9251 0.9151 0.7791 0.8071 0.9075 0.9383 0.9397 0.9392 0.9030 0.9653 0.9622 0.9214 0.9467
Mr-154 Mr-003 Mr-024 Mr-018 Ru-03 Mr-021 Mr-008 E-12 Mr-020 Mr-016 Kn-003 Kn-012 Mr-017 Mr-034 Mr-023 Lt-04 Mr-294 Mr-015 Mr-291 Mr-043 Mr-296 Mr-293 Mr-005 Mr-026 Mr-036 Mr-288 Kn-017 Mr 292 I-02 Lt-01 Mr-033 Sl-05 Mr-297 Mr-289 Mr-290 Mr-295 Lt-02 Kn-016 Mr-032 Mr-028
N
V
k
G(k)
R2 = G(k)
3601 4140 4255 4062 3853 3846 5477 9088 3943 3642 3188 4141 4170 3489 4142 4285 3825 4693 3954 3356 3836 3337 4957 4146 4205 4060 4316 4765 6064 3311 4339 5588 4605 4831 4025 4895 4010 4735 5195 5191
1719 1731 1731 1788 1792 1793 1807 1825 1825 1831 1833 1842 1853 1865 1872 1910 1931 1947 1957 1962 1970 2006 2029 2038 2070 2079 2122 2197 2203 2211 2217 2223 2278 2312 2319 2322 2334 2356 2382 2386
8 11 10 10 8 9 10 10 10 9 8 9 9 8 9 9 8 9 10 7 8 8 10 10 8 7 9 10 10 7 9 8 10 9 8 9 7 11 10 9
0.9715 0.9757 0.9711 0.9698 0.9749 0.9716 0.9596 0.9326 0.9189 0.9765 0.9847 0.9647 0.9703 0.9769 0.9722 0.9723 0.9746 0.9656 0.9806 0.976 0.9751 0.9835 0.9670 0.9789 0.9691 0.9673 0.9783 0.9782 0.9655 0.9887 0.9797 0.9627 0.9802 0.9719 0.9819 0.9733 0.9803 0.9148 0.9769 0.9640
0.9529 0.9407 0.9422 0.9418 0.9570 0.9490 0.9319 0.9052 0.8915 0.9544 0.9672 0.9427 0.9484 0.9597 0.9506 0.9511 0.9580 0.9448 0.9551 0.9635 0.9589 0.9675 0.9424 0.9544 0.9536 0.9555 0.9592 0.9554 0.9428 0.9776 0.9614 0.9483 0.9583 0.9544 0.9681 0.9559 0.9698 0.8891 0.9559 0.9470
(continued on next page)
Table 3.9 (continued from previous page) Text ID
N
V
k
G(k)
Cz-10 T-03 Hw-05 R-03 Kn-004 T-02 R-04 Hw-04 H-02 Kn-013 R-01 Cz-07 Sl-03
1156 2054 7620 1264 1050 1827 1284 7892 1288 1302 1738 2014 1966
638 645 680 719 720 720 729 744 789 807 843 862 907
6 7 9 6 5 7 7 9 6 5 7 6 6
0.9608 0.9504 0.8309 0.9694 0.9778 0.9611 0.9726 0.8414 0.8099 0.9740 0.9632 0.9536 0.9592
R2 = G(k) Text ID 0.9326 0.9124 0.7713 0.9444 0.9604 0.9271 0.939 0.787 0.7871 0.9585 0.9341 0.9327 0.9394
N
V
Mr-009 6206 2387 Kn-006 5231 2433 Mr-004 6304 2451 Kn-005 4869 2477 Kn-011 4541 2516 Ru-04 6025 2536 Mr-031 5105 2617 Lt-03 4931 2703 Mr-040 5218 2877 Mr-030 5504 2911 I-01 11760 3667 Ru-05 17205 6073 Cz-06 1612 8405
k
G(k)
R2 = G(k)
12 10 10 9 8 9 8 9 8 9 14 12 5
0.9749 0.9786 0.9678 0.9802 0.9809 0.9720 0.9759 0.9815 0.9781 0.9780 0.9722 0.9740 0.9631
0.9447 0.9580 0.9474 0.9638 0.9682 0.9560 0.9637 0.9665 0.9670 0.9641 0.9455 0.9621 0.9616
As can be seen, this index does not depend on either N or V. Though the picture in Figure 3.6 (which shows R2 plotted against V) hints at an exponential curve, it is rather a cloud which cannot be captured exponentially.
Figure 3.6: The relationship between vocabulary size V and vocabulary richness R2
In Figure 3.7, where ⟨N, R2⟩ is presented, R2 is rather a constant with individual textual variations. Hence we have again an independent index of richness.
Figure 3.7: The relationship between text length N and vocabulary richness R2
For comparisons we again use the fact that G(k) is the only variable representing a proportion; hence a normal test can be performed mechanically. Let us compare, for illustration, two Indonesian texts, In-05 and In-03, having G(k) and Ḡ(k) as given in Table 3.9 (p. 39). The variance of Ḡ(k) is given as Var[Ḡ(k)] = Var[G(k) − k²/(2V)] = Var[G(k)] because the additive component is a constant. But Var[G(k)] = G(k)[1 − G(k)]/V, hence our criterion is

z = [G(k)1 − G(k)2] / sqrt{ G(k)1[1 − G(k)1]/V1 + G(k)2[1 − G(k)2]/V2 }.    (3.13)
A slightly better approximation might be achieved by taking the mean of the G(k)’s but our numbers are always very large, thus it is sufficient to use the above formula. For the two Indonesian texts we obtain
z = (0.8325 − 0.8995) / sqrt[0.8989(1 − 0.8989)/188 + 0.9639(1 − 0.9639)/194] = −2.60
signalling a significant difference between these texts, even if the difference is relatively small. Hence text In-03 is significantly richer than In-05. Performing the same procedure as in Section 3.1.1 and computing the mean richness R2 for our texts, we get almost the same picture: though the order of languages is slightly changed, as can be seen in Table 3.10, strongly isolating languages have a small number of different morphological forms and strongly agglutinating or inflectional languages a great number of forms. Hence index R2 is again a kind of morphological index and can be used for linguistic (e.g. typological) purposes.

Table 3.10: Mean richness in individual languages

Language      Mean R2      Language      Mean R2
Samoan        0.75         Tagalog       0.92
Rarotongan    0.76         Hungarian     0.93
Marquesan     0.77         Slovenian     0.93
Hawaiian      0.77         Italian       0.93
Maori         0.79         Czech         0.94
Lakota        0.87         Latin         0.94
Indonesian    0.89         Romanian      0.94
German        0.91         Marathi       0.95
English       0.91         Russian       0.95
Bulgarian     0.91         Kannada       0.95
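For completeness, the comparison (3.13) of two Ḡ(k) values can be sketched as follows (function name ours; the numbers are those of the Indonesian example above):

from math import sqrt

def z_r2(R2_1, Gk1, V1, R2_2, Gk2, V2):
    # criterion (3.13): corrected values in the numerator,
    # uncorrected proportions G(k) in the variances
    return (R2_1 - R2_2) / sqrt(Gk1 * (1 - Gk1) / V1 + Gk2 * (1 - Gk2) / V2)

# In-05 vs. In-03, values from the computation above
print(round(z_r2(0.8325, 0.8989, 188, 0.8995, 0.9639, 194), 2))   # ~ -2.60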
3.2.3 Indicator b

In Section 3.1 we defined a as an indicator which relativizes the influence of text length N on the h-point. By analogy, we can now define the corresponding indicator for the frequency spectrum, namely

b = V/k²,    (3.14)
relativizing the possible influence both of the vocabulary V and of text length N. In Figure 3.8 one can see that there is no relationship; however, for individual languages it may turn out to be present. While a could rather be presented as a dense cloud, the b-indicator displays such a large dispersion that nothing can be said about its behaviour with regard to a text. It seems to depend on author, style, language, or other boundary conditions which are not known.
Figure 3.8: Indicator b in relation to text length N
In order to scrutinize the problem at least in one way, we present the complete data in Table 3.11.
Table 3.11: Index b = V /k2 (the analogue of index a = N/h2 ) of 176 texts in 20 languages Text ID
N
V
k
b = V /k2
Text ID
N
V
k
b = V /k2
B-01 B-02 B-03 B-04 B-05 B-06 B-07 B-08 B-09 B-10 Cz-01 Cz-02 Cz-03 Cz-04 Cz-05 Cz-06 Cz-07 Cz-08 Cz-09 Cz-10 E-01 E-02 E-03 E-04 E-05 E-06 E-07 E-08 E-09 E-10 E-11 E-12 E-13 G-01 G-02
761 352 515 483 406 687 557 268 550 556 1044 984 2858 522 999 1612 2014 677 460 1156 2330 2971 3247 4622 4760 4862 5004 5083 5701 6246 8193 9088 11265 1095 845
400 201 285 286 238 388 324 179 313 317 638 543 1274 323 556 840 862 389 259 638 939 1017 1001 1232 1495 1176 1597 985 1574 1333 1669 1825 1659 530 361
5 4 4 4 4 4 5 3 6 5 6 5 7 5 5 5 6 4 5 6 6 7 8 7 7 9 7 12 8 10 11 10 11 8 6
16.00 12.56 17.81 17.88 14.88 24.25 12.96 19.89 8.69 12.68 17.72 21.72 26.00 12.92 22.24 33.6 23.94 24.31 10.36 17.72 26.08 20.76 15.64 25.14 30.51 14.52 32.59 6.84 24.59 13.33 13.79 18.25 13.71 8.28 10.03
Lt-05 Lt-06 M-01 M-02 M-03 M-04 M-05 Mq-01 Mq-02 Mq-03 Mr-001 Mr-002 Mr-003 Mr-004 Mr-005 Mr-006 Mr-007 Mr-008 Mr-009 Mr-010 Mr-015 Mr-016 Mr-017 Mr-018 Mr-020 Mr-021 Mr-022 Mr-023 Mr-024 Mr-026 Mr-027 Mr-028 Mr-029 Mr-030 Mr-031
1354 829 2062 1175 1434 1289 3620 2330 457 1509 2998 2922 4140 6304 4957 3735 3162 5477 6206 5394 4693 3642 4170 4062 3943 3846 4099 4142 4255 4146 4128 5191 3424 5504 5105
909 609 398 277 277 326 514 289 150 301 1555 1186 1731 2451 2029 1503 1262 1807 2387 1650 1947 1831 1853 1788 1825 1793 1703 1872 1731 2038 1400 2386 1412 2911 2617
6 4 6 6 8 8 7 9 5 7 9 7 11 10 10 10 8 10 12 10 9 9 9 10 10 9 9 9 10 10 9 9 9 9 8
25.25 38.06 11.06 7.69 4.33 5.09 10.49 3.57 6.00 6.14 19.20 24.20 14.31 24.51 20.29 15.03 19.72 18.07 16.58 16.50 24.04 22.60 22.88 17.88 18.25 22.14 21.02 23.11 17.31 20.38 17.28 29.46 17.43 35.94 40.89
(continued on next page)
Table 3.11 (continued from previous page) Text ID
N
V
k
b = V /k2
Text ID
N
V
k
b = V /k2
G-03 G-04 G-05 G-06 G-07 G-08 G-09 G-10 G-11 G-12 G-13 G-14 G-15 G-16 G-17 H-01 H-02 H-03 H-04 H-05 Hw-01 Hw-02 Hw-03 Hw-04 Hw-05 Hw-06 I-01 I-02 I-03 I-04 I-05 In-01 In-02 In-03 In-04 In-05 Kn-003 Kn-004 Kn-005 Kn-006
500 545 559 545 263 965 653 480 468 251 460 184 593 518 225 2044 1288 403 936 413 282 1829 3507 7892 7620 12356 11760 6064 854 3258 1129 376 373 347 343 414 3188 1050 4869 5231
281 269 332 326 169 509 379 301 297 169 253 129 378 292 124 1079 789 291 609 290 104 257 521 744 680 1039 3667 2203 483 1237 512 221 209 194 213 188 1833 720 2477 2433
5 4 5 6 4 5 5 4 7 4 4 3 4 5 3 7 6 4 5 3 5 8 7 9 9 10 14 10 5 6 5 3 4 5 4 5 8 5 9 10
11.24 16.81 13.28 9.06 10.56 20.36 15.16 18.81 6.06 10.56 15.81 14.33 23.63 11.68 13.78 22.02 21.92 18.19 24.36 32.22 4.16 4.02 10.63 9.19 8.40 10.39 18.71 22.03 19.32 34.36 20.48 24.56 13.06 7.76 13.31 7.52 28.64 28.8 30.58 24.33
Mr-032 Mr-033 Mr-034 Mr-035 Mr-036 Mr-038 Mr-040 Mr-043 Mr-046 Mr-052 Mr-149 Mr-150 Mr-151 Mr-154 Mr-288 Mr-289 Mr-290 Mr-291 Mr-292 Mr-293 Mr-294 Mr-295 Mr-296 Mr-297 R-01 R-02 R-03 R-04 R-05 R-06 Rt-01 Rt-02 Rt-03 Rt-04 Rt-05 Ru-01 Ru-02 Ru-03 Ru-04 Ru-05
5195 4339 3489 1862 4205 4078 5218 3356 4186 3549 2946 3372 4843 3601 4060 4831 4025 3954 4765 3337 3825 4895 3836 4605 1738 2279 1264 1284 1032 695 968 845 892 625 1059 753 2595 3853 6025 17205
2382 2217 1865 1115 2070 1607 2877 1962 1458 1628 1547 1523 1702 1719 2079 2312 2319 1957 2197 2006 1931 2322 1970 2278 843 1179 719 729 567 432 223 214 207 181 197 422 1240 1792 2536 6073
10 9 8 7 8 9 8 7 10 9 8 8 10 8 7 9 8 10 10 8 8 9 8 10 7 8 6 7 5 4 6 5 8 7 6 5 7 8 9 12
23.82 27.37 29.14 22.76 32.34 19.84 44.95 40.04 14.58 20.10 24.17 23.80 17.02 26.86 42.43 28.54 36.23 19.57 21.97 31.34 30.17 28.67 30.78 22.78 17.20 18.42 19.97 14.88 22.68 27.00 6.19 8.56 3.23 3.69 5.47 16.88 25.31 28.00 31.31 42.17
(continued on next page)
Table 3.11 (continued from previous page) Text ID
N
V
k
b = V /k2
Text ID
N
V
k
b = V /k2
Kn-011 Kn-012 Kn-013 Kn-016 Kn-017 Lk-01 Lk-02 Lk-03 Lk-04 Lt-01 Lt-02 Lt-03 Lt-04
4541 4141 1302 4735 4316 345 1633 809 219 3311 4010 4931 4285
2516 1842 807 2356 2122 174 479 272 116 2211 2334 2703 1910
8 9 5 11 9 4 6 6 4 7 7 9 9
39.31 22.74 32.28 19.47 26.20 10.88 13.31 7.56 7.25 45.12 47.63 33.37 23.58
Sl-01 Sl-02 Sl-03 Sl-04 Sl-05 Sm-01 Sm-02 Sm-03 Sm-04 Sm-05 T-01 T-02 T-03
756 1371 1966 3491 5588 1487 1171 617 736 447 1551 1827 2054
457 603 907 1102 2223 267 222 140 153 124 611 720 645
5 7 6 8 8 6 6 4 4 3 6 7 7
18.28 12.31 25.19 17.22 34.73 7.42 6.17 8.75 9.56 13.78 16.97 14.69 13.16
Because Table 3.11 presents the data ordered by language, no ordering by the index is visible. Table 3.12 therefore orders the results in a different way.

Table 3.12: Comparison of mean values of indices a and b in 20 languages
Index b = V/k² (ordered by increasing b) compared to index a = N/h² (ordered by increasing a)

Language      Mean b        Language      Mean a
Marquesan      5.24         Samoan         4.56
Rarotongan     5.43         Rarotongan     5.02
Maori          7.73         Hawaiian       5.37
Hawaiian       7.80         Maori          5.53
Samoan         9.13         Lakota         5.69
Lakota         9.75         Marquesan      5.69
Indonesian    13.24         Tagalog        7.24
German        13.50         English        7.65
Tagalog       14.94         Bulgarian      7.81
Bulgarian     15.76         German         8.39
English       19.67         Italian        8.41
Romanian      20.03         Romanian       9.15
Czech         21.05         Slovenian      9.19
Slovenian     21.55         Indonesian     9.58
Italian       22.98         Russian       10.10
Hungarian     23.74         Czech         10.33
Marathi       24.45         Marathi       11.82
Kannada       28.04         Kannada       16.58
Russian       28.73         Hungarian     18.02
Latin         35.50         Latin         19.56
Taking the means of b for the individual languages, ordering them according to increasing b, and comparing them with the order of languages according to index a, we can see that by and large b expresses the same fact: the greater b, the greater the synthetism of the language. Analytic languages are at the beginning of the scale, increasing synthetism at the other end. Thus the Polynesian languages are at the beginning and Latin is at the other end (cf. Table 3.12). Evidently, a much more thorough investigation is necessary in order to reveal finer relationships concerning typological properties of language and, with the given level of b, to find individual differences. Some authors suspect that language contacts and the development of language can also cause differences in a and/or b. Tests for b can be performed in the same way as for a.
3.3 The m-point
Consider again the rank frequency distribution of words. It is possible to arrange the frequencies in cumulative order, i.e. we add the relative frequencies up to the given r to get F(r) = P(R ≤ r). Though the symbol P is commonly used for probability, we 'misuse' it here for relative frequencies. Furthermore, in order to obtain a unique image for all texts, we consider also the ranks as relative values r_rel, i.e. r_rel = r/r_max. Since r_max = V, we obtain r_rel = r/V. Let us illustrate this kind of counting, using Goethe's Erlkönig (cf. Popescu & Altmann 2006).

Table 3.13: Rank frequency distribution of words in Goethe's Erlkönig

Rank r   f(r)     Rank r   f(r)     Rank r   f(r)     Rank r     f(r)
1        11       11       4        21       3        31         2
2         9       12       4        22       2        32         2
3         9       13       4        23       2        33         2
4         7       14       4        24       2        34         2
5         6       15       4        25       2        35         2
6         6       16       3        26       2        36         2
7         5       17       3        27       2        37         2
8         5       18       3        28       2        38         2
9         4       19       3        29       2        39         2
10        4       20       3        30       2        40-124*    1

* The ranks 40 to 124 have frequency 1
Whereas Table 3.13 presents the rank frequency distribution of words in this text, Table 3.14 presents the cumulative distribution of the ranked word forms of Goethe's Erlkönig; both the relative ranks r_rel, obtained by dividing each r by V = 124, and the cumulative relative frequencies F_r,rel are given.

Table 3.14: Cumulative distribution of ranked word forms of Goethe's Erlkönig

r_rel    F_r,rel     r_rel    F_r,rel     r_rel    F_r,rel     r_rel    F_r,rel
0.0081   0.0489      0.0887   0.3111      0.1693   0.4622      0.2500   0.5511
0.0161   0.0889      0.0968   0.3289      0.1774   0.4711      0.2581   0.5600
0.0242   0.1289      0.1048   0.3467      0.1855   0.4800      0.2661   0.5689
0.0322   0.1600      0.1129   0.3644      0.1935   0.4889      0.2742   0.5778
0.0403   0.1867      0.1210   0.3822      0.2016   0.4978      0.2823   0.5867
0.0484   0.2133      0.1290   0.3956      0.2097   0.5067      0.2903   0.5956
0.0565   0.2356      0.1371   0.4089      0.2177   0.5156      0.2983   0.6044
0.0645   0.2578      0.1452   0.4222      0.2258   0.5244      0.3065   0.6133
0.0726   0.2756      0.1532   0.4356      0.2339   0.5333      0.3145   0.6222
0.0807   0.2933      0.1613   0.4489      0.2419   0.5422      *        **

* for the ranks from 40 up to 124, r_rel continues by steps of 1/124 = 0.0081
** F_r,rel continues by steps of 0.00444444
The F_r,rel values are, by definition, monotonically increasing, as can be seen in Figure 2.4 (p. 12, Chapter 2). Since both r_rel and F_r,rel have their maxima at 1, there is a pair of values (r_rel, F_r,rel) which is nearest to ⟨0, 1⟩. Let us call it the m-point (cf. Popescu & Altmann 2006). It can be found by computing the minimum of

D_r = sqrt[r_rel² + (1 − F_r,rel)²].    (3.15)

Analyzing the values in Table 3.14, we obtain

D_1 = [0.0081² + (1 − 0.0489)²]^(1/2) = 0.9511
D_2 = 0.9113
D_3 = 0.8714
D_4 = 0.8406
...
D_38 = 0.4934
D_39 = 0.4916   (minimum)
D_40 = 0.4934
...
D_124 = 1.0000
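The minimization in (3.15) is easily carried out by a small routine such as the following sketch (function name ours), which takes the ranked frequencies of Table 3.13 and returns m, min D, F(m) and R3:

from math import sqrt

def m_point(freqs):
    # m-point: the <relative rank, cumulative coverage> pair closest to <0, 1>,
    # cf. (3.15); freqs is the ranked frequency list f(1) >= f(2) >= ...
    N, V = sum(freqs), len(freqs)
    best_r, best_d = 1, float("inf")
    running = 0
    for r, f in enumerate(freqs, start=1):
        running += f
        d = sqrt((r / V) ** 2 + (1 - running / N) ** 2)
        if d < best_d:
            best_r, best_d = r, d
    Fm = sum(freqs[:best_r]) / N
    return best_r, best_d, Fm, 1 - Fm      # m, min D, F(m), R3

# Erlkoenig rank frequencies (Table 3.13)
erlkoenig = [11, 9, 9, 7, 6, 6, 5, 5] + [4] * 7 + [3] * 6 + [2] * 18 + [1] * 85
print(m_point(erlkoenig))   # ~ (39, 0.4916, 0.6222, 0.3778)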
50 The h- and related points The smallest distance is at rank m = 39 (relative rank mr = 39/124 = 0.3145) where frequency 2 occurred for the last time, and it is F(m) = 0.4916. In this case, it is the boundary to hapax legomena. Let us investigate this point in different texts and languages and study its dependence on N. In Table 3.15 one finds a survey of results presented graphically in Figure 3.9 (see p. 53). In Table 3.15 we present also the indicator R3 which will be treated in Section 3.3.1 (p. 52ff.). Table 3.15: The m-point, min Dr , the cumulative relative frequencies F(m) up to this point, and the indicator R3 for 176 texts in 20 languages Text ID N B-01 B-02 B-03 B-04 B-05 B-06 B-07 B-08 B-09 B-10 Cz-01 Cz-02 Cz-03 Cz-04 Cz-05 Cz-06 Cz-07 Cz-08 Cz-09 Cz-10 E-01 E-02 E-03 E-04 E-05 E-06 E-07 E-08
761 352 515 483 406 687 557 268 550 556 1044 984 2858 522 999 1612 2014 677 460 1156 2330 2971 3247 4622 4760 4862 5004 5083
V
m
min D
Fm
R3
400 201 285 286 238 388 324 179 313 317 638 543 1274 323 556 840 862 389 259 638 939 1017 1001 1232 1495 1176 1597 985
101 49 72 74 61 94 82 55 77 78 173 130 309 89 131 179 232 103 70 149 239 207 207 271 293 234 307 194
0.4670 0.4959 0.4846 0.5095 0.5057 0.4918 0.5028 0.5554 0.4946 0.4953 0.5215 0.4832 0.4157 0.5262 0.4863 0.4621 0.4145 0.4986 0.4918 0.4832 0.4060 0.3605 0.3626 0.3518 0.3590 0.3388 0.3568 0.3188
0.6071 0.5682 0.5864 0.5611 0.5640 0.5721 0.5655 0.5373 0.5709 0.5701 0.5546 0.5803 0.6624 0.5517 0.5746 0.5900 0.6847 0.5775 0.5891 0.5770 0.6837 0.7025 0.7022 0.7254 0.6992 0.7258 0.6994 0.7494
0.3929 0.4318 0.4136 0.4389 0.4360 0.4279 0.4345 0.4627 0.4291 0.4299 0.4454 0.4197 0.3376 0.4483 0.4254 0.4100 0.3153 0.4225 0.4109 0.4230 0.3163 0.2975 0.2978 0.2746 0.3008 0.2742 0.3006 0.2506
Text ID N Lt-05 Lt-06 M-01 M-02 M-03 M-04 M-05 Mq-01 Mq-02 Mq-03 Mr-001 Mr-002 Mr-003 Mr-004 Mr-005 Mr-006 Mr-007 Mr-008 Mr-009 Mr-010 Mr-015 Mr-016 Mr-017 Mr-018 Mr-020 Mr-021 Mr-022 Mr-023
1354 829 2062 1175 1434 1289 3620 2330 457 1509 2998 2922 4140 6304 4957 3735 3162 5477 6206 5394 4693 3642 4170 4062 3943 3846 4099 4142
V 909 609 398 277 277 326 514 289 150 301 1555 1186 1731 2451 2029 1503 1262 1807 2387 1650 1947 1831 1853 1788 1825 1793 1703 1872
m
min D
Fm
R3
282 0.5574 0.5369 0.4631 214 0.5920 0.5235 0.4765 87 0.3218 0.7638 0.2362 66 0.3486 0.7455 0.2545 61 0.3272 0.7580 0.2420 67 0.3405 0.7285 0.2715 97 0.2848 0.7867 0.2133 66 0.3220 0.7730 0.2270 36 0.3909 0.6915 0.3085 66 0.3405 0.7396 0.2604 426 0.4657 0.6234 0.3766 321 0.4288 0.6674 0.3326 486 0.4380 0.6638 0.3362 627 0.4170 0.6707 0.3293 556 0.4322 0.6657 0.3343 407 0.4318 0.6637 0.3363 335 0.4261 0.6667 0.3333 414 0.3811 0.6955 0.3045 607 0.4173 0.6692 0.3308 415 0.3752 0.7215 0.2785 523 0.4209 0.6759 0.3241 504 0.4566 0.6356 0.3644 544 0.4418 0.6698 0.3302 508 0.4298 0.6775 0.3225 496 0.4369 0.6711 0.3289 549 0.4490 0.6716 0.3284 452 0.4151 0.6809 0.3191 553 0.4406 0.6731 0.3269 (continued on next page)
min D
R3
Table 3.15 (continued from previous page) Text ID
N
V
m
min D
Fm
R3
E-09 E-10 E-11 E-12 E-13 G-01 G-02 G-03 G-04 G-05 G-06 G-07 G-08 G-09 G-10 G-11 G-12 G-13 G-14 G-15 G-16 G-17 H-01 H-02 H-03 H-04 H-05 Hw-01 Hw-02 Hw-03 Hw-04 Hw-05 Hw-06 I-01 I-02 I-03 I-04 I-05 In-01 In-02
5701 6246 8193 9088 11265 1095 845 500 545 559 545 263 965 653 480 468 251 460 184 593 518 225 2044 1288 403 963 413 282 1829 3507 7892 7620 12356 11760 6064 854 3258 1129 376 373
1574 1333 1669 1825 1659 530 361 281 269 332 326 169 509 379 301 297 169 253 129 378 292 124 1079 789 291 609 290 104 257 521 744 680 1039 3667 2203 483 1237 512 221 209
325 278 334 337 298 134 105 67 83 87 86 49 129 96 85 85 53 76 43 109 76 39 235 215 100 182 96 31 54 100 118 111 156 675 484 117 297 151 57 61
0.3335 0.3216 0.3169 0.2945 0.2898 0.4413 0.4490 0.4899 0.4601 0.5106 0.5133 0.5406 0.4683 0.5020 0.5313 0.5358 0.5585 0.4882 0.5741 0.5375 0.4916 0.4916 0.4668 0.5224 0.5854 0.5582 0.5747 0.4315 0.3186 0.2886 0.2399 0.2386 0.2208 0.3477 0.3737 0.4923 0.3972 0.4383 0.5067 0.4926
0.7381 0.7552 0.7543 0.7706 0.7726 0.6384 0.6580 0.5720 0.6587 0.5617 0.5596 0.5437 0.6062 0.5666 0.5500 0.5470 0.5378 0.6152 0.5326 0.5464 0.5830 0.6222 0.5871 0.5543 0.5261 0.5286 0.5302 0.6869 0.7605 0.7844 0.8199 0.8260 0.8381 0.7050 0.6977 0.5714 0.6835 0.6758 0.5638 0.6032
0.2619 0.2448 0.2457 0.2294 0.2274 0.3616 0.3420 0.4280 0.3413 0.4383 0.4404 0.4563 0.3938 0.4334 0.4500 0.4530 0.4622 0.3848 0.4674 0.4536 0.4170 0.3778 0.4129 0.4457 0.4739 0.4714 0.4698 0.3131 0.2395 0.2156 0.1801 0.1740 0.1619 0.2950 0.3023 0.4286 0.3165 0.3242 0.4362 0.3968
Text ID N Mr-024 Mr-026 Mr-027 Mr-028 Mr-029 Mr-030 Mr-031 Mr-032 Mr-033 Mr-034 Mr-035 Mr-036 Mr-038 Mr-040 Mr-043 Mr-046 Mr-052 Mr-149 Mr-150 Mr-151 Mr-154 Mr-288 Mr-289 Mr-290 Mr-291 Mr-292 Mr-293 Mr-294 Mr-295 Mr-296 Mr-297 R-01 R-02 R-03 R-04 R-05 R-06 Rt-01 Rt-02 Rt-03
4255 4146 4128 5191 3424 5504 5105 5195 4339 3489 1862 4205 4078 5218 3356 4186 3549 2946 3372 4843 3601 4060 4831 4025 3954 4765 3337 3825 4895 3836 4605 1738 2279 1264 1284 1032 695 968 845 892
V 1731 2038 1400 2386 1412 2911 2617 2382 2217 1865 1115 2070 1607 2877 1962 1458 1628 1547 1523 1702 1719 2079 2312 2319 1957 2197 2006 1931 2322 1970 2278 843 1179 719 729 567 432 223 214 207
m
Fm
469 0.4297 0.6665 0.3335 551 0.4491 0.6413 0.3587 350 0.3932 0.6965 0.3035 670 0.4337 0.6694 0.3306 387 0.4312 0.6671 0.3329 706 0.4683 0.5994 0.4006 682 0.4600 0.6210 0.3790 659 0.4319 0.6683 0.3317 628 0.4630 0.6338 0.3662 495 0.4740 0.6073 0.3927 294 0.5138 0.5591 0.4409 568 0.4504 0.6428 0.3572 421 0.4231 0.6677 0.3323 671 0.4828 0.5772 0.4228 500 0.5047 0.5644 0.4356 378 0.4057 0.6880 0.3120 484 0.4398 0.6760 0.3240 482 0.4772 0.6385 0.3615 465 0.4555 0.6619 0.3381 397 0.4004 0.6746 0.3254 535 0.5327 0.6712 0.3288 544 0.4598 0.6219 0.3781 655 0.4449 0.6570 0.3430 578 0.4992 0.5675 0.4325 546 0.4530 0.6431 0.3569 650 0.4392 0.6753 0.3247 532 0.5152 0.5583 0.4417 548 0.4596 0.6384 0.3616 697 0.4476 0.6680 0.3320 509 0.4602 0.6191 0.3809 607 0.4502 0.6371 0.3629 236 0.4476 0.6507 0.3493 270 0.4599 0.6011 0.3989 176 0.4944 0.5704 0.4296 178 0.4937 0.5709 0.4291 142 0.4820 0.5882 0.4118 120 0.5279 0.5511 0.4489 51 0.3350 0.7552 0.2448 45 0.3365 0.7373 0.2627 52 0.3706 0.7276 0.2724 (continued on next page)
Table 3.15 (continued from previous page) Text ID N In-03 In-04 In-05 Kn-003 Kn-004 Kn-005 Kn-006 Kn-011 Kn-012 Kn-013 Kn-016 Kn-017 Lk-01 Lk-02 Lk-03 Lk-04 Lt-01 Lt-02 Lt-03 Lt-04
347 343 414 3188 1050 4869 5231 4541 4141 1302 4735 4316 345 1633 809 219 3311 4010 4931 4285
V
m
min D
Fm
R3
Text ID
N
V
m
min D
Fm
R3
194 213 188 1833 720 2477 2433 2516 1842 807 2356 2122 174 479 272 116 2211 2334 2703 1910
63 67 57 459 230 692 745 642 527 224 660 608 46 91 57 35 682 591 653 520
0.4980 0.5293 0.4542 0.4985 0.5655 0.4609 0.4493 0.4852 0.4307 0.5268 0.4547 0.4529 0.4556 0.3464 0.3785 0.4773 0.5553 0.5030 0.3740 0.3945
0.6225 0.5743 0.6618 0.5690 0.5333 0.6334 0.6712 0.5873 0.6781 0.5522 0.6418 0.6492 0.6290 0.7103 0.6848 0.6301 0.5382 0.5653 0.7185 0.7144
0.3775 0.4257 0.3382 0.4310 0.4667 0.3666 0.3288 0.4127 0.3219 0.4478 0.3582 0.3508 0.3710 0.2897 0.3152 0.3699 0.4618 0.4347 0.2815 0.2856
Rt-04 Rt-05 Ru-01 Ru-02 Ru-03 Ru-04 Ru-05 Sl-01 Sl-02 Sl-03 Sl-04 Sl-05 Sm-01 Sm-02 Sm-03 Sm-04 Sm-05 T-01 T-02 T-03
625 1059 753 2595 3853 6025 17205 756 1371 1966 3491 5588 1487 1171 617 736 447 1551 1827 2054
181 197 422 1240 1792 2536 6073 457 603 907 1102 2223 267 222 140 153 124 611 720 645
44 48 105 293 426 668 1289 122 171 255 218 553 50 49 33 36 29 145 172 119
0.3732 0.3609 0.4890 0.4348 0.4269 0.4090 0.3680 0.5173 0.4283 0.4348 0.3638 0.3994 0.3152 0.3408 0.3480 0.3394 0.3663 0.3829 0.3856 0.3472
0.7168 0.7337 0.5790 0.6351 0.6455 0.6871 0.6994 0.5569 0.6791 0.6684 0.6946 0.6875 0.7465 0.7404 0.7439 0.7554 0.7181 0.6995 0.6974 0.7059
0.2832 0.2663 0.4210 0.3649 0.3545 0.3129 0.3006 0.4431 0.3209 0.3316 0.3054 0.3125 0.2535 0.2596 0.2561 0.2446 0.2819 0.3005 0.3026 0.2941
As can easily be seen, there is a trend for F(m) to increase with increasing N, but there is no function that could capture the enormous dispersion of the points in a satisfactory way. The trend is probably exponential and could eventually be used for single languages or genres, yet not for texts overall. But if we take e.g. Marathi, the language from which we have the greatest number of texts, we obtain for F(m) a horizontal straight line. Thus F(m) is a characteristic of an individual text. It can also be seen in Figure 3.9 that a dependence on N is not conspicuous.

Figure 3.9: m-coverage of texts

3.3.1 The m-coverage of text and vocabulary richness

The m-point divides the words into two parts, in the same way as the h-point did. The part F(m) represents the m-coverage of the text, i.e. the percentage of frequencies up to the m-point (covering the m-part of the text). Rank m is an objectively definable quantity. It changes with increasing N, but min D need not. The same holds for F(m). The complement of F(m), namely

1 − F(m) = R3,    (3.16)
is again a measure of vocabulary richness, because the words in this domain are repeated only once or twice and build the real richness of the text (see Table 3.15). Consider the following situation: if each word occurred exactly once, guaranteeing maximum richness, F(m) would represent a straight line connecting [0, 0] with [1, 1] – if we relativize the ranks. The point nearest to [0, 1] is then of course [0.5, 0.5], yielding min D = (0.5² + 0.5²)^(1/2) = 0.5^(1/2) = 0.71. If some words occur more than once, the straight line changes into a concave curve and comes nearer to [0, 1]. The smaller min D, the greater F(m) and the m-coverage. But the greater F(m), the smaller 1 − F(m), i.e. the vocabulary richness. Thus the more words have a frequency greater than one, the smaller the vocabulary richness will be. These concepts have a clear linguistic interpretation and can be used in this sense. As can be seen in Figure 3.10, the dependence on N is almost non-existent. Though the differences between the min D or F(m) values are small, a test can be set up without any problems. We restrict ourselves to F(m); for 1 − F(m) the test can be made analogously. Since F(m) is a proportion whose variance is F(m)(1 − F(m))/N, one can compare the m-coverages of
two texts using the usual normal criterion

z = [F(m)1 − F(m)2] / sqrt[Var(F(m)1) + Var(F(m)2)].    (3.17)

The test for the difference in vocabulary richness yields the same formula but with the opposite interpretation: if in (3.17) z > 1.96, the first text has a greater m-coverage but a smaller vocabulary richness; vice versa, if z < −1.96, the first text has a greater vocabulary richness but a smaller m-coverage.

Figure 3.10: The relation of R3 to text length N
3.4 Gini's coefficient and the n-point
In some other sciences, such as economics or sociology, one frequently uses the so-called Gini coefficient, e.g. for describing the distribution of wealth. It can be shown that this coefficient is also applicable to textological concepts like coverage and richness (cf. Popescu & Altmann 2006). As a matter of fact, this is only a complement to Section 3.3, but since there are programs for computing Gini's coefficient, the concept itself is more familiar, and the n-point is a logical consequence of the previous sections, we present it
Gini’s coefficient and the n-point
55
here. To this end, we use the rank frequency distribution of words but perform two changes: (1) we rank "from below", i.e. the smallest frequency obtains rank 1, the next (equal or greater) frequency rank 2, etc., and the greatest frequency rank r_max (= V); (2) we relativize both the ranks (r_r = r/V) and the frequencies (p_r = f_r/N), i.e. both variables lie in the interval ⟨0, 1⟩. Now, if each word occurred exactly once, the sequence of cumulative relative frequencies plotted against the relative ranks would be practically a straight line between [0, 0] and [1, 1].² This situation is shown in tabular form in Table 3.16, where the word frequencies in Goethe's Erlkönig have been reversed, relativized and cumulated.
Table 3.16: Reverse ranking of frequencies in Goethe's Erlkönig (from Popescu & Altmann 2006)

Rank r   Frequency f_r   Relative rank r_r   Relative frequency p_r   Relative cumulative frequency F_r
1        1               1/124 = 0.00806     1/225 = 0.00444          0.00444
2        1               0.01613             0.00444                  0.00888
3        1               0.02419             0.00444                  0.01333
4        1               0.03226             0.00444                  0.01778
5        1               0.04032             0.00444                  0.02222
...      ...             ...                 ...                      ...
124      11              1.00000             11/225 = 0.04889         1.00000
If all words had frequency 1, i.e. if there were maximal vocabulary richness, the dependent variable would lie exactly on the straight line through the points (0,0) and (1,1). Reduced vocabulary richness shifts the cumulative frequencies downwards, away from the diagonal. The greater the area between the diagonal and the sequence of cumulative frequencies, called the Lorenz curve, the smaller the vocabulary richness. Thus, as in the case above, we obtain a more objective measure of vocabulary richness if we compute the area between the diagonal and the Lorenz curve. The curve is shown in Figure 3.11.
2. As a matter of fact, between [1/V , 1/N] and [1, 1] but the first point is so near to [0,0] that the difference can be neglected.
Figure 3.11: The Lorenz curve
The area between the Lorenz curve and the straight line consists of small trapezoids. The area of a trapezoid is given as

A = [(a + b)/2] h    (3.18)

where a and b are the parallel sides and h is the height. Here the height is h = r_r,i+1 − r_r,i, while the two sides are given as a = r_r,i − F_i and b = r_r,i+1 − F_i+1. Hence

A_i = {[(r_r,i − F_i) + (r_r,i+1 − F_i+1)]/2} (r_r,i+1 − r_r,i)    (3.19)

is the area of the corresponding trapezoid. For the first trapezoid we have

A_1 = {[(r_r,1 − F_1) + (r_r,2 − F_2)]/2} (r_r,2 − r_r,1)
    = [(0.00806 − 0.00444) + (0.01613 − 0.00888)] (0.01613 − 0.00806)/2
    = 0.0000435
Gini’s coefficient and the n-point
57
and Gini’s coefficients will be computed as follows. First we compute the sum of the trapezoids G1 =
V −1
∑
i=1
(rri − Fi ) + (ri+1 − Fi+1 ) (rri+1 − rri ) 2
(3.20)
yielding G1 = 0.1828. Then we compute the proportion of this area to the whole area under the straight line, i.e. G = 0.1828/0.5 = 0.3656. Fortunately, there are other methods for computing G without the necessity of reversing the frequencies and computing relative frequencies and cumulative frequencies. Consider V as the highest rank and N as text length, i.e. sum of all frequencies as given in Table 3.16 (V = 124, N = 225). Then we obtain G directly as ! 1 2 V G = V + 1 − ∑ r fr (3.21) V N r=1 yielding exactly the same value as the first procedure. Still other variants are known. For V ≫ 1 the following equality is approximately satisfied: G = 1−
2 V ∑ r fr . V N r=1
(3.22)
Gini’s coefficient shows the position of the text between maximal and minimal vocabulary richness. Of course, the situation will differ with lemmatized texts and with the frequency spectrum. In order to develop a test for the difference of two G’s we see that G in formula 3.22 can be written as G = 1 − 2µ1 /V , hence the variance is simply Var(G) =
4σ 2 , V 2N
(3.23)
where σ 2 is the variance of the independent variable (rank). Now since G represents the area between the diagonal and the Lorenz curve, the greater the area, the smaller the vocabulary richness. Hence we consider rather the complement and define R4 = 1 − G. (3.24)
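Formula (3.21) makes the computation of G, and hence of R4, a one-liner; the following sketch (function name ours) reproduces the Erlkönig values:

def gini(freqs):
    # Gini coefficient via (3.21); freqs is the ordinary rank-frequency list
    # f(1) >= f(2) >= ... (no reversing or relativizing needed)
    V, N = len(freqs), sum(freqs)
    s = sum(r * f for r, f in enumerate(freqs, start=1))
    return (V + 1 - 2 * s / N) / V

erlkoenig = [11, 9, 9, 7, 6, 6, 5, 5] + [4] * 7 + [3] * 6 + [2] * 18 + [1] * 85
G = gini(erlkoenig)
print(round(G, 4), round(1 - G, 4))   # ~0.3656 and R4 = 1 - G ~ 0.6344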
The values of R4 for several texts are given in Table 3.17, with Diff(G) = (1 − G) − (1 − Gt) = Gt − G and the fitting equation 1 − Gt = 1.181/N^0.111; the corresponding graph is shown in Figure 3.12 (p. 63).

Table 3.17: Vocabulary richness R4 for 176 texts in 20 languages

Text ID B-01 B-02 B-03 B-04 B-05 B-06 B-07 B-08 B-09 B-10 Cz-01 Cz-02 Cz-03 Cz-04 Cz-05 Cz-06 Cz-07 Cz-08 Cz-09 Cz-10 E-01 E-02 E-03 E-04 E-05 E-06 E-07 E-08 E-09 E-10
N
V
Sumr f
1−G
761 352 515 483 406 687 557 268 550 556 1044 984 2858 522 999 1612 2014 677 460 1156 2330 2971 3247 4622 4760 4862 5004 5083 5701 6246
400 201 285 286 238 388 324 179 313 317 638 543 1274 323 556 840 862 389 259 638 939 1017 1001 1232 1495 1176 1597 985 1574 1333
88591 22215 45082 44222 30578 80661 57287 17028 53230 54366 214647 160290 889966 56837 164549 377804 419144 83076 37242 217569 504912 601971 626658 1031490 1363579 966341 1519366 770562 1552112 1319948
0.5796 0.6230 0.6108 0.6368 0.6287 0.6026 0.6318 0.7043 0.6152 0.6138 0.6429 0.5981 0.4881 0.6711 0.5907 0.5568 0.4817 0.6283 0.6213 0.5884 0.4605 0.3975 0.3846 0.3615 0.3826 0.3372 0.3796 0.3068 0.3453 0.3163
1 − Gt
Diff(G)
0.5655 0.0141 0.6160 0.0070 0.5905 0.0203 0.5947 0.0421 0.6063 0.0224 0.5719 0.0307 0.5854 0.0464 0.6349 0.0694 0.5862 0.0290 0.5855 0.0283 0.5460 0.0969 0.5496 0.0485 0.4882 -0.0001 0.5896 0.0815 0.5487 0.0420 0.5203 0.0365 0.5076 -0.0259 0.5729 0.0554 0.5980 0.0233 0.5398 0.0486 0.4994 -0.0389 0.4861 -0.0886 0.4814 -0.0968 0.4629 -0.1014 0.4614 -0.0788 0.4603 -0.1231 0.4588 -0.0792 0.4580 -0.1512 0.4522 -0.1069 0.4476 -0.1313 (continued on next page)
Gini’s coefficient and the n-point
59
Table 3.17 (continued from previous page) Text ID
N
V
Sumr f
1−G
E-11 E-12 E-13 G-01 G-02 G-03 G-04 G-05 G-06 G-07 G-08 G-09 G-10 G-11 G-12 G-13 G-14 G-15 G-16 G-17 H-01 H-02 H-03 H-04 H-05 Hw-01 Hw-02 Hw-03 Hw-04 Hw-05 Hw-06 I-01 I-02 I-03 I-04 I-05 In-01
8193 9088 11265 1095 845 500 545 559 545 263 965 653 480 468 251 460 184 593 518 225 2044 1288 403 936 413 282 1829 3507 7892 7620 12356 11760 6064 854 3258 1129 376
1669 1825 1659 530 361 281 269 332 326 169 509 379 301 297 169 253 129 378 292 124 1079 789 291 609 290 104 257 521 744 680 1039 3667 2203 483 1237 512 221
2087353 2348072 2472829 155288 81355 42765 41746 59008 57554 15377 142636 77017 48364 47262 15040 36255 8750 75808 47639 8962 622888 326381 43214 192029 43255 7664 74335 245268 592291 523790 1135689 7973075 2774597 124735 896809 151339 26883
0.3047 0.2826 0.2640 0.5333 0.5306 0.6052 0.5658 0.6329 0.6448 0.6860 0.5788 0.6198 0.6662 0.6767 0.7032 0.6191 0.7295 0.6737 0.6265 0.6344 0.5639 0.6411 0.7335 0.6721 0.7189 0.5130 0.3124 0.2666 0.2004 0.2007 0.1760 0.3695 0.4149 0.6027 0.4442 0.5217 0.6425
1 − Gt
Diff(G)
0.4344 -0.1297 0.4294 -0.1468 0.4193 -0.1553 0.5431 -0.0098 0.5589 -0.0283 0.5925 0.0127 0.5868 -0.0210 0.5852 0.0477 0.5868 0.0580 0.6363 0.0497 0.5508 0.0280 0.5752 0.0446 0.5952 0.0710 0.5968 0.0799 0.6396 0.0636 0.5980 0.0211 0.6620 0.0675 0.5814 0.0923 0.5901 0.0364 0.6474 -0.0130 0.5067 0.0572 0.5334 0.1077 0.6068 0.1267 0.5526 0.1195 0.6052 0.1137 0.6314 -0.1184 0.5130 -0.2006 0.4773 -0.2107 0.4362 -0.2358 0.4379 -0.2372 0.4150 -0.2390 0.4173 -0.0478 0.4491 -0.0342 0.5583 0.0444 0.4812 -0.0370 0.5413 -0.0196 0.6115 0.0310 (continued on next page)
Table 3.17 (continued from previous page) Text ID
N
V
Sumr f
1−G
In-02 In-03 In-04 In-05 Kn-003 Kn-004 Kn-005 Kn-006 Kn-011 Kn-012 Kn-013 Kn-016 Kn-017 Lk-01 Lk-02 Lk-03 Lk-04 Lt-01 Lt-02 Lt-03 Lt-04 Lt-05 Lt-06 M-01 M-02 M-03 M-04 M-05 Mq-01 Mq-02 Mq-03 Mr-001 Mr-002 Mr-003 Mr-004 Mr-005 Mr-006
373 347 343 414 3188 1050 4869 5231 4541 4141 1302 4735 4316 345 1633 809 219 3311 4010 4931 4285 1354 829 2062 1175 1434 1289 3620 2330 457 1509 2998 2922 4140 6304 4957 3735
209 194 213 188 1833 720 2477 2433 2516 1842 807 2356 2122 174 479 272 116 2211 2334 2703 1910 909 609 398 277 277 326 514 289 150 301 1555 1186 1731 2451 2029 1503
24767 21784 25668 22094 1836925 274373 3435219 3441046 3469724 1974970 350044 3132855 2549185 17273 145365 46708 7751 2553155 2873725 3964172 2075733 433038 191052 131813 59365 66274 75639 251332 103994 15370 76440 1349591 860423 1842506 3651695 2528250 1412525
0.6306 0.6420 0.6980 0.5624 0.6282 0.7245 0.5693 0.5403 0.6070 0.5173 0.6651 0.5612 0.5562 0.5697 0.3696 0.4208 0.6016 0.6971 0.6137 0.5945 0.5067 0.7026 0.7552 0.3187 0.3612 0.3301 0.3569 0.2682 0.3054 0.4418 0.3333 0.5783 0.4957 0.5136 0.4723 0.5023 0.5026
1 − Gt
Diff(G)
0.6121 0.0185 0.6170 0.0250 0.6178 0.0802 0.6050 -0.0426 0.4823 0.1459 0.5456 0.1789 0.4602 0.1091 0.4565 0.0838 0.4638 0.1432 0.4685 0.0488 0.5328 0.1323 0.4616 0.0996 0.4664 0.0898 0.6174 -0.0477 0.5195 -0.1499 0.5617 -0.1409 0.6493 -0.0477 0.4803 0.2168 0.4702 0.1435 0.4596 0.1349 0.4668 0.0399 0.5304 0.1722 0.5601 0.1951 0.5062 -0.1875 0.5389 -0.1777 0.5271 -0.1970 0.5333 -0.1764 0.4756 -0.2074 0.4994 -0.1940 0.5984 -0.1566 0.5241 -0.1908 0.4856 0.0927 0.4870 0.0087 0.4686 0.0450 0.4472 0.0251 0.4593 0.0430 0.4739 0.0287 (continued on next page)
Gini’s coefficient and the n-point
61
Table 3.17 (continued from previous page) Text ID
N
V
Sumr f
1−G
Mr-007 Mr-008 Mr-009 Mr-010 Mr-015 Mr-016 Mr-017 Mr-018 Mr-020 Mr-021 Mr-022 Mr-023 Mr-024 Mr-026 Mr-027 Mr-028 Mr-029 Mr-030 Mr-031 Mr-032 Mr-033 Mr-034 Mr-035 Mr-036 Mr-038 Mr-040 Mr-043 Mr-046 Mr-052 Mr-149 Mr-150 Mr-151 Mr-154 Mr-288 Mr-289 Mr-290 Mr-291
3162 5477 6206 5394 4693 3642 4170 4062 3943 3846 4099 4142 4255 4146 4128 5191 3424 5504 5105 5195 4339 3489 1862 4205 4078 5218 3356 4186 3549 2946 3372 4843 3601 4060 4831 4025 3954
1262 1807 2387 1650 1947 1831 1853 1788 1825 1793 1703 1872 1731 2038 1400 2386 1412 2911 2617 2382 2217 1865 1115 2070 1607 2877 1962 1458 1628 1547 1523 1702 1719 2079 2312 2319 1957
933718 2078761 3558246 1826299 1896378 1876019 2025797 1845804 1753961 1866518 1686761 2033442 1831108 2318433 1290280 3223199 1219981 4616687 3772670 3195847 2761959 1921843 679711 2391389 1604617 4461230 2071914 1418646 1524287 1369251 1397431 1862666 1706487 2387755 3013110 2916065 2143888
0.4672 0.4195 0.4800 0.4098 0.4146 0.5621 0.5238 0.5077 0.4869 0.5408 0.4827 0.5240 0.4966 0.5483 0.4458 0.5201 0.5040 0.5759 0.5644 0.5161 0.5738 0.5902 0.6539 0.5490 0.4891 0.5940 0.6288 0.4642 0.5270 0.6002 0.5436 0.4514 0.5508 0.5653 0.5391 0.6244 0.5536
1 − Gt
Diff(G)
0.4828 -0.0156 0.4542 -0.0347 0.4480 0.0320 0.4550 -0.0452 0.4621 -0.0475 0.4753 0.0868 0.4682 0.0556 0.4695 0.0382 0.4711 0.0158 0.4724 0.0684 0.4691 0.0136 0.4685 0.0555 0.4671 0.0295 0.4685 0.0798 0.4687 -0.0229 0.4569 0.0632 0.4785 0.0255 0.4540 0.1219 0.4578 0.1066 0.4569 0.0592 0.4661 0.1077 0.4775 0.1127 0.5120 0.1419 0.4677 0.0813 0.4693 0.0198 0.4567 0.1373 0.4796 0.1492 0.4680 -0.0038 0.4766 0.0504 0.4866 0.1136 0.4794 0.0642 0.4605 -0.0091 0.4759 0.0749 0.4696 0.0957 0.4606 0.0785 0.4700 0.1544 0.4710 0.0826 (continued on next page)
Table 3.17 (continued from previous page) Text ID
N
V
Sumr f
1−G
1 − Gt
Diff(G)
Mr-292 Mr-293 Mr-294 Mr-295 Mr-296 Mr-297 R-01 R-02 R-03 R-04 R-05 R-06 Rt-01 Rt-02 Rt-03 Rt-04 Rt-05 Ru-01 Ru-02 Ru-03 Ru-04 Ru-05 Sl-01 Sl-02 Sl-03 Sl-04 Sl-05Sm-01 Sm-02 Sm-03 Sm-04 Sm-05 T-01 T-02 T-03
4765 3337 3825 4895 3836 4605 1738 2279 1264 1284 1032 695 968 845 892 625 1059 753 2595 3853 6025 17205 756 1371 1966 3491 5588 1487 1171 617 736 447 1551 1827 2054
2197 2006 1931 2322 1970 2278 843 1179 719 729 567 432 223 214 207 181 197 422 1240 1792 2536 6073 457 603 907 1102 2223 267 222 140 153 124 611 720 645
2761512 2178852 2087238 3075979 2136424 2886154 397986 748846 276173 285187 175246 98302 37486 33013 36361 23452 40160 97463 839807 1753031 3608582 20916050 110978 210208 462398 746155 2809447 61622 44917 16323 20161 11584 207240 287164 245358
0.5271 0.6505 0.5647 0.5408 0.5649 0.5498 0.5421 0.5565 0.6064 0.6080 0.5972 0.6525 0.3428 0.3605 0.3890 0.4091 0.3799 0.6111 0.5212 0.5072 0.4720 0.4002 0.6402 0.5069 0.5175 0.3870 0.4519 0.3067 0.3411 0.3708 0.3515 0.4099 0.4357 0.4352 0.3688
0.4613 0.4799 0.4727 0.4599 0.4725 0.4631 0.5159 0.5007 0.5345 0.5336 0.5467 0.5712 0.5506 0.5589 0.5556 0.5780 0.5451 0.5661 0.4935 0.4723 0.4494 0.4000 0.5659 0.5297 0.5089 0.4775 0.4532 0.5250 0.5391 0.5788 0.5676 0.5999 0.5225 0.5157 0.5088
0.0658 0.1706 0.0920 0.0809 0.0924 0.0867 0.0262 0.0558 0.0719 0.0744 0.0505 0.0813 -0.2078 -0.1984 -0.1666 -0.1689 -0.1652 0.0450 0.0277 0.0349 0.0226 0.0002 0.0743 -0.0228 0.0086 -0.0905 -0.0013 -0.2183 -0.1980 -0.2080 -0.2161 -0.1900 -0.0868 -0.0805 -0.1400
It can be shown, however, that R4 has at least a slight association with N, as can be seen in Figure 3.12.
Figure 3.12: The slight association of R4 to N
Even if this dependence is not strong – the confidence interval is too wide – direct comparisons are not quite satisfactory. In order to find a solution, we compute the curve R4 = f(N), which will, of course, change when one adds further texts; but since their number is relatively great, the parameters are sufficiently stable. From the data we obtain
\[
R_{4t} = 1 - G_t = 1.214\,N^{-0.114}.
\]
The vocabulary richness can then be estimated simply by the difference between the observed R4 and the theoretical R4t, as shown in the last column of Table 3.17. These difference values are presented in Figure 3.13; as a matter of fact, they are the indicators of vocabulary richness.
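The following small sketch (in Python, used here purely for illustration; the function names are ours) computes the theoretical value and the difference used above. The numbers in the final comment refer to the first column of the table block above and are only approximate.

```python
# Minimal sketch: theoretical R4t = 1 - Gt as a function of text length N,
# and the difference Diff(G) = R4 - R4t used as a vocabulary-richness indicator.

def r4_theoretical(n: float) -> float:
    """R4t = 1.214 * N**(-0.114), the empirical curve fitted above."""
    return 1.214 * n ** -0.114

def richness_difference(r4_observed: float, n: float) -> float:
    """Diff(G): positive values indicate above-average richness for length N."""
    return r4_observed - r4_theoretical(n)

# e.g. a text with N = 8193 and observed R4 = 0.3047 (first column of the block
# above) gives R4t of roughly 0.435 and Diff(G) of roughly -0.13.
print(round(r4_theoretical(8193), 3), round(richness_difference(0.3047, 8193), 3))
```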
3.4.1 The n-point

In this case we have a simple analogy to the m-point, but with reversed ranking. The n-point is the point on the Lorenz curve which lies closest to the point [1, 0]. The distance between the points on the curve and [1, 0] can be computed as usual, using
\[
D = \sqrt{(1 - r_r)^2 + F_{r_r}^2} \qquad (3.25)
\]
Figure 3.13: The difference between G and Gt
and min D signals the smallest distance. Now, since up to this point usually infrequent words occur, the complement to F(n) can be considered the n-coverage of the text, i.e.
\[
C_n = 1 - F(n). \qquad (3.26)
\]
For illustration we compute again the n-point for Goethe's Erlkönig. We have N = 225, V = 124; at rank r = 85 there is the last frequency 1, i.e.

$r_{r,84} = 84/124 = 0.6774$ and $F(84) = 84/225 = 0.3733$, yielding $D = [(1 - 0.6774)^2 + 0.3733^2]^{1/2} = 0.4933$,

$r_{r,85} = 85/124 = 0.6855$ and $F(85) = 85/225 = 0.3778$, yielding $D = [(1 - 0.6855)^2 + 0.3778^2]^{1/2} = 0.4916$,

$r_{r,86} = 86/124 = 0.6935$ and $F(86) = 87/225 = 0.3867$, yielding $D = [(1 - 0.6935)^2 + 0.3867^2]^{1/2} = 0.4934$,

hence the n-point is at r = 85. In this case it is the last hapax legomenon, but it need not be so in general. Since F(85) = 0.3778, the n-coverage is C_85 = 1 − 0.3778 = 0.6222.
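To make the search explicit, here is a minimal sketch (Python, assumed merely for illustration) of the procedure of formula (3.25); applied to the Erlkönig frequency list it should reproduce the values above (r = 85, coverage about 0.62).

```python
# n-point search: rank the frequencies in reversed order (least frequent first),
# and find the rank r minimizing D = sqrt((1 - r/V)^2 + F(r)^2), where F(r) is
# the reversed cumulative relative frequency; the n-coverage is 1 - F(n).

from math import hypot

def n_point(frequencies):
    """frequencies: rank-frequency list in the usual (descending) order."""
    reversed_freqs = sorted(frequencies)            # ascending = reversed ranking
    V, N = len(reversed_freqs), sum(reversed_freqs)
    best_rank, best_d, cum = None, float("inf"), 0
    for r, f in enumerate(reversed_freqs, start=1):
        cum += f
        d = hypot(1 - r / V, cum / N)               # distance to the point [1, 0]
        if d < best_d:
            best_rank, best_d = r, d
    coverage = 1 - sum(reversed_freqs[:best_rank]) / N   # C_n = 1 - F(n)
    return best_rank, best_d, coverage
```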
Now, since in this section we have to do with reversed cumulative frequencies and in the previous section with plain ones, one must expect a kind of symmetry or complementarity. Consider first the graphical presentation of cumulative frequencies in Figure 3.14, showing that the same fact is presented in two ways.
Figure 3.14: Plain (F1 ) and reversed (F2 ) cumulative relative frequencies
The complementarity can be shown in the comparative Table 3.18, in which the m-point and n-point data are collected, with R3 = F(n) = 1 − F(m) (cf. Table 3.15). This kind of computation is possible only with rank frequencies; it makes no sense with the spectra, in which the “natural” variable cannot be reversed. This is, at the same time, an argument for the relevance of ranking, which is present in various other sciences. Table 3.18: The complementarity of m- and n-measures Text ID B-01 B-02 B-03 B-04
N
V
m
761 352 515 483
400 201 285 286
101 49 72 74
n =V −m 299 152 213 212
D(m) = D(n)
R3
0.4670 0.3929 0.4959 0.4318 0.4846 0.4136 0.5095 0.4389 (continued on next page)
66 The h- and related points Table 3.18 (continued from previous page) Text ID B-05 B-06 B-07 B-08 B-09 B-10 Cz-01 Cz-02 Cz-03 Cz-04 Cz-05 Cz-06 Cz-07 Cz-08 Cz-09 Cz-10 E-01 E-02 E-03 E-04 E-05 E-06 E-07 E-08 E-09 E-10 E-11 E-12 E-13 G-01 G-02 G-03 G-04 G-05 G-06 G-07 G-08
N
V
m
406 687 557 268 550 556 1044 984 2858 522 999 1612 2014 677 460 1156 2330 2971 3247 4622 4760 4862 5004 5083 5701 6246 8193 9088 11265 1095 845 500 545 559 545 263 965
238 388 324 179 313 317 638 543 1274 323 556 840 862 389 259 638 939 1017 1001 1232 1495 1176 1597 985 1574 1333 1669 1825 1659 530 361 281 269 332 326 169 509
61 94 82 55 77 78 173 130 309 89 131 179 232 103 70 149 239 207 207 271 293 234 307 194 325 278 334 337 298 134 105 67 83 87 86 49 129
n =V −m 177 294 242 124 236 239 465 413 965 234 425 661 630 286 189 489 700 810 794 961 1202 942 1290 791 1249 1055 1335 1488 1361 396 256 214 186 245 240 120 380
D(m) = D(n)
R3
0.5057 0.4360 0.4918 0.4279 0.5028 0.4345 0.5554 0.4627 0.4946 0.4291 0.4953 0.4299 0.5215 0.4454 0.4832 0.4197 0.4157 0.3376 0.5262 0.4483 0.4863 0.4254 0.4621 0.4100 0.4145 0.3153 0.4986 0.4225 0.4918 0.4109 0.4832 0.4230 0.4060 0.3163 0.3605 0.2975 0.3626 0.2978 0.3518 0.2746 0.3590 0.3008 0.3388 0.2742 0.3568 0.3006 0.3188 0.2506 0.3335 0.2619 0.3216 0.2448 0.3169 0.2457 0.2945 0.2294 0.2898 0.2274 0.4413 0.3616 0.4490 0.3420 0.4899 0.4280 0.4601 0.3413 0.5106 0.4383 0.5133 0.4404 0.5406 0.4563 0.4683 0.3938 (continued on next page)
Table 3.18 (continued from previous page) Text ID
N
V
m
G-09 G-10 G-11 G-12 G-13 G-14 G-15 G-16 G-17 H-01 H-02 H-03 H-04 H-05 Hw-01 Hw-02 Hw-03 Hw-04 Hw-05 Hw-06 I-01 I-02 I-03 I-04 I-05 In-01 In-02 In-03 In-04 In-05 Kn-003 Kn-004 Kn-005 Kn-006 Kn-011 Kn-012 Kn-013
653 480 468 251 460 184 593 518 225 2044 1288 403 963 413 282 1829 3507 7892 7620 12356 11760 6064 854 3258 1129 376 373 347 343 414 3188 1050 4869 5231 4541 4141 1302
379 301 297 169 253 129 378 292 124 1079 789 291 609 290 104 257 521 744 680 1039 3667 2203 483 1237 512 221 209 194 213 188 1833 720 2477 2433 2516 1842 807
96 85 85 53 76 43 109 76 39 235 215 100 182 96 31 54 100 118 111 156 675 484 117 297 151 57 61 63 67 57 459 230 692 745 642 527 224
n =V −m 283 216 212 116 177 86 269 216 85 844 574 191 427 194 73 203 421 626 569 883 2992 1719 366 940 361 164 148 131 146 131 1374 490 1785 1688 1874 1315 583
D(m) = D(n)
R3
0.5020 0.4334 0.5313 0.4500 0.5358 0.4530 0.5585 0.4622 0.4882 0.3848 0.5741 0.4674 0.5375 0.4536 0.4916 0.4170 0.4916 0.3778 0.4668 0.4129 0.5224 0.4457 0.5854 0.4739 0.5582 0.4714 0.5747 0.4698 0.4315 0.3131 0.3186 0.2395 0.2886 0.2156 0.2399 0.1801 0.2386 0.1740 0.2208 0.1619 0.3477 0.2950 0.3737 0.3023 0.4923 0.4286 0.3972 0.3165 0.4383 0.3242 0.5067 0.4362 0.4926 0.3968 0.4980 0.3775 0.5293 0.4257 0.4542 0.3382 0.4985 0.4310 0.5655 0.4667 0.4609 0.3666 0.4493 0.3288 0.4852 0.4127 0.4307 0.3219 0.5268 0.4478 (continued on next page)
Table 3.18 (continued from previous page) Text ID
N
V
m
Kn-016 Kn-017 Lk-01 Lk-02 Lk-03 Lk-04 Lt-01 Lt-02 Lt-03 Lt-04 Lt-05 Lt-06 M-01 M-02 M-03 M-04 M-05 Mq-01 Mq-02 Mq-03 Mr-001 Mr-002 Mr-003 Mr-004 Mr-005 Mr-006 Mr-007 Mr-008 Mr-009 Mr-010 Mr-015 Mr-016 Mr-017 Mr-018 Mr-020 Mr-021 Mr-022
4735 4316 345 1633 809 219 3311 4010 4931 4285 1354 829 2062 1175 1434 1289 3620 2330 457 1509 2998 2922 4140 6304 4957 3735 3162 5477 6206 5394 4693 3642 4170 4062 3943 3846 4099
2356 2122 174 479 272 116 2211 2334 2703 1910 909 609 398 277 277 326 514 289 150 301 1555 1186 1731 2451 2029 1503 1262 1807 2387 1650 1947 1831 1853 1788 1825 1793 1703
660 608 46 91 57 35 682 591 653 520 282 214 87 66 61 67 97 66 36 66 426 321 486 627 556 407 335 414 607 415 523 504 544 508 496 549 452
n =V −m 1696 1514 128 388 215 81 1529 1743 2050 1390 627 395 311 211 216 259 417 223 114 235 1129 865 1245 1824 1473 1096 927 1393 1780 1235 1424 1327 1309 1280 1329 1244 1251
D(m) = D(n)
R3
0.4547 0.3582 0.4529 0.3508 0.4556 0.3710 0.3464 0.2897 0.3785 0.3152 0.4773 0.3699 0.5553 0.4618 0.5030 0.4347 0.3740 0.2815 0.3945 0.2856 0.5574 0.4631 0.5920 0.4765 0.3218 0.2362 0.3486 0.2545 0.3272 0.2420 0.3405 0.2715 0.2848 0.2133 0.3220 0.2270 0.3909 0.3085 0.3405 0.2604 0.4657 0.3766 0.4288 0.3326 0.4380 0.3362 0.4170 0.3293 0.4322 0.3343 0.4318 0.3363 0.4261 0.3333 0.3811 0.3045 0.4173 0.3308 0.3752 0.2785 0.4209 0.3241 0.4566 0.3644 0.4418 0.3302 0.4298 0.3225 0.4369 0.3289 0.4490 0.3284 0.4151 0.3191 (continued on next page)
Table 3.18 (continued from previous page) Text ID
N
V
m
Mr-023 Mr-024 Mr-026 Mr-027 Mr-028 Mr-029 Mr-030 Mr-031 Mr-032 Mr-033 Mr-034 Mr-035 Mr-036 Mr-038 Mr-040 Mr-043 Mr-046 Mr-052 Mr-149 Mr-150 Mr-151 Mr-154 Mr-288 Mr-289 Mr-290 Mr-291 Mr-292 Mr-293 Mr-294 Mr-295 Mr-296 Mr-297 R-01 R-02 R-03 R-04 R-05
4142 4255 4146 4128 5191 3424 5504 5105 5195 4339 3489 1862 4205 4078 5218 3356 4186 3549 2946 3372 4843 3601 4060 4831 4025 3954 4765 3337 3825 4895 3836 4605 1738 2279 1264 1284 1032
1872 1731 2038 1400 2386 1412 2911 2617 2382 2217 1865 1115 2070 1607 2877 1962 1458 1628 1547 1523 1702 1719 2079 2312 2319 1957 2197 2006 1931 2322 1970 2278 843 1179 719 729 567
553 469 551 350 670 387 706 682 659 628 495 294 568 421 671 500 378 484 482 465 397 535 544 655 578 546 650 532 548 697 509 607 236 270 176 178 142
n =V −m 1319 1262 1487 1050 1716 1025 2205 1935 1723 1589 1370 821 1502 1186 2206 1462 1080 1144 1065 1058 1305 1184 1535 1657 1741 1411 1547 1474 1383 1625 1461 1671 607 909 543 551 425
D(m) = D(n)
R3
0.4406 0.3269 0.4297 0.3335 0.4491 0.3587 0.3932 0.3035 0.4337 0.3306 0.4312 0.3329 0.4683 0.4006 0.4600 0.3790 0.4319 0.3317 0.4630 0.3662 0.4740 0.3927 0.5138 0.4409 0.4504 0.3572 0.4231 0.3323 0.4828 0.4228 0.5047 0.4356 0.4057 0.3120 0.4398 0.3240 0.4772 0.3615 0.4555 0.3381 0.4004 0.3254 0.5327 0.3288 0.4598 0.3781 0.4449 0.3430 0.4992 0.4325 0.4530 0.3569 0.4392 0.3247 0.5152 0.4417 0.4596 0.3616 0.4476 0.3320 0.4602 0.3809 0.4502 0.3629 0.4476 0.3493 0.4599 0.3989 0.4944 0.4296 0.4937 0.4291 0.4820 0.4118 (continued on next page)
Table 3.18 (continued from previous page) Text ID
N
V
m
R-06 Rt-01 Rt-02 Rt-03 Rt-04 Rt-05 Ru-01 Ru-02 Ru-03 Ru-04 Ru-05 Sl-01 Sl-02 Sl-03 Sl-04 Sl-05Sm-01 Sm-02 Sm-03 Sm-04 Sm-05 T-01 T-02 T-03
695 968 845 892 625 1059 753 2595 3853 6025 17205 756 1371 1966 3491 5588 1487 1171 617 736 447 1551 1827 2054
432 223 214 207 181 197 422 1240 1792 2536 6073 457 603 907 1102 2223 267 222 140 153 124 611 720 645
120 51 45 52 44 48 105 293 426 668 1289 122 171 255 218 553 50 49 33 36 29 145 172 119
n =V −m
D(m) = D(n)
312 172 169 155 137 149 317 947 1366 1868 4784 335 432 652 884 1670 217 173 107 117 95 466 548 526
0.5279 0.3350 0.3365 0.3706 0.3732 0.3609 0.4890 0.4348 0.4269 0.4090 0.3680 0.5173 0.4283 0.4348 0.3638 0.3994 0.3152 0.3408 0.3480 0.3394 0.3663 0.3829 0.3856 0.3472
R3 0.4489 0.2448 0.2627 0.2724 0.2832 0.2663 0.4210 0.3649 0.3545 0.3129 0.3006 0.4431 0.3209 0.3316 0.3054 0.3125 0.2535 0.2596 0.2561 0.2446 0.2819 0.3005 0.3026 0.2941
3.5 The role of N and V
Text length N and its vocabulary V usually exert influence on many quantitative characteristics. Ju.K. Orlov (cf. Orlov, Boroda & Nadarejšvili 1982) is of the opinion that the author “plans” or “aims at” a special text length before he begins to write. This planned length, today called the Zipf-Orlov size, controls the flow of information in the text. This idea is surely true with regard to journalistic texts, which are usually short, but somewhat problematic with literary or scientific ones, where a planned short story can become a novel or a planned short article can become a book. The problem of information flow is not unequivocal either. There is the speaker’s and the hearer’s information, which
must be distinguished (cf. Andersen & Altmann 2005). Hence there are three ways to treat the dependence of text characteristics on length.

1. If a characteristic depends on N, one can find a theoretical (or a preliminary empirical) function which corresponds to this dependence and define the characteristic as the deviation from this function. Since the deviations are both positive and negative, one can transform them and define a new characteristic.

2. One computes a theoretical function and sets one or more confidence intervals about it. The texts can then be classified or evaluated according to the zone in which they fall. In both cases the theoretical function changes slightly whenever even one new text is added to the corpus under study. Of course, after thousands of texts in thousands of languages have been evaluated, the function will be very stable, but this state is still rather illusory. Hence all endeavours in this direction must be considered preliminary, unless one succeeds in deriving the function (or the area) theoretically. Of course, the parameters of the function must also follow from theoretical considerations, a requirement which is still illusory in linguistics.

3. One constructs a characteristic which does not depend on text length. This is not easy, but it is sometimes possible; some such characteristics have been shown above.

In the first two cases the direct difference between two indices from texts of different length is meaningless: if an index depends on N, then two indices are not directly commensurable. Nevertheless a test can be performed with the centralized index T = I − E(I), where E(I) is the expectation of the index given by the theoretical or (preliminarily computed) empirical function. A classification of texts is possible only with this transformed index. In the third case, any asymptotic comparison requires the variance of the given index, which always stands in some relation to N or to V, as shown above. Disregarding the variance can lead to the development of, e.g., type-token indices which, in the form of a function, can adequately capture the course of the TTR but are not suitable for direct comparisons because of their somewhat monstrous variances, as shown by Wimmer and Altmann (1999). Another problem is the well-known weakness of classical tests. They work well up to a certain sample size, but increasing the size – a notorious situation in linguistics – can render any difference significant. Perhaps one should rather use distribution-free tests, or indices like the determination coefficient, whose value can directly motivate a decision without recourse to probability.
4 The geometry of word frequencies

4.1 Introduction
The points h, k, m and n which have been defined in the preceding chapters can be used not only in connection with some elementary questions of textology (such as vocabulary richness or text coverage), but also give us the possibility of characterizing the distribution in a geometric way. How did the writer manage to place the given restricted number of words (V) in the text? Consider first the rank frequency distribution. In very short texts it is possible to use each word only once. In this case the rank frequency distribution is represented by a simple straight line beginning at (1, 1) and ending at (V, 1), because all frequencies are f(r) = 1 and both coordinates of the h-point are equal to 1. Now take a text with the same conditions but with one word occurring twice and all other words only once. In this case we obtain three characteristic points: (1, 2) for the first word with frequency 2, (2, 1) for the second word with frequency 1, and (V, 1) for the last word, occurring also once. These three points delimit a triangle which is characteristic for the given text. Now, if more words occur several times, f(1) becomes greater, i.e. the point (1, f(1)) gets higher; (V, 1) may retain its position or change according to the increased vocabulary, but the point (r, 1) (r = 1, 2, ..., V) loses its importance, it turns out to be rather random. Then we again apply the h-point and connect the points P1(V, 1), P2(1, f(1)) and P3(h, h), obtaining a triangle as shown in Figure 4.1 (taken from Popescu & Altmann 2007a). We prefer here an integer value for h; though in many cases $\min [r^2 + f(r)^2]^{1/2}$ is not an integer, we then take the first r ≥ f(r). The same situation arises when we analyse the frequency spectrum. Here we speak of frequencies g(x), in order to distinguish them from ranks. The three characteristic points are Q1(W, 1), W being the number of non-zero frequency classes, further Q2(1, g(1)) and Q3(k, k), as can be seen in Figure 4.2. Remember that here frequency classes having frequency 0 are not taken into account. They will be taken into account for different purposes in Chapter 9. In the next two sections we shall analyse and try to interpret the behaviour of these triangles. It is to be noted that we consider “normal” texts written by native speakers, in which P1 always has the coordinates (V, 1). One can
Figure 4.1: The characteristic triangle of the rank frequency distribution
imagine texts without hapax legomena, e.g. principally infinite texts, but in that case we are concerned with mixed or pathological texts, which are of interest for psychology but not for linguistics. On the other hand, h can be 1 only in very short texts in which no repetition occurs. It is known from type-token research that the first repetitions tend to begin around the twentieth word of a running text.
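The h-point determination described above can be made explicit with a small sketch (Python, assumed here only for illustration; the function names are ours): it looks for the rank at which the rank and the frequency meet, or, failing an exact fixed point, takes the first rank with r ≥ f(r), and returns the three characteristic points of Figure 4.1.

```python
# Determine the h-point of a rank frequency distribution and the vertices of
# its characteristic triangle P1(V, 1), P2(1, f(1)), P3(h, h).

def h_point(freqs):
    """freqs: frequencies sorted in descending order (rank 1 = most frequent)."""
    for r, f in enumerate(freqs, start=1):
        if r >= f:                      # first rank at or below the diagonal r = f(r)
            return r
    return len(freqs)                   # degenerate case: all f(r) > r

def characteristic_triangle(freqs):
    V, f1, h = len(freqs), freqs[0], h_point(freqs)
    return (V, 1), (1, f1), (h, h)      # P1, P2, P3 of Figure 4.1
```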
Figure 4.2: The characteristic triangle of the frequency spectrum
4.2 The rank frequency distribution
The writer of a text aims to convey to the reader a certain amount of “hearer information” which must be imparted within the given text length using a certain number of different words. The purely formal shape of this aim is represented by the rank frequency distribution. The slightly irregular shape of the sequence could be approximated by a discrete or continuous distribution – this will be done in Chapter 9 – and the area under the straight line connecting P1(V, 1) and P2(1, f(1)) in Figure 4.1 could be computed by adding the necessary number of trapezoidal areas (in the discrete case) or by the difference of two integrals (in the continuous case). We simplify this task and compute the area of the triangle between the points P1(V, 1), P2(1, f(1)) and P3(h, h), given as
\[
A_h = \frac{1}{2}\begin{vmatrix} x_1 & y_1 & 1 \\ x_2 & y_2 & 1 \\ x_3 & y_3 & 1 \end{vmatrix}
    = \frac{1}{2}\begin{vmatrix} V & 1 & 1 \\ 1 & f(1) & 1 \\ h & h & 1 \end{vmatrix} \qquad (4.1)
\]
yielding
\[
A_h = \frac{1}{2}\left[ V f(1) + 2h - h\,(V + f(1)) - 1 \right]. \qquad (4.2)
\]
For example, in Table 4.1 we find for the English text E-13 by R.P. Feynman the data V = 1659, f(1) = 780, h = 41, from which
\[
A_h = \frac{1}{2}\left[ 1659\,(780) + 2\,(41) - 41\,(1659 + 780) - 1 \right] = 597051.
\]
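A compact sketch (Python, used here only for illustration) of this computation follows; it also evaluates the maximal area and the ratio A, anticipating the definitions (4.3) and (4.4) given immediately below. The E-13 numbers are those quoted above.

```python
# Triangle area A_h of (4.2), maximal area A_max of (4.3), and their ratio A of (4.4).

def triangle_index(V, f1, h):
    a_h = 0.5 * (V * f1 + 2 * h - h * (V + f1) - 1)      # formula (4.2)
    a_max = 0.5 * (V - 1) * (f1 - 1)                     # formula (4.3)
    return a_h, a_max, a_h / a_max                       # index A, formula (4.4)

# Text E-13 (R.P. Feynman): V = 1659, f(1) = 780, h = 41
a_h, a_max, A = triangle_index(1659, 780, 41)
print(a_h, a_max, round(A, 4))    # 597051.0  645791.0  0.9245
```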
The greatest triangle that can be obtained for fixed V and f(1) results if we replace P3(h, h) by P3(1, 1), i.e.
\[
A_{\max} = \tfrac{1}{2}\,(V - 1)\,(f(1) - 1), \qquad (4.3)
\]
in our example Amax = (1/2)(1659 − 1)(780 − 1) = 645791. The ratio
\[
A = A_h / A_{\max} \qquad (4.4)
\]
shows the exploitation of the given vocabulary for the given aim (cf. Popescu & Altmann 2007a). This interpretation would be sufficient; yet we see at once
that the nearer the sequence of frequencies comes to the upper straight line, the more the words are exploited, i.e. the smaller the vocabulary richness. However, the greater the area Ah, the more words are used only rarely and only a few words are used frequently, i.e. the greater the vocabulary richness. Hence A is a further index of vocabulary richness. In our example we get A = 597051/645791 = 0.9245. In Table 4.1 one finds the values for the processed texts. Table 4.1: Index A for 176 texts in 20 languages Text ID B-01 B-02 B-03 B-04 B-05 B-06 B-07 B-08 B-09 B-10 Cz-01 Cz-02 Cz-03 Cz-04 Cz-05 Cz-06 Cz-07 Cz-08 Cz-09 Cz-10 E-01 E-02 E-03 E-04 E-05
N
V
f (1)
h
A
Text ID
N
761 352 515 483 406 687 557 268 550 556 1044 984 2858 522 999 1612 2014 677 460 1156 2330 2971 3247 4622 4760
400 201 285 286 238 388 324 179 313 317 638 543 1274 323 556 840 862 389 259 638 939 1017 1001 1232 1495
40 13 15 21 19 28 19 10 20 26 58 56 182 27 84 106 134 31 30 50 126 168 229 366 297
10 8 9 8 7 9 8 6 9 7 9 11 19 7 9 13 15 8 6 11 16 22 19 23 26
0.7467 0.3817 0.4004 0.6254 0.6414 0.6830 0.5894 0.4164 0.5533 0.7410 0.8471 0.7997 0.8864 0.7506 0.8892 0.8714 0.8785 0.7486 0.8082 0.7802 0.8640 0.8536 0.9031 0.9219 0.8988
Lt-05 Lt-06 M-01 M-02 M-03 M-04 M-05 Mq-01 Mq-02 Mq-03 Mr-001 Mr-002 Mr-003 Mr-004 Mr-005 Mr-006 Mr-007 Mr-008 Mr-009 Mr-010 Mr-015 Mr-016 Mr-017 Mr-018 Mr-020
1354 829 2062 1175 1434 1289 3620 2330 457 1509 2998 2922 4140 6304 4957 3735 3162 5477 6206 5394 4693 3642 4170 4062 3943
V
f (1)
h
A
909 33 8 0.7735 609 19 7 0.6568 398 152 18 0.8446 277 127 15 0.8382 277 128 17 0.816 326 137 15 0.8540 514 234 26 0.8440 289 247 22 0.8417 150 42 10 0.7201 301 218 14 0.8968 1555 75 14 0.8160 1186 73 18 0.7495 1731 68 20 0.7054 2451 314 24 0.9171 2029 172 19 0.8859 1503 120 19 0.8368 1262 80 16 0.7982 1807 190 27 0.8480 2387 93 26 0.7178 1650 217 27 0.8639 1947 136 21 0.8416 1831 63 18 0.7165 1853 67 19 0.7176 1788 126 20 0.8374 1825 62 19 0.6950 (continued on next page)
Table 4.1 (continued from previous page) Text ID
N
V
E-06 4862 1176 E-07 5004 1597 E-08 5083 985 E-09 5701 1574 E-10 6246 1333 E-118193 1669 E-12 9088 1825 E-13 11265 1659 G-01 1095 530 G-02 845 361 G-03 500 281 G-04 545 269 G-05 559 332 G-06 545 326 G-07 263 169 G-08 965 509 G-09 653 379 G-10 480 301 G-11 468 297 G-12 251 169 G-13 460 253 G-14 184 129 G-15 593 378 G-16 518 292 G-17 225 124 H-01 2044 1079 H-02 1288 789 H-03 403 291 H-04 936 609 H-05 413 290 Hw-01 282 104 Hw-02 1829 257 Hw-03 3507 521 Hw-04 7892 744 Hw-05 7620 680 Hw-06 12356 1039 I-01 11760 3667
f (1)
h
A
Text ID
N
460 237 466 342 546 622 617 780 83 48 33 32 30 30 17 39 30 18 18 14 19 10 16 16 11 225 130 48 76 32 19 121 277 535 416 901 388
24 25 26 29 28 32 39 41 12 9 8 8 8 8 5 11 9 7 7 6 8 5 8 8 6 12 8 4 7 6 7 21 26 38 38 44 37
0.9303 0.8833 0.9208 0.9001 0.9302 0.9315 0.9175 0.9245 0.8451 0.8076 0.7563 0.7481 0.7375 0.7371 0.7262 0.7172 0.7030 0.6271 0.6268 0.5856 0.5833 0.5243 0.5148 0.5093 0.4593 0.9407 0.9369 0.9258 0.9101 0.8214 0.6084 0.7552 0.8613 0.8809 0.8564 0.9108 0.8972
Mr-021 Mr-022 Mr-023 Mr-024 Mr-026 Mr-027 Mr-028 Mr-029 Mr-030 Mr-031 Mr-032 Mr-033 Mr-034 Mr-035 Mr-036 Mr-038 Mr-040 Mr-043 Mr-046 Mr-052 Mr-149 Mr-150 Mr-151 Mr-154 Mr-288 Mr-289 Mr-290 Mr-291 Mr-292 Mr-293 Mr-294 Mr-295 Mr-296 Mr-297 R-01 R-02 R-03
3846 4099 4142 4255 4146 4128 5191 3424 5504 5105 5195 4339 3489 1862 4205 4078 5218 3356 4186 3549 2946 3372 4843 3601 4060 4831 4025 3954 4765 3337 3825 4895 3836 4605 1738 2279 1264
V
f (1)
h
A
1793 58 20 0.6561 1703 142 21 0.8464 1872 72 20 0.7222 1731 80 20 0.7485 2038 84 19 0.7743 1400 92 21 0.7659 2386 86 23 0.7320 1412 28 17 0.3961 2911 86 20 0.7699 2617 91 21 0.7701 2382 98 23 0.7640 2217 71 19 0.7347 1865 40 17 0.5812 1115 29 11 0.6339 2070 96 19 0.8018 1607 66 20 0.6959 2877 81 21 0.7430 1962 44 16 0.6435 1458 68 20 0.7034 1628 89 17 0.8083 1547 47 12 0.7538 1523 64 16 0.7520 1702 192 23 0.8719 1719 68 17 0.7519 2079 84 17 0.7995 2312 112 19 0.8300 2319 42 17 0.6029 1957 86 18 0.7913 2197 88 19 0.7849 2006 41 13 0.6940 1931 85 17 0.8012 2322 97 20 0.7939 1970 92 18 0.8046 2278 88 18 0.7971 843 62 14 0.7714 1179 110 16 0.8497 719 65 12 0.8128 (continued on next page)
Table 4.1 (continued from previous page) Text ID
N
V
f (1)
h
A
Text ID
N
V
f (1)
h
A
I-02 I-03 I-04 I-05 In-01 In-02 In-03 In-04 In-05 Kn-003 Kn-004 Kn-005 Kn-006 Kn-011 Kn-012 Kn-013 Kn-016 Kn-017 Lk-01 Lk-02 Lk-03 Lk-04 Lt-01 Lt-02 Lt-03 Lt-04
6064 854 3258 1129 376 373 347 343 414 3188 1050 4869 5231 4541 4141 1302 4735 4316 345 1633 809 219 3311 4010 4931 4285
2203 483 1237 512 221 209 194 213 188 1833 720 2477 2433 2516 1842 807 2356 2122 174 479 272 116 2211 2334 2703 1910
257 64 118 42 16 18 14 11 16 74 23 101 74 63 58 35 93 122 20 124 62 18 133 190 103 99
25 10 21 12 6 7 6 5 8 13 7 16 20 17 19 10 18 18 8 17 12 6 12 18 19 20
0.8954 0.8385 0.8129 0.7102 0.6439 0.6182 0.5895 0.5811 0.4959 0.8291 0.7189 0.8439 0.7319 0.7356 0.6744 0.7241 0.8080 0.8515 0.5911 0.8364 0.7791 0.6624 0.9117 0.9028 0.8169 0.7962
R-04 R-05 R-06 Rt-01 Rt-02 Rt-03 Rt-04 Rt-05 Ru-01 Ru-02 Ru-03 Ru-04 Ru-05 Sl-01 Sl-02 Sl-03 Sl-04 Sl-05Sm-01 Sm-02 Sm-03 Sm-04 Sm-05 T-01 T-02 T-03
1284 1032 695 968 845 892 625 1059 753 2595 3853 6025 17205 756 1371 1966 3491 5588 1487 1171 617 736 447 1551 1827 2054
729 567 432 223 214 207 181 197 422 1240 1792 2536 6073 457 603 907 1102 2223 267 222 140 153 124 611 720 645
49 46 30 111 69 66 49 74 31 138 144 228 701 47 66 102 328 193 159 103 45 78 39 89 107 128
10 11 10 14 13 13 11 15 8 16 21 25 41 9 13 13 21 25 17 15 13 12 11 14 15 19
0.8001 0.7601 0.6688 0.8233 0.7672 0.7571 0.7361 0.7368 0.7500 0.8784 0.8490 0.8848 0.9363 0.8085 0.7955 0.8679 0.9207 0.8642 0.8386 0.7994 0.6409 0.7848 0.6555 0.8310 0.8485 0.8303
Fortunately, A is a simple proportion, hence an asymptotic test can easily be performed. Comparing the index for two texts, say A1 and A2, we calculate the mean of both as
\[
\bar{A} = \frac{A_{h1} + A_{h2}}{A_{\max 1} + A_{\max 2}} \qquad (4.5)
\]
and insert it into the criterion
\[
z = \frac{A_1 - A_2}{\sqrt{\bar{A}\,(1 - \bar{A})\left(\dfrac{1}{A_{\max 1}} + \dfrac{1}{A_{\max 2}}\right)}}, \qquad (4.6)
\]
which is asymptotically normally distributed. If z > 1.96, then text T1 is significantly richer than text T2 , if z < −1.96, then text T2 is significantly richer (at the 0.05 level). Of course, the significance level can be determined differently. Though the differences seem to be very small, it is easy to show that many of them are significant. For example, we compare texts E-09 and E-05 with each other, getting Ah(E−09) = 241400.5, Amax(E−09) = 268196.5, AE−09 = 0.9001, Ah(E−05) = 198737,
Amax(E−05) = 221112,
AE−05 = 0.8988,
Ā = (241400.5 + 198737) / (268196.5 + 221112) = 0.8995, z = 1.48. The result is not significant. Comparing E-05 and E-07, however, with values of A(E-05) = 0.8988 and A(E-07) = 0.8833, we obtain z = 15.94, which is highly significant. Although the ordering of texts according to A is correct, intuitive judgements can be very fallacious. Index A does not depend on text length N. In Figure 4.3 one can see the change of A with N. No trend is observable; the data are rather subdivided into layers, for which a possible dependence might be found. But if there is any, it rests on boundary conditions which must be determined by very extensive research. In order to set up classifications, one could take the mean of all A values and form different confidence intervals around it. Though this result could change with every new language, we perform it with our K texts and obtain
\[
m_{1A} = \frac{1}{K}\sum_{i=1}^{K} A_i, \qquad (4.7)
\]
where K is the number of texts used and i is the running index. The empirical variance of A is obtained from
\[
s_A^2 = \frac{1}{K}\sum_{i=1}^{K} (A_i - m_{1A})^2, \qquad (4.8)
\]
hence a confidence interval can be constructed as
\[
m_{1A} - u_{\alpha/2}\, s_A \le A \le m_{1A} + u_{\alpha/2}\, s_A, \qquad (4.9)
\]

Figure 4.3: The relationship between text length N and index A
where u is the quantile of the normal distribution. Unfortunately, there is no objective criterion for determining u. If one wants the interval to encompass 95% of all observations, then u = 1.96; for 99% of observations, u = 2.57. Using these two values we obtain six domains of A.

Below the mean:
– smaller than the 99% limit (extremely small richness)
– between the 95% and the 99% limit (small richness)
– between the mean and the 95% limit (negative richness)

Above the mean:
– greater than the mean but smaller than the 95% limit (positive richness)
– between the 95% and the 99% limit (great richness)
– greater than the 99% limit (extreme richness)

But even if we classify the texts in this way, texts placed in the same domain can differ significantly. Hence a simple ordering according to A may serve the same aim.
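A minimal sketch (Python, assumed here only for illustration) of the comparison (4.5)–(4.6) and of the confidence bands (4.7)–(4.9) follows; the E-09 / E-05 numbers are those of the worked example above, and the function names are ours.

```python
# Asymptotic comparison of two A indices and construction of richness bands.

from math import sqrt

def compare_A(a_h1, a_max1, a_h2, a_max2):
    """z criterion of (4.6); |z| > 1.96 marks a significant difference at 0.05."""
    a1, a2 = a_h1 / a_max1, a_h2 / a_max2
    a_bar = (a_h1 + a_h2) / (a_max1 + a_max2)            # pooled mean (4.5)
    se = sqrt(a_bar * (1 - a_bar) * (1 / a_max1 + 1 / a_max2))
    return (a1 - a2) / se

def richness_bands(A_values, u95=1.96, u99=2.57):
    """Mean and empirical standard deviation of A, (4.7)-(4.8), and the limits
    of (4.9) that delimit the six richness domains described above."""
    K = len(A_values)
    m = sum(A_values) / K
    s = sqrt(sum((a - m) ** 2 for a in A_values) / K)
    return m, [m - u99 * s, m - u95 * s, m, m + u95 * s, m + u99 * s]

print(round(compare_A(241400.5, 268196.5, 198737, 221112), 2))   # ~1.48 (E-09 vs E-05)
```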
4.3 The spectrum
The same reflection can be made concerning the frequency spectrum; here we have the analogous k-point (see Section 3.2): words contributing to vocabulary richness are at the beginning of the distribution, and the more such words there are (i.e. words occurring once, twice, ...), the smaller the triangle will be, because the spectrum curve would lie near the upper straight line joining Q1 and Q2, as can be seen in Figure 4.2. Thus the smaller the triangle, the greater the vocabulary richness. Here we denote by W the greatest non-zero class and by g(1) the occurrence frequency at x = 1. The triangle Bk can be computed analogously, namely as
\[
B_k = \frac{1}{2}\left[ W g(1) + 2k - k\,(W + g(1)) - 1 \right], \qquad (4.10)
\]
the maximal triangle is defined as
\[
B_{\max} = \frac{1}{2}\,(W - 1)\,(g(1) - 1) \qquad (4.11)
\]
and the index B as
\[
B = B_k / B_{\max}. \qquad (4.12)
\]
In Table 4.2 all B-values for our texts can be found. The testing procedure is identical to that for index A. Comparing two texts, we build the mean B as
\[
\bar{B} = \frac{B_{k1} + B_{k2}}{B_{\max 1} + B_{\max 2}} \qquad (4.13)
\]
and insert it into the criterion
\[
z = \frac{B_1 - B_2}{\sqrt{\bar{B}\,(1 - \bar{B})\left(\dfrac{1}{B_{\max 1}} + \dfrac{1}{B_{\max 2}}\right)}}, \qquad (4.14)
\]
which is asymptotically normally distributed. Let us again illustrate the testing by comparing the B indices of the Kannada texts Kn-003 and Kn-004; we have the initial data

Text ID    W    g(1)   k    Bk        Bmax       B
Kn-003     22   1374   8    9537.5    14416.5    0.6616
Kn-004     13    565   5    2232.0     3384.0    0.6596
from which
\[
\bar{B} = \frac{9537.5 + 2232}{14416.5 + 3384} = 0.6612.
\]
Inserting these numbers into formula (4.14) we obtain
\[
z = \frac{0.6616 - 0.6596}{\sqrt{0.6612\,(0.3388)\left(\dfrac{1}{14416.5} + \dfrac{1}{3384}\right)}} = 0.22,
\]
which is, in spite of very different starting values, not significant. Table 4.2: Index B for 176 texts in 20 languages Text ID B-01 B-02 B-03 B-04 B-05 B-06 B-07 B-08 B-09 B-10 Cz-01 Cz-02 Cz-03 Cz-04 Cz-05 Cz-06 Cz-07 Cz-08 Cz-09 Cz-10 E-01 E-02 E-03
N
W
g(1)
k
B
Text ID
N
761 352 515 483 406 687 557 268 550 556 1044 984 2858 522 999 1612 2014 677 460 1156 2330 2971 3247
15 12 15 17 13 16 14 9 13 13 15 17 33 10 17 22 27 13 12 19 30 38 33
299 154 213 223 188 308 248 145 242 244 518 413 965 242 446 662 625 286 189 510 663 736 621
5 4 4 4 4 4 5 3 6 5 6 5 7 5 5 5 6 4 5 6 6 7 8
0.7009 0.7077 0.7716 0.799 0.7340 0.7902 0.6761 0.7361 0.5626 0.6502 0.6332 0.7403 0.8063 0.5390 0.7410 0.8035 0.7997 0.7395 0.6151 0.7124 0.8200 0.8297 0.7700
Lt-05 Lt-06 M-01 M-02 M-03 M-04 M-05 Mq-01 Mq-02 Mq-03 Mr-001 Mr-002 Mr-003 Mr-004 Mr-005 Mr-006 Mr-007 Mr-008 Mr-009 Mr-010 Mr-015 Mr-016 Mr-017
1354 829 2062 1175 1434 1289 3620 2330 457 1509 2998 2922 4140 6304 4957 3735 3162 5477 6206 5394 4693 3642 4170
W
g(1)
k
B
15 738 6 0.6361 9 522 4 0.6192 31 203 6 0.8086 24 147 6 0.7484 27 134 8 0.6781 25 193 8 0.6719 45 240 7 0.8385 39 92 9 0.7016 18 87 5 0.7182 24 139 7 0.6957 24 1129 9 0.6451 28 758 7 0.7699 30 1098 11 0.6461 40 1572 10 0.7635 30 1289 10 0.6827 28 936 10 0.657 26 800 8 0.7112 43 1118 10 0.7777 41 1507 12 0.7177 45 968 10 0.7861 36 1327 9 0.7654 29 1327 9 0.7083 32 1241 9 0.7355 (continued on next page)
Table 4.2 (continued from previous page) Text ID
N
E-04 E-05 E-06 E-07 E-08 E-09 E-10 E-11 E-12 E-13 G-01 G-02 G-03 G-04 G-05 G-06 G-07 G-08 G-09 G-10 G-11 G-12 G-13 G-14 G-15 G-16 G-17 H-01 H-02 H-03 H-04 H-05 Hw-01 Hw-02 Hw-03 Hw-04 Hw-05
4622 4760 4862 5004 5083 5701 6246 8193 9088 11265 1095 845 500 545 559 545 263 965 653 480 468 251 460 184 593 518 225 2044 1288 403 936 413 282 1829 3507 7892 7620
W
g(1)
k
B
Text ID
N
W
g(1)
k
B
42 694 7 0.845 Mr-018 4062 34 1250 10 0.7201 43 972 7 0.851 Mr-020 3943 31 1161 10 0.6922 42 641 9 0.7924 Mr-021 3846 31 1225 9 0.7268 45 1076 7 0.8581 Mr-022 4099 34 1194 9 0.7509 48 474 12 0.7427 Mr-023 4142 34 1284 9 0.7513 52 1005 8 0.8558 Mr-024 4255 37 1105 10 0.7418 50 694 10 0.8033 Mr-026 4146 32 1487 10 0.7036 57 881 11 0.8101 Mr-027 4128 34 847 9 0.7481 64 1071 10 0.8487 Mr-028 5191 39 1716 9 0.7848 70 737 11 0.8415 Mr-029 3424 28 910 9 0.6949 19 396 8 0.5934 Mr-030 5504 34 2205 9 0.7539 16 223 6 0.6441 Mr-031 5105 36 1935 8 0.7964 12 221 5 0.6182 Mr-032 5195 39 1723 10 0.7579 15 186 4 0.7695 Mr-033 4339 30 1589 9 0.7191 13 261 5 0.6513 Mr-034 3489 29 1370 8 0.7449 12 249 6 0.5253 Mr-035 1862 18 826 7 0.6398 9 133 4 0.6023 Mr-036 4205 34 1502 8 0.7832 18 380 5 0.7542 Mr-038 4078 35 1017 9 0.7568 14 303 5 0.6791 Mr-040 5218 35 2209 8 0.7909 12 238 4 0.7146 Mr-043 3356 28 1508 7 0.7738 10 233 7 0.3075 Mr-046 4186 35 854 10 0.7247 10 142 4 0.6454 Mr-052 3549 30 1138 9 0.7171 13 177 4 0.733 Mr-149 2946 22 1065 8 0.6601 8 108 3 0.6956 Mr-150 3372 27 976 8 0.7236 12 305 4 0.7174 Mr-151 4843 38 1034 10 0.748 12 216 5 0.6178 Mr-154 3601 30 1184 8 0.7527 9 85 3 0.7262 Mr-288 4060 31 1535 7 0.7961 19 845 7 0.6596 Mr-289 4831 36 1657 9 0.7666 14 639 6 0.6075 Mr-290 4025 25 1750 8 0.7043 7 260 4 0.4884 Mr-291 3954 32 1411 10 0.7033 13 510 5 0.6588 Mr-292 4765 32 1547 10 0.7039 9 251 3 0.742 Mr-293 3337 22 1515 8 0.662 12 58 5 0.5662 Mr-294 3825 27 1383 8 0.7257 36 96 8 0.7263 Mr-295 4895 34 1625 9 0.7526 45 256 7 0.8401 Mr-296 3836 30 1461 8 0.7538 68 348 9 0.8575 Mr-297 4605 34 1670 10 0.7219 66 303 9 0.8504 R-01 1738 26 607 7 0.7501 (continued on next page)
Table 4.2 (continued from previous page) Text ID
N
W
g(1)
k
B
Text ID
Hw-06 12356 84 501 10 0.8736 R-02 I-01 11760 65 2515 14 0.7917 R-03 I-02 6064 46 1605 10 0.7944 R-04 I-03 854 17 383 5 0.7395 R-05 I-04 3258 36 849 6 0.8512 R-06 I-05 1129 23 356 5 0.8069 Rt-01 In-01 376 11 167 3 0.788 Rt-02 In-02 373 11 148 4 0.6796 Rt-03 In-03 347 11 131 5 0.5692 Rt-04 In-04 343 8 146 4 0.5507 Rt-05 In-05 414 11 122 5 0.5669 Ru-01 Kn-003 3188 22 1374 8 0.6616 Ru-02 Kn-004 1050 13 565 5 0.6596 Ru-03 Kn-005 4869 27 1784 9 0.6878 Ru-04 Kn-006 5231 32 1656 10 0.7042 Ru-05 Kn-011 4541 22 1874 8 0.6629 Sl-01 Kn-012 4141 33 1297 9 0.7438 Sl-02 Kn-013 1302 17 612 5 0.7435 Sl-03 Kn-016 4735 28 1696 11 0.6237 Sl-04 Kn-017 4316 27 1514 9 0.6870 Sl-05 Lk-01 345 12 128 4 0.7037 Sm-01 Lk-02 1633 31 303 6 0.8168 Sm-02 Lk-03 809 22 175 6 0.7332 Sm-03 Lk-04 219 11 81 4 0.6625 Sm-04 Lt-01 3311 21 1793 7 0.6967 Sm-05 Lt-02 4010 32 1879 7 0.8033 T-01 Lt-03 4931 33 2050 9 0.7461 T-02 Lt-04 4285 36 1360 9 0.7655 T-03
N
W
g(1)
k
B
2279 1264 1284 1032 695 968 845 892 625 1059 753 2595 3853 6025 17205 756 1371 1966 3491 5588 1487 1171 617 736 447 1551 1827 2054
27 18 18 19 14 24 25 23 20 28 16 29 36 44 73 13 25 24 35 41 30 28 22 22 18 24 25 32
909 8 0.7231 568 6 0.6971 574 7 0.6366 425 5 0.7683 354 4 0.7607 128 6 0.7432 129 5 0.8021 99 8 0.6104 103 7 0.6254 74 6 0.7463 317 5 0.7207 947 7 0.7794 1366 8 0.7949 1851 9 0.8096 4396 12 0.8447 365 5 0.6557 427 7 0.7359 652 6 0.7749 702 8 0.7841 1594 8 0.8206 120 6 0.7856 97 6 0.7627 76 4 0.8171 77 4 0.8177 67 3 0.8520 466 6 0.7719 543 7 0.7389 448 7 0.7930
As can be seen in Figure 4.4, the dependence of B on N is only slightly linear, the slope of the straight line being 0.00001879. Though the parameters of the straight line yield a significant t-value, the determination coefficient yields a poor value of R² = 0.25. The mean of B is 0.7142. An extension of the data base will not bring about any improvement.

Figure 4.4: The relationship between text length N and index B

The situation is approximately the same as in the case of index A. In isolation, both indices show a certain geometric aspect of the forming of the distribution. But let us consider their ratio, A/B, and its relation to N. The ratio can be called the wording indicator of a text, “giving a complex picture of the play with words or rather word forms, their repetition and variation” (cf. Popescu & Altmann 2006b). The corresponding figure (cf. Figure 4.5, p. 86) looks like a cross-section of an asymmetrical funnel with an almost “horizontal” axis. When further texts are added, the cloud will keep its form and become denser. Its mean will possibly approach 1.0, or the funnel will converge to this value with great N. The differences between texts are great, especially in the domain of small N. Since both A and B are proportions, A/B can be considered a ratio of two independent proportions. The variance of A/B can easily be derived as
\[
\operatorname{Var}(A/B) = \frac{1}{B^2}\operatorname{Var}(A) + \frac{A^2}{B^4}\operatorname{Var}(B), \qquad (4.15)
\]
hence an asymptotic test criterion for the comparison of two texts, or for the deviation from an expected value, can easily be set up. In these proportions the values of h, k, V, W, f(1) and g(1) play the role of constants.
Figure 4.5: The relationship between the wording indicator A/B and text size N
5 The dynamics of word classes
Words can be classified in different ways, either nominally (categorially) or ordinally, if one uses at least a comparative property. If the property can be measured on a higher scale, one obtains some discrete classes or continuous intervals of the property (cf. Chapter 10). In grammar, the usual classification inherited from Latin is the classification into word classes like nouns, verbs, adjectives etc. There are several such classifications, allocating the words to 9 up to more than 100 classes. One cannot speak about the “correctness” of a classification but rather of its aim and conceptual background. Semantically defined classes differ from syntactically defined ones. It is not our aim to develop a new kind of classification but rather to show the method by which a certain classification can be used for textological purposes. Classification is a descriptive procedure. It acquires theoretical importance only if it is derived from laws or systems of laws (theories). Within classifications there are usually hierarchies distinguishing taxa, genera, species etc. For our illustrative purposes we shall distinguish between autosemantics (here nouns, verbs, adjectives, adverbs and numerals) and the rest, called auxiliaries or function words, though for other purposes we shall make other distinctions. Let us take the auxiliaries as a class and consider their presence (= proportion) in different frequency classes. We are concerned with the frequency spectrum. It is well known that among words occurring once (= hapax legomena, g(1)) the number of auxiliaries is very small, but proceeding to higher frequencies their number increases. In the class of most frequent words their proportion is 1.00. But from rank frequency distributions we know that several high frequency words are auxiliaries, hence the proportion 1 will be attained long before. Thus the slowly increasing curve must reach an inflection point at which it changes from convex to concave and approaches 1. This fact can be expressed by a simple differential equation (cf. Popescu, Best and Altmann 2007b). Let y(x) be the proportion of auxiliaries in the frequency class x. Then the rate of change is proportional both to the attained value and to the distance from 1, i.e.
\[
\frac{dy}{dx} = a\,y\,(1 - y), \qquad (5.1)
\]
yielding the solution
\[
y = \frac{1}{1 + b e^{-ax}}, \qquad (5.2)
\]
well known from historical linguistics, epidemiology, sociology etc. It is the logistic curve or S-curve, or, in our domain, the “Piotrowski law” (cf. Best & Kohlhase 1983). First we consider the auxiliaries as a class, then we show the behaviour of some other parts of speech. Consider the spectrum of frequencies in A.v. Droste-Hülshoff’s Der Geierpfiff (G-08), shown in the first two columns of Table 5.1.

Table 5.1: The spectrum of word forms in Droste-Hülshoff (from Popescu, Best, & Altmann 2007)

  x   g(x)   Aux   p(Aux)   p(Aux)theor
  1   380     74   0.195      0.169
  2    71     25   0.352      0.333
  3    14      7   0.500      0.551
  4    11      8   0.727      0.751
  5     6      6   1          0.881
  6     8      7   0.875      0.948
  7     4      4   1          0.978
  8     2      2   1          0.991
  9     1      1   1          0.996
 10     1      1   1          0.999
 11     2      2   1          0.999
 12     2      2   1          1
 16     2      2   1          1
 17     1      1   1          1
 20     1      1   1          1
 26     1      1   1          1
 36     1      1   1          1
 39     1      1   1          1

R² = 0.98
For the attribution to parts-of-speech (PoS) classes, no German PoS tagger has been used; rather all attributions have been made manually. The numbers are to be read as follows: there are 380 words occurring exactly once; 71 words occurring exactly twice, etc. In the third column one finds the number of auxiliaries within the given class, e.g. in the class of words occurring once there are 74 auxiliaries, yielding the proportion 74/380 = 0.1947 placed in
the fourth column (p(Aux)). As can be seen, the proportion increases almost monotonically and reaches 1 long before the highest frequency. For this trend we obtain the resulting curve (5.2) as
\[
p(\mathrm{Aux})_{theor} = \frac{1}{1 + 12.1185\,e^{-0.9002x}}.
\]
The determination coefficient is R2 = 0.98. Hence we can conclude that the auxiliaries as defined by us follow the prescribed trend. The trend is plotted in Figure 5.1.
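A possible way to reproduce such a fit is sketched below (Python with numpy and scipy assumed available; this is an illustration, not the procedure actually used by the authors). The data are the x and p(Aux) columns of Table 5.1; with them the fit should land close to the parameters reported above (a ≈ 0.90, b ≈ 12.12, R² ≈ 0.98).

```python
# Fit the logistic trend (5.2), y = 1 / (1 + b * exp(-a * x)), to the observed
# proportions of auxiliaries per frequency class (Table 5.1).

import numpy as np
from scipy.optimize import curve_fit

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 16, 17, 20, 26, 36, 39], float)
p_aux = np.array([0.195, 0.352, 0.500, 0.727, 1, 0.875, 1, 1, 1,
                  1, 1, 1, 1, 1, 1, 1, 1, 1])

def logistic(x, a, b):
    return 1.0 / (1.0 + b * np.exp(-a * x))

(a, b), _ = curve_fit(logistic, x, p_aux, p0=(1.0, 10.0))
r2 = 1 - np.sum((p_aux - logistic(x, a, b)) ** 2) / np.sum((p_aux - p_aux.mean()) ** 2)
print(round(a, 4), round(b, 4), round(r2, 2))
```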
Figure 5.1: The frequency structure of auxiliaries in A. Droste-Hülshoff’s text (G08)
For the sake of illustration let us consider another case with different PoS attribution. We use the English PoS tagger http://www.comp.lancs.ac.uk/ucrel/claws/trial.html and pool the following classes: articles (ATO), adverbs (AVO, AVP, AVQ), conjunctions (CJC, CJS, CJT), determiners (DPS, DTO, DTQ), existential “there” (EXO), pronouns (PNI, PNP, PNQ, PNX, POS), prepositions (PRF, PRP), the infinitive marker “to” (TOO) and the negation (XXO). Adverbs can be considered predicates of second order, hence one can examine this grouping by way of trial. We analyse the Nobel Lecture of E. Rutherford (Text E-08). The first 20 classes were left in their original form, the classes 21–100 have been pooled and the mean x = 60.5
was used; for the remaining classes 101–464 the mean 282.5 was taken. The result and the fitting are shown in Table 5.2 and displayed in Figure 5.2.

Table 5.2: The observed and computed proportion of “auxiliaries” in Rutherford’s Nobel lecture

    x      p(Aux)   p(Aux)theor
    1      0.0373     0.0693
    2      0.0513     0.0794
    3      0.0875     0.0908
    4      0.0227     0.1037
    5      0.0976     0.1182
    6      0.1250     0.1344
    7      0.2273     0.1524
    8      0.2500     0.1944
    9      0.3333     0.2185
   10      0.1667     0.2446
   11      0.1667     0.2728
   13      0.4000     0.3030
   14      0.2500     0.3349
   16      0.5000     0.4032
   20      0.5000     0.5489
   60.5    0.9286     0.9979
  282.5    1.0000     1.0000

y = 1/(1 + 15.5663 exp(−0.1471x)), R² = 0.94, IP = 18.66
The formula yields
\[
y = \frac{1}{1 + 15.5633\,e^{-0.1471x}}
\]
and the determination coefficient is R2 = 0.94. Evidently the dynamics of auxiliaries is law-like and follows a subconscious control. The computation in some other non-English texts (with authors’ “hand made” PoS tagging) yielded mostly satisfactory results as well (see Table 5.3). Evidently, the parameters differ according to the kind of qualitative pooling of PoS classes. Probably there will be differences between the languages, too. The role of N is unknown. Hence here a new research field starts. As can be seen in Figure 5.3 the parameter b does not depend on N in any way, i.e. it is formed in the course of text generation and represents a characteristic
Figure 5.2: Auxiliaries in Rutherford’s Lecture (E-08)
of text. Its values in individual texts represent stylistic variations. Formally, it is the integration constant. On the other hand, parameter a depends almost log-linearly on N. It is not a text characteristic. In formula (5.1) it is the proportionality constant, following mechanically the text increase. Though some deviations from the log-linear trend are considerable, an important role may be played by the quality of the tagger or by the way of establishing PoS classes and their pooling. This whole research area must be developed further. It goes without saying that autosemantics have just the opposite trend. If the proportion of auxiliaries in individual frequency classes follows a trend y_aux, then autosemantics as a group automatically have the trend 1 − y_aux. The curve is of the same kind because
\[
1 - y_{aux} = 1 - \frac{1}{1 + b e^{-ax}} = \frac{b e^{-ax}}{1 + b e^{-ax}} = \frac{1}{1 + \tfrac{1}{b} e^{ax}} = \frac{1}{1 + c\,e^{ax}} = y_{aut}. \qquad (5.3)
\]
The result need not be tested since it follows automatically. However, another question is whether all individual main autosemantic classes follow exactly this trend, or only their sum in general. Unfortunately, the hypothesis of a monotonous decrease of their proportion with increasing frequency class can be tested only in long texts, in which there is enough room for their repetition. For shorter texts the trend is not conspicuous enough and can be disturbed
92 The dynamics of word classes Table 5.3: Survey of dynamics of some auxiliaries (66 texts in seven languages) Text ID B-01 B-02 B-03 B-04 B-05 B-06 B-07 B-08 B-09 B-10 Cz-01 Cz-02 Cz-03 Cz-04 Cz-05 Cz-06 Cz-07 Cz-08 Cz-09 Cz-10 E-01 E-02 E-03 E-04 E-05 E-06 E-07 E-08 E-09 E-10 E-11 E-12 E-13
N 761 352 515 483 406 687 557 268 550 556 1044 984 2858 522 999 1612 2014 677 460 1156 2330 2971 3247 4622 4760 4862 5004 5083 5701 6246 8193 9088 11265
a
b
R2
Text ID
0.503 2.4461 0.70 G-01 1.261 10.1416 0.99 G-02 0.674 3.5324 0.90 G-03 1.045 5.8647 0.98 G-04 1.357 8.4482 1 G-05 0.803 5.7517 0.89 G-06 0.412 1.7361 0.51 G-07 1.425 6.8252 0.85 G-08 0.862 4.1546 0.92 G-09 1.16 10.4289 0.96 G-10 0.464 3.1478 0.56 G-11 0.561 5.4606 0.85 G-12 0.327 3.7282 0.95 G-13 0.484 4.4097 0.87 G-14 0.62 4.1239 0.87 G-15 0.533 4.6342 0.93 G-16 0.186 2.3334 0.61 G-17 1.125 16.3612 0.93 H-01 0.497 3.2014 0.88 H-02 0.856 8.4415 0.97 H-03 0.32 5.0772 0.78 H-04 0.234 3.925 0.76 H-05 0.201 4.9023 0.71 In-01 0.193 7.0149 0.76 In-02 0.246 7.9525 0.89 In-03 0.099 2.9526 0.61 In-04 0.182 4.0872 0.73 In-05 0.153 7.1273 0.70 R-01 0.174 5.3667 0.74 R-02 0.137 5.8501 0.73 R-03 0.086 4.6265 0.75 R-04 0.102 5.5493 0.75 R-05 0.102 5.5493 0.76 R-06
N
a
b
R2
1095 845 500 545 559 545 263 965 653 480 468 251 460 184 593 518 225 2044 1288 403 936 413 376 373 347 343 414 1738 2279 1264 1284 1032 695
0.9399 0.359 0.551 0.3037 1.1624 1.2188 0.7703 0.9002 1.3395 0.903 1.4313 1.7188 0.3306 1.9911 0.9578 0.3692 1.1313 0.3193 0.4721 1.2668 1.0783 0.4472 0.2103 0.1511 0.411 0.3591 0.1611 0.4265 0.6897 0.8122 0.7948 1.0499 1.4529
14.311 2.8639 4.6406 1.5491 7.4375 8.1308 5.8229 12.1185 8.6354 5.6211 11.2433 16.4687 2.3301 14.6534 6.7369 1.7524 9.5409 4.6037 4.1255 10.6036 11.0967 3.3814 3.1922 1.9517 5.6466 3.854 4.5405 7.8105 11.2466 10.2113 9.3003 19.864 17.9624
0.96 0.88 0.91 0.6 0.93 0.91 0.84 0.98 0.87 0.90 0.97 1 0.80 0.74 0.98 0.71 0.56 0.8 0.96 0.97 1 0.73 0.53 0.41 0.68 0.82 0.40 0.92 0.96 0.98 0.98 0.99 0.83
even by one word. For example in the text G-08 (Droste-Hülshoff) we have the following proportion of adjectives in individual frequency classes: for x = 1: p1 = 0.197, for x = 2: p2 = 0.070, for x = 3: p3 = 0.071 and for x = 6: p6 = 0.125, i.e. we obtain a convex curve, contradicting our hypothesis only
Figure 5.3: The relationship of the parameters of curve 5.2 to text length
because the text is short. Hence we leave it to the reader to scrutinize this further.
6 Thematic concentration of the text
Each writer/speaker speaks about something – though in everyday life the impression may arise that there is more “empty talk” than information in people’s communication. Yet, writers usually speak about a sharply defined topic and concentrate more or less on the core of the information. Usually this topic is linguistically represented by a particular number of nouns (or even proper nouns) and first-order predicates, namely verbs and adjectives. All other linguistic units are represented by predicates of a higher order (e.g. adverbs are second order predicates because they modify verbs and adjectives), or relational or referential expressions. Let us summarizingly call nouns, verbs and adjectives thematic words, although e.g. in theatre plays it is rather proper nouns and pronouns that are mainly relevant. Thematic words can thus be defined according to circumstances. Thematic concentration is a matter of the speaker’s/writer’s intention, intuition and eloquence but as an object of research it must be defined operationally. Various methods of content analysis can be used for this purpose. Another method is the denotative analysis taking into account all references and operationally defining the text-core (cf. Ziegler & Altmann 2002). Here we shall use the fact that the h-point is a fuzzy border between auxiliaries and autosemantics. It is fuzzy because autosemantics can occur above this border, and auxiliaries can also occur below this border. Those autosemantics from the class of thematic words, which occur above the h-point, are strongly characterized by their frequency behaviour: they represent the text theme as nouns or first order predicates expressing the properties or actions of the central words. The situation can be illustrated by way of an example from Popescu & Altmann (2008), using the thirty most frequent words in E. Rutherford’s Nobel Lecture (E-08). The data can be seen in Table 6.1. We locate the h-point at r = 26 = h. There are six thematic word forms which have a rank smaller or equal to h, namely radium, particles, helium, particle, atom and rays. These word forms are sufficient to guess the theme of Rutherford’s text. But their concentration must be expressed quantitatively. Popescu & Altmann (2008) proposed to consider both the frequency and the weight of the word using its distance from h. Let the frequency of a pre-h word be f (r′ ) and its weight simply the distance h − r′ . If h = 26 and the
Table 6.1: The most frequent 30 words from E-08 (E. Rutherford)

  r   f(r)   Word         r   f(r)   Word         r   f(r)   Word
  1   466    the         11    60    it           21    39    atom
  2   382    of          12    58    from         22    36    with
  3   140    a           13    56    particles    23    30    for
  4   121    that        14    53    helium       24    29    rays
  5   116    and         15    51    is           25    28    an
  6   113    in          16    45    particle     26    27    as
  7    87    to          17    42    be           27    25    its
  8    85    was         18    42    this         28    24    on
  9    64    radium      19    41    at           29    24    or
 10    63    by          20    40    were         30    23    radioactive
word occurs at rank 1, then its weight is 26 − 1 = 25, i.e. the lower the rank of the thematic word, the greater its weight. In order to relativize the weight × frequency, (h − r′) f(r′), of each thematic word, we divide it by the sum of all weights and by the greatest possible frequency, namely f(1). The sum of all weights yields
\[
\sum_{r=1}^{h} (h - r) = h\cdot h - \sum_{r=1}^{h} r = h^2 - \frac{h(h+1)}{2} = \frac{h(h-1)}{2}.
\]
Hence we obtain an index of thematic concentration in the form
\[
TC = \frac{2}{h(h-1)\,f(1)} \sum_{r'=1}^{T} (h - r')\, f(r'), \qquad (6.1)
\]
where r′ are the ranks of the thematic words. For the sake of illustration we compute the index for Rutherford’s text E-08 in detail. Here h = 26, there are 6 autosemantics in the pre-h domain, hence T = 6, and further f(1) = 466. We first compute the normalizing constant
\[
C = \frac{2}{h(h-1)\,f(1)} = \frac{2}{26(25)466} = 0.0000066028
\]
and multiply it with the individual weights and frequencies. For example, radium is at rank r′ = 9 and has frequency f(9) = 64, hence (26 − 9) · 64 = 1088. Multiplied with the normalizing constant C we obtain 0.0000066028 · 1088 = 0.00718.
The six thematic word forms yield the following values:

Word         Rank   Frequency   (h − r′) f(r′) C
radium         9       64          0.00718
particles     13       56          0.00481
helium        14       53          0.00420
particle      16       45          0.00297
atom          21       39          0.00129
rays          24       29          0.00038
TC                                 0.02083
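The computation of (6.1) can be sketched as follows (Python, used here only for illustration; the function name is ours). The Rutherford numbers reproduce the small table above; multiplying TC by 1000 gives the tcu value defined just below.

```python
# Thematic concentration TC of formula (6.1): weight each pre-h thematic word by
# its distance from the h-point, sum, and normalize by h(h-1)f(1)/2.

def thematic_concentration(h, f1, thematic):
    """thematic: list of (rank, frequency) pairs of thematic words with rank <= h."""
    norm = 2.0 / (h * (h - 1) * f1)                       # normalizing constant C
    return sum((h - r) * f for r, f in thematic) * norm

rutherford = [(9, 64), (13, 56), (14, 53), (16, 45), (21, 39), (24, 29)]
tc = thematic_concentration(h=26, f1=466, thematic=rutherford)
print(round(tc, 5), round(1000 * tc, 2))   # ~0.02083 and ~20.83 tcu
```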
We see that the individual values are of order 1/1000. To obtain a slightly more lucid number we define the thematic concentration unit (or shortly, tcu) as tcu = 1000 (TC) . For the given text, we obtain tcu(E-08)= 20.83. In Table 6.2 one finds the tcu values for different texts in different languages. Table 6.2: Thematic concentration TC (in tcu) for 75 texts in nine languages Text ID h
f (1)
B-01 B-02 B-03 B-04 B-05 B-06 B-07 B-08 B-09 B-10 Cz-01 Cz-02 Cz-03 Cz-04 Cz-05
40 13 15 21 19 28 19 10 20 26 58 56 182 27 84
10 8 9 8 7 9 8 6 9 7 9 11 19 7 9
pre-h autosemantics (rank/frequency) e (5/14)
e (5/10)
e (4/7) e (6/11) e (7/8) pan (6/10) strýc (3/21). Pepin (9/11) strýc (5/54) Vladimír (4/9), Bondy (5/7)
TC (in tcu)
38.89 0 0 0 50.13 0 0 93.33 45.83 0 14.37 61.69 24.29 72.31 0 (continued on next page)
Table 6.2 (continued from previous page) Text ID h
f (1)
Cz-06 Cz-07 Cz-08 Cz-09 Cz-10 E-03 E-04
13 15 8 6 11 19 23
E-05 E-06
26 24
E-07 E-08
25 26
E-09 E-10
29 28
E-11
32
E-12
39
E-13
41
G-01 G-02 G-03 G-04 G-05 G-06 G-07 G-08 G-09 G-10 G-11 G-12 G-13
12 9 8 8 8 8 5 11 9 7 7 6 8
106 0 134 pˇrednosta (8/28), ˇrekl (14/17) 15.14 31 0 30 0 50 0 229 peace (13/33) 5.06 366 political (10/47), politics (16/32), individual 9.85 (21/26), individuals (22/25), rules (23/24) 297 art (25/26) 0.27 460 insulin (15/41), sugar (18/34), pancreas (20/33), 6.04 blood (22/31), symptoms (24/24) 237 American (16/36), America (24/27) 4.94 466 radium (9/64), particles (13/56), helium (14/53), 20.83 particle (16/45), atom (21/39), rays (24/29) 342 power (28/30) 0.22 546 nuclear (10/70), world (12/50), weapons (13/47), 19.35 war (14/47), great (20/37), nations (21/35). human (27/29) 622 insulin (9/109), sugar (11/75), diet (21/46), patient 16.92 (23/45), blood (27/39), carbohydrate (31/33) 617 people (16/73), novel (17/73), Chinese (20/65), 10.34 novels (36/42), China (37/41) 780 time (25/67), theory (36/51), quantum (39/45), 2.22 electrodynamics (41/41) 83 0 48 Bär (5/12) 27.78 33 0 32 Vöglein (7/10) 11.16 30 0 30 Kamel (5/11) 39.29 17 0 39 0 30 0 18 0 18 0 14 0 19 Gorm (7/8) 15.04 (continued on next page)
pre-h autosemantics (rank/frequency)
TC (in tcu)
Table 6.2 (continued from previous page) Text ID h
f (1)
G-14 G-15 G-16 G-17 H-01 H-02 H-03 H-04 H-05 I-01 I-02 I-03 I-04 I-05 In-01
5 8 8 6 12 8 4 7 6 37 25 10 21 12 6
10 16 16 11 225 130 48 76 32 388 257 64 118 42 16
In-02
7
18
In-03
6
14
In-04 In-05
5 8
11 16
Lt-01 Lt-02 Lt-03 Lt-04 Lt-05 Lt-06 R-01 R-02 R-03 R-04 R-05 R-06
12 18 19 20 8 7 14 16 12 10 11 10
pre-h autosemantics (rank/frequency)
Tiger (6/8) Vater (3/9) magyarok (8/16) nyelvi (8/7)
kuncze (5/7) don (23/28)
coretti (11/13) tim (1/16), Manuel (3/12), asisten (5/8), pelatih (6/7) program (2/11), bri (5/10), bank (6/9), hadiah (7/7) asisten (1/14), psm (3/9), pelatih (4/9), perubahan (5/8) asap (4/7) penumpang (1/16), km (2/11), kapal (4/10), pelni (6/10)
133 190 inquit (5/35) 103 99 33 Caesar (5/16) 19 62 110 65 49 46 30
TC (in tcu) 0 0 35.71 163.64 4.31 0 0 0 14.58 0 0.73 0 0 4.69 516.67 222.22 585.71 63.64 531.25 0 15.65 0 0 5195 0 0 0 0 0 0 0
The pre-h content words are very characteristic of the text. As a matter of fact, they show what the text is about. The study of genres or styles could be made more exact if one would use this kind of characterization. In general, scientific texts have greater thematic concentration than artistic ones. If we examined different sciences and different genres in art, the picture would possibly get more exact contours. In any case art and social sciences will follow “hard science” and within art, poetry will have the smallest thematic concentration. Poets do not like repetition, they care for “good” style. But in one language all texts of the same genre can have quite different thematic concentration than in another. This case can be observed e.g. in Indonesian vs. Hungarian journalistic texts. The difference is, perhaps, caused by the very strong agglutination in Hungarian giving rise to many word forms. Hence an interlingual comparison should take into account also some boundary conditions that must be quantified. Or, in other words, thematic concentration should be embedded in a Köhlerian self-regulation cycle. Further research must be made on much broader basis, comparing works of individual authors. There are authors who wrote both scientific and literary works and their style can be compared using this criterion. In any case, thematic concentration operationalized in this way could become a simple but exact style characteristic. There are, of course, other possibilities of measuring thematic concentration but all of them are more complex and require the involvement of semantics.
7 Crowding, pace filling and compactness

7.1 Crowding
In the previous chapter we considered the thematic autosemantics above the h-point as indicators of the thematic concentration of a given text. However, the h-point can also be used for further purposes. Consider the domain above the h-point as the first element in the sequence of intervals of length h into which the rank frequency distribution can be partitioned. Since there are V different word forms (vocabulary) in the text, there are V/h rank intervals of length h. For example, in Rutherford's text (E-08) there are V = 995 different words and h = 26, hence there are V/h = 995/26 = 38.27 rank steps. The number of autosemantics in such an interval can be called the crowding of autosemantics. The minimal number of autosemantics in an h-interval is 0, the maximum is h. Hence the crowding could be normalized (relativized) in a simple way, making texts directly comparable. It is generally known that the number of autosemantics in the first h-step is small – because the most frequent words are auxiliaries – and that in the next steps it increases. The increase is very quick at the beginning, and the rate of change slows down the nearer the crowding (y) comes to its asymptote, which is usually smaller than h. If we denote the asymptote as a, the rate of change of crowding can be expressed as

dy/dx = k(a − y)
(7.1)
where x represents the individual steps (h-intervals), y is the crowding (number of autosemantics in the h-interval), a is the asymptote and k is the proportionality constant. Assuming y(0) = 0 we obtain the solution of (7.1), y = a (1 − exp(−kx)) ,
(7.2)
representing the increase of the number of autosemantics in subsequent h-intervals. The general form of this dependence can be seen in Figure 7.1.

Figure 7.1: The exponential function fitting the crowding of autosemantics in subsequent h-intervals of the rank frequency distribution (from Popescu & Altmann 2007c)

It is to be noted that the curve depends on the parts of speech making up the class of autosemantics. Other definitions of autosemantics would change the parameters, but the general trend would be the same. Needless to say, the trend of auxiliaries is just the other way round, symmetric to (7.2), because the number of auxiliaries decreases in subsequent h-intervals. Perhaps there are special trends for individual parts of speech depending on style and genre, but this problem must be left to future research. Let us illustrate the problem using the same text, namely Rutherford's Nobel Lecture (E-08). The data are shown in Table 7.1. Here h = 26, hence the yardstick yields 38 full intervals (V/h = 995/26 = 38.27) and one partial interval at the end of the distribution, which can be discarded. For E-08 we obtain the curve

y = 21.4658 (1 − exp(−0.35932x)).

The re-ranking of intervals is shown in the second column of Table 7.1; the number of autosemantics in each interval is in the third column. As can be seen, the curve does not reach h = 26, but the data reach it quickly, and from the seventh interval onwards we have only a random fluctuation of values. This fluctuation is caused by the specific way the words are ordered in the counter output. We would obtain other fluctuations if the words in a given class were ordered alphabetically, or according to their length, etc.
Table 7.1: Stepwise crowding of autosemantics in Rutherford's Nobel lecture (E-08) (from Popescu & Altmann 2007c); yardstick h = 26

yardstick   re-ranking x (paces)   autosemantic pace filling y     yardstick   re-ranking x (paces)   autosemantic pace filling y
 26          1                      6                               546         21                     20
 52          2                     14                               572         22                     23
 78          3                     13                               598         23                     22
104          4                     19                               624         24                     24
130          5                     12                               650         25                     21
156          6                     19                               676         26                     22
182          7                     25                               702         27                     24
208          8                     19                               728         28                     20
234          9                     20                               754         29                     23
260         10                     19                               780         30                     22
286         11                     20                               806         31                     23
312         12                     18                               832         32                     20
338         13                     20                               858         33                     24
364         14                     22                               884         34                     25
390         15                     19                               910         35                     24
416         16                     22                               936         36                     22
442         17                     22                               962         37                     21
468         18                     21                               988         38                     20
494         19                     17                               995         38.27*                  7
520         20                     20

* V/h = 995 (ranks) / 26 (ranks/pace) = 38.27 paces
Thus the fluctuation itself is irrelevant. The data and the curve are presented in Figure 7.2.

Figure 7.2: Fitting the exponential to autosemantic compactness in text E-08

The fitting yields for Rutherford (E-08) only R² = 0.65, but the t-test for the parameters and the F-test for the regression yield highly significant results.
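A minimal sketch of the computation just described – partitioning the ranked vocabulary into h-intervals, counting the autosemantics per interval and fitting (7.2) – is given below. The word list, the decision which word forms count as autosemantics, and the start values of the optimization are assumptions of the example; the original computations were carried out with other software.

```python
# A minimal sketch (not the authors' program): partition the ranked vocabulary
# into intervals of length h, count the autosemantics per interval and fit
# y = a*(1 - exp(-k*x)), i.e. equation (7.2).
import numpy as np
from scipy.optimize import curve_fit

def crowding(ranked_words, autosemantics, h):
    """ranked_words: word forms ordered by decreasing frequency;
    autosemantics: set of word forms counted as content words (assumed given)."""
    full = len(ranked_words) // h                 # complete h-intervals only
    x = np.arange(1, full + 1)
    y = np.array([sum(w in autosemantics for w in ranked_words[i*h:(i+1)*h])
                  for i in range(full)], dtype=float)
    return x, y

def fit_crowding(x, y):
    model = lambda t, a, k: a * (1.0 - np.exp(-k * t))
    (a, k), cov = curve_fit(model, x, y, p0=[y.max(), 0.3])
    residuals = y - model(x, a, k)
    r2 = 1.0 - residuals.var() / y.var()
    return a, k, np.sqrt(np.diag(cov)), r2        # estimates, std. errors, R^2

# For E-08 (h = 26) such a fit should return values close to a = 21.5, k = 0.36.
```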
7.2 Pace filling
In the preceding section we interpreted the crowding of autosemantics in the first h-interval as thematic concentration. Analogously, one could call the crowding in the second h-interval a secondary thematic concentration, etc. Texts could easily be compared with respect to this. For the sake of a more complete comparison, however, we rather consider the extent a/h to which function (7.2) approaches the yardstick h. We call a/h the autosemantic pace filling (APF). Here all intervals are taken into account, not only the first one. In Table 7.2 it can be seen that APF does not depend on N, while the tangent of the curve at (0, 0), characterizing the compactness (see below), strongly depends on N. The comparison of two texts is very simple. Consider the APF values of E-08 (E. Rutherford) and E-07 (S. Lewis), two texts of approximately the same length:

               E. Rutherford (E-08)      S. Lewis (E-07)
h              26                        25
a              21.407                    22.612
APF = a/h      21.407/26 = 0.823         22.612/25 = 0.904

As can be seen in Figure 7.2, the optimization software provides for each parameter its empirical standard deviation. Since Var(APF) = Var(a/h) = Var(a)/h², we can set up an asymptotic test for the difference of two autosemantic pace fillings as

z = (APF₁ − APF₂) / √(Var(APF₁) + Var(APF₂)) = (APF₁ − APF₂) / √(Var(a₁)/h₁² + Var(a₂)/h₂²)
(7.3)
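A small helper implementing the test (7.3) might look as follows; the parameter variances are assumed to be supplied by the fitting software, as mentioned above.

```python
# Sketch of the asymptotic test (7.3) for two autosemantic pace fillings.
from math import sqrt

def apf_z(a1, var_a1, h1, a2, var_a2, h2):
    apf1, apf2 = a1 / h1, a2 / h2
    # Var(APF) = Var(a)/h^2, hence the pooled standard error below
    se = sqrt(var_a1 / h1**2 + var_a2 / h2**2)
    return (apf1 - apf2) / se

# Rutherford (E-08) vs. Lewis (E-07), using the standard deviations 0.41882 and 0.30031:
z = apf_z(21.407, 0.41882**2, 26, 22.612, 0.30031**2, 25)   # approx. -4.0
```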
In our case we obtain

z = (0.823 − 0.904) / √(0.41882²/26² + 0.30031²/25²) = −4.03,
indicating a highly significant difference. The APF with Lewis is significantly greater than that with Rutherford, though the difference is visually rather small. Table 7.2 presents the results, with tan α = ak denoting autosemantic compactness, and a/h pace filling. Table 7.2: Autosemantics compactness and pace filling data (64 texts in seven languages) Text ID B-01 B-02 B-03 B-04 B-05 B-06 B-07 B-08 B-09 B-10 Cz-01 Cz-02 Cz-03 Cz-04 Cz-05 Cz-06 Cz-07 Cz-08 Cz-09 Cz-10 E-03 E-04 E-05
N
h
a
k
761 352 515 483 406 687 557 268 550 556 1044 984 2858 522 999 1612 2014 677 460 1156 3247 4622 4760
10 8 9 8 7 9 8 6 9 7 9 11 19 7 9 13 15 8 6 11 19 23 26
7.888 6.283 6.423 5.455 5.154 7.317 6.122 4.048 7.043 5.009 7.357 8.472 15.576 5.234 7.055 11.449 11.764 5.810 4.030 8.765 16.677 20.204 23.562
0.104 0.204 0.177 0.173 0.152 0.137 0.126 0.191 0.149 0.182 0.133 0.183 0.143 0.156 0.115 0.101 0.176 0.181 0.176 0.133 0.190 0.256 0.208
tan α = ak
a/h
0.820 0.789 1.282 0.785 1.137 0.714 0.944 0.682 0.783 0.736 1.002 0.813 0.771 0.765 0.773 0.675 1.049 0.783 0.912 0.716 0.978 0.817 1.550 0.770 2.227 0.820 0.817 0.748 0.811 0.784 1.156 0.881 2.070 0.784 1.052 0.726 0.709 0.672 1.166 0.797 3.169 0.878 5.172 0.878 4.901 0.906 (continued on next page)
Table 7.2 (continued from previous page) Text ID E-06 E-07 E-08 E-09 E-10 E-11 E-12 E-13 G-01 G-02 G-03 G-04 G-05 G-06 G-07 G-08 G-09 G-10 G-11 G-12 G-13 G-14 G-15 G-16 G-17 H-01 H-02 H-03 H-04 H-05 In-01 In-02 In-03 In-04 In-05 R-01 R-02
N
h
a
k
4862 5004 5083 5701 6246 8193 9088 11265 1095 845 500 545 559 545 263 965 653 480 468 251 460 184 593 518 225 2044 1288 403 936 413 346 373 347 343 414 1738 2279
24 25 26 29 28 32 39 41 12 9 8 8 8 8 5 11 9 7 7 6 8 5 8 8 6 12 8 4 7 6 6 7 6 5 8 14 16
20.483 22.612 21.407 26.434 24.426 27.185 35.975 34.196 9.829 6.039 6.125 6.226 6.476 6.258 3.782 9.038 7.790 6.007 5.921 4.774 5.674 3.544 6.301 6.475 4.827 10.082 6.441 3.201 5.432 4.727 4.253 5.568 4.129 3.785 6.669 11.658 14.210
0.260 0.164 0.359 0.200 0.259 0.238 0.252 0.316 0.177 0.167 0.216 0.127 0.117 0.124 0.217 0.167 0.101 0.106 0.103 0.208 0.281 0.233 0.128 0.097 0.196 0.175 0.133 0.143 0.142 0.167 2.569 0.347 3.398 0.291 0.628 0.173 0.152
tan α = ak
a/h
5.326 0.853 3.708 0.904 7.685 0.823 5.287 0.912 6.326 0.872 6.470 0.850 9.066 0.922 10.806 0.834 1.740 0.819 1.009 0.671 1.323 0.766 0.791 0.778 0.758 0.810 0.776 0.782 0.821 0.756 1.509 0.822 0.787 0.866 0.637 0.858 0.610 0.846 0.993 0.796 1.594 0.709 0.826 0.709 0.807 0.788 0.628 0.809 0.946 0.805 1.764 0.840 0.857 0.805 0.458 0.800 0.771 0.776 0.789 0.788 10.926 0.709 1.932 0.795 14.030 0.688 1.101 0.757 4.188 0.834 2.017 0.833 2.160 0.888 (continued on next page)
Table 7.2 (continued from previous page)

Text ID    N      h    a        k       tan α = ak   a/h
R-03       1264   12   10.221   0.157   1.605        0.852
R-04       1284   10    8.466   0.133   1.126        0.847
R-05       1032   11    9.287   0.174   1.616        0.844
R-06        695   10    8.532   0.179   1.527        0.853

7.3 Compactness
Let us study the exponential function in (7.2) depicted in Figure 7.1. Since the higher ranked h-intervals are full of autosemantics, it is rather the beginning of the curve, i.e. the thematic concentration (the secondary thematic concentration, etc.), which causes the autosemantic compactness (AC) of the text. Thus the slope of the curve at x = 0 plays a decisive role in its characterization. The first derivative of (7.2) yields

dy/dx = ak exp(−kx)
(7.4)
which after inserting x = 0 yields tan α = ak = AC.
(7.5)
As can be seen in Table 7.2, the range of these values is greater and they evidently depend almost linearly on text length N. But even in this case it is possible to perform tests for the difference between two ACs. To this end we again need the variance of AC. From (7.5), by means of a Taylor expansion, one can easily obtain the approximate equality

V(AC) = V(ak) = (∂AC/∂a)² Var(a) + (∂AC/∂k)² Var(k)
(7.6)
yielding

V(AC) = k² Var(a) + a² Var(k),
(7.7)
in which the estimated values of the parameters can be inserted. We ignore the covariance of the parameters, considering them independent for the sake of simplicity, and set up an asymptotic criterion for measuring the difference between the autosemantic compactness of two texts as

z = (AC₁ − AC₂) / √(V(AC₁) + V(AC₂))
(7.8)
For illustration we use again texts E-08 and E-07, which have similar text lengths (5083 and 5004 for E-08 and E-07, respectively) but very different compactness:

               Rutherford (E-08)      Lewis (E-07)
a              21.407                 22.612
k              0.359                  0.164
AC = ak        7.685                  3.708
Var(a)         0.419²                 0.300²
Var(k)         0.050²                 0.013²
Inserting these values in (7.7) we obtain

V(AC_E-08) = 0.359²·0.419² + 21.407²·0.050² = 1.1673,
V(AC_E-07) = 0.0888.

Then (7.8) gives

z = (7.685 − 3.708) / √(1.1673 + 0.0888) = 4.457,
which is, again, highly significant. Taking into account the covariances would give a slightly smaller result; recall the test between the two APFs above. Consider now two texts with very different N and different AC, namely Rutherford's E-08 and Feynman's E-13:

               Rutherford (E-08)      Feynman (E-13)
N              5083                   11265
a              21.407                 34.196
k              0.359                  0.316
ak = AC        7.685                  10.806
V(a)           0.419²                 0.6433²
V(k)           0.050²                 0.040²
V(AC)          1.1673                 1.9131
Inserting all these values in (7.8) we obtain

z = (7.685 − 10.806) / √(1.1673 + 1.9131) = −1.77,

showing that texts with very different N need not differ significantly, even if we neglect the expectations of both ACs given by the linear regression. Using the parameters of the curve it is possible to determine the place of a text in a two-dimensional coordinate system. Remember that the tangent line to the curve at x = 0 has the equation y = akx. Further, the asymptote of the curve is a. The crossing point of these two straight lines is P(1/k, a). Thus the autosemantic compactness (AC = ak = tan α) varies proportionally with k and inversely with 1/k, the x-coordinate of point P. From the points determining the position of the autosemantic construction of texts one can compute distances between texts and perform classifications or discriminant analyses. In Table 7.3 one can find the autosemantic coordinates of some texts. The graphical representation is given in Figure 7.3.
Figure 7.3: Autosemantic coordinates of 64 texts in seven languages ordered by the compactness length 1/k
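A sketch of how (7.5), (7.7) and (7.8), as well as the coordinates P(1/k, a) discussed above, could be computed; the variance values are taken from the example above, and the covariance of a and k is ignored, as in the text.

```python
# Sketch of the compactness test: (7.5) AC = ak, (7.7) for its variance
# (covariance of a and k ignored), (7.8) for the asymptotic z, plus the
# coordinates P(1/k, a) of a text.
from math import sqrt

def ac_and_variance(a, k, var_a, var_k):
    ac = a * k                                # (7.5)
    var_ac = k**2 * var_a + a**2 * var_k      # (7.7)
    return ac, var_ac

def ac_z(ac1, var_ac1, ac2, var_ac2):
    return (ac1 - ac2) / sqrt(var_ac1 + var_ac2)   # (7.8)

def coordinates(a, k):
    return 1.0 / k, a                         # crossing point of tangent and asymptote

ac_r, v_r = ac_and_variance(21.407, 0.359, 0.419**2, 0.050**2)   # E-08 (Rutherford)
ac_l, v_l = ac_and_variance(22.612, 0.164, 0.300**2, 0.013**2)   # E-07 (Lewis)
z = ac_z(ac_r, v_r, ac_l, v_l)   # clearly beyond the 1.96 significance threshold
```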
As can easily be deduced from Table 7.3 and Figure 7.3, the indices leaning against the h-point are good stylistic indicators useful for characterization,
Table 7.3: Autosemantic coordinates of 64 texts in seven languages Text ID B-01 B-02 B-03 B-04 B-05 B-06 B-07 B-08 B-09 B-10 Cz-01 Cz-02 Cz-03 Cz-04 Cz-05 Cz-06 Cz-07 Cz-08 Cz-09 Cz-10 E-03 E-04 E-05 E-06 E-07 E-08 E-09 E-10 E-11 E-12 E-13 G-01
N
h
1/k
a
761 352 515 483 406 687 557 268 550 556 1044 984 2858 522 999 1612 2014 677 460 1156 3247 4622 4760 4862 5004 5083 5701 6246 8193 9088 11265 1095
10 8 9 8 7 9 8 6 9 7 9 11 19 7 9 13 15 8 6 11 19 23 26 24 25 26 29 28 32 39 41 12
9.615 4.902 5.650 5.780 6.579 7.299 7.937 5.236 6.711 5.495 7.519 5.464 6.993 6.410 8.696 9.901 5.682 5.525 5.682 7.519 5.263 3.906 4.808 3.846 6.098 2.786 5.000 3.861 4.202 3.968 3.165 5.650
7.888 6.283 6.423 5.455 5.154 7.317 6.122 4.048 7.043 5.009 7.357 8.472 15.576 5.234 7.055 11.449 11.764 5.810 4.030 8.765 16.677 20.204 23.562 20.483 22.612 21.407 26.434 24.426 27.185 35.975 34.196 9.829
Text ID G-02 G-03 G-04 G-05 G-06 G-07 G-08 G-09 G-10 G-11 G-12 G-13 G-14 G-15 G-16 G-17 H-01 H-02 H-03 H-04 H-05 In-01 In-02 In-03 In-04 In-05 R-01 R-02 R-03 R-04 R-05 R-06
N
h
1/k
a
845 500 545 559 545 263 965 653 480 468 251 460 184 593 518 225 2044 1288 403 936 413 346 373 347 343 414 1738 2279 1264 1284 1032 695
9 8 8 8 8 5 11 9 7 7 6 8 5 8 8 6 12 8 4 7 6 6 7 6 5 8 14 16 12 10 11 10
5.988 4.630 7.874 8.547 8.065 4.608 5.988 9.901 9.434 9.709 4.808 3.559 4.292 7.813 10.309 5.102 5.714 7.519 6.993 7.042 5.988 0.389 2.882 0.294 3.436 1.592 5.780 6.579 6.369 7.519 5.747 5.587
6.039 6.125 6.226 6.476 6.258 3.782 9.038 7.790 6.007 5.921 4.774 5.674 3.544 6.301 6.475 4.827 10.082 6.441 3.201 5.432 4.727 4.253 5.568 4.129 3.785 6.669 11.658 14.210 10.221 8.466 9.287 8.532
classification and discrimination of texts. There are distinct domains and differences in both dimensions that could be elaborated on using some advanced statistical methods. But both languages and genres should be studied more systematically.
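As an illustration of such a classification, the following sketch computes pairwise Euclidean distances between texts in the (1/k, a) plane; the three texts and their coordinates are taken from Table 7.3, and the choice of the Euclidean metric is an assumption of the example.

```python
# Sketch: Euclidean distances between texts in the (1/k, a) plane of Table 7.3,
# as a basis for classification or discriminant analysis.
from math import dist  # Python 3.8+

coords = {            # a small illustrative subset of Table 7.3: (1/k, a)
    "E-08": (2.786, 21.407),
    "E-13": (3.165, 34.196),
    "G-01": (5.650, 9.829),
}
pairs = [(s, t) for i, s in enumerate(coords) for t in list(coords)[i + 1:]]
distances = {(s, t): dist(coords[s], coords[t]) for s, t in pairs}
```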
8 Autosemantic text structure

8.1 Introduction
In previous chapters we considered the simple frequency of word forms and tried to find some characteristics of texts. The words were ordered according to their frequency and we found different kinds of order. We did not consider the structure of the text, one special aspect of which is the association between autosemantics. We say that two words are to some extent associated if they occur in the same sentence. However, this association may be random or very weak. In order to be associated in a significant way, words must occur together in several sentences. Thus our first task is to compute the probability of the given or a more extreme co-occurrence (see below). The (non-randomly) associated words can be joined with a line representing the edge of a graph. After joining all associated words we obtain a simple graph representing the associative autosemantic structure of the text. However, graphs obtained from texts with many autosemantics are so confusing that their graphical presentation does not bring any elucidation. Nevertheless, these graphs have many properties which can be computed and interpreted textologically. The properties can be compared, one can make statements about the text structure, one can even find the distance between two words, the "thematic density" of the structure, the core of the text, its diameter, etc. This investigation is, however, based on strongly lemmatised text; otherwise we would be forced to admit that different forms of an individual word are not associated with one another because they do not occur in the same sentence. The lemmatisation can be performed in different ways, according to the grammatical philosophy of the researcher (or the available program). Since co-occurrence does not mean immediate neighbourhood but only common occurrence in a sentence (or in a verse), whatever its length, programs analysing word bigrams are not adequate. In German, separable prefixes must be added to the word, i.e., the program must be very flexible. Before we make the first steps in constructing the graphs and computing their properties, let us demonstrate the preparation of a text for this analysis. We use a short poem already analysed statistically in different ways (cf. Ziegler & Altmann 2002; Altmann, V. & Altmann, G. 2005). The text is taken from the Gutenberg project (http://Gutenberg.spiegel.de/) and presented in Table 8.1. The
left side of the table contains the original version; the right side contains the lemmas. The punctuation has been discarded, each line is a separate counting unit. Table 8.1: Original and lemmatised text of Goethe’s Erlkönig Wer reitet so spät durch Nacht und Wind? Es ist der Vater mit seinem Kind; Er hat den Knaben wohl in dem Arm, Er faßt ihn sicher, er hält ihn warm.
reiten Nacht Wind Vater Kind Knabe Arm fassen warm halten
Mein Sohn, was birgst du so bang dein Gesicht? – Siehst Vater, du den Erlkönig nicht? Den Erlenkönig mit Kron und Schweif? – Mein Sohn, es ist ein Nebelstreif. –
Sohn bergen Gesicht Sehen Vater Erlkönig Erlkönig Krone Schweif Sohn Nebelstreif
»Du liebes Kind, komm, geh mit mir! Gar schöne Spiele spiel ich mit dir; Manch bunte Blumen sind an dem Strand, Meine Mutter hat manch gülden Gewand.«
lieb Kind kommen gehen schön Spiel spielen bunt Blume Strand Mutter gülden Gewand
Mein Vater, mein Vater, und hörest du nicht, Was Erlenkönig mir leise verspricht? Sei ruhig, bleibe ruhig, mein Kind; In dürren Blättern säuselt der Wind. –
Vater hören Erlkönig versprechen ruhig bleiben Kind dürr Blatt säuseln Wind
»Willst, feiner Knabe, du mit mir gehn? Meine Töchter sollen dich warten schön; Meine Töchter führen den nächtlichen Reihn Und wiegen und tanzen und singen dich ein.«
fein Knabe gehen Tochter warten Tochter führen nächtlich Reihen einwiegen eintanzen einsingen
Mein Vater, mein Vater, und siehst du nicht dort Erlkönigs Töchter am düstern Ort? – Mein Sohn, mein Sohn, ich seh es genau: Es scheinen die alten Weiden so grau. –
Vater sehen Erlkönig Tochter düster Ort Sohn sehen scheinen alt Weide grau
»Ich liebe dich, mich reizt deine schöne Gestalt; Und bist du nicht willig, so brauch ich Gewalt.« Mein Vater, mein Vater, jetzt faßt er mich an! Erlkönig hat mir ein Leids getan! –
lieben reizen schön Gestalt willig brauchen Gewalt Vater anfassen Erlkönig Leids tun
Dem Vater grauset’s, er reitet geschwind, Er hält in den Armen das ächzende Kind, Erreicht den Hof mit Mühe und Not; In seinen Armen das Kind war tot.
Vater grausen reiten halten Arm ächzen Kind erreichen Hof Mühe Not Arm Kind tot
As can be seen, double occurrences were not taken into account ("Mein Vater, mein Vater"), the participle was changed into a verb ("ächzend"), all auxiliaries were discarded, the references were not taken into account, and all adverbs were omitted. In German, the auxiliary verbs "sein" and "haben" were left out in all their forms. As to adverbial phrases, all non-adverbial autosemantics occurring in them were taken into account ("durch Nacht und Wind"). The result would be quite different if, e.g., the pronouns were replaced by the respective autosemantics or if one considered some kind of hrebs¹ rather than individual words. Since we only want to present some elementary methods, the positional co-occurrence of realized words is sufficient. One can define different text structures according to the aim of the investigation; the possibilities are practically infinite. Taking only autosemantics into account is a very restricted view, but the presented methods can be used for other purposes too. We now obtain a new frequency distribution of autosemantics, as shown in Table 8.2. Our next problem is to find the probability of co-occurrence of two words as given in the right part of Table 8.1, using the number of sentences (lines), S = 32, and the frequencies of the individual words as given in Table 8.2. This work can only be done using an appropriate program.
8.2 The probability of co-occurrence
Denote the number of occurrences of word A as M and that of word B as n. Not taking into account the possibly repeated occurrence of a given word in a sentence, the decision is dichotomic: either the given word occurs in the sentence, or it does not. If we thus have word A in M sentences and word B in n sentences, then A and B can co-occur in x sentences (which we regard as "given"), or in x + 1, x + 2, . . . , min{M, n} sentences. In this sense, the part from x + 1 to min{M, n} contains the "more extreme" cases. Consider the following schematic examples (which ignore all possible permutations), each consisting of S = 20 sentences, where A or B denotes that the respective word occurs in a given sentence, and Ā or B̄ that it does not.

¹ A hreb is a set, or an aggregate, of sentences all of which contain the same morpheme, word, phrase, synonyms of them, or references to them (cf. Ziegler & Altmann 2002).
Table 8.2: The frequency of lemmatised autosemantics (Goethe's Erlkönig)

6 VATER; 5 ERLKÖNIG, KIND; 3 ARM, SEHEN, SOHN, TOCHTER; 2 GEHEN, HALTEN, KNABE, REITEN, SCHÖN, WIND; 1 ALT, ANFASSEN, BERGEN, BLATT, BLEIBEN, BLUME, BRAUCHEN, BUNT, DÜRR, DÜSTER, EINSINGEN, EINTANZEN, EINWIEGEN, ERREICHEN, FASSEN, FEIN, FÜHREN, GESICHT, GESTALT, GEWALT, GEWAND, GRAU, GRAUSEN, GÜLDEN, HOF, HÖREN, KOMMEN, KRONE, LEIDS, LIEB, LIEBEN, MUTTER, MÜHE, NACHT, NEBELSTREIF, NOT, NÄCHTLICH, ORT, REIHEN, REIZEN, RUHIG, SCHEINEN, SCHWEIF, SPIEL, SPIELEN, STRAND, SÄUSELN, TOT, TUN, VERSPRECHEN, WARM, WARTEN, WEIDE, WILLIG, ÄCHZEN
In the given case, A and B co-occur in x = 2 sentences; the more extreme cases have them co-occurring in x = 3 or x = 4 of the S = 20 sentences.

Since there are S sentences, the number of possibilities to place word A in M of the S sentences is \binom{S}{M}, and to place word B in n sentences is \binom{S}{n}. Hence the number of possibilities of placing word A in M sentences and word B in n sentences is \binom{S}{M}\binom{S}{n}. Now consider the number of possibilities that the words A and B co-occur (coincide) in exactly x sentences. First we place word A in M sentences (this can be done in \binom{S}{M} ways), and then from those M sentences we choose x sentences to place word B in (there are \binom{M}{x} ways to do this). There remain S − M sentences in which the other n − x occurrences of word B can be placed, and this can be done in \binom{S−M}{n−x} ways. Putting all these combinatorial possibilities together, we obtain

\binom{S}{M}\binom{M}{x}\binom{S−M}{n−x} / [\binom{S}{M}\binom{S}{n}].

We eliminate the common factor in the numerator and denominator and obtain

P_x = \binom{M}{x}\binom{S−M}{n−x} / \binom{S}{n},   x = 0, 1, 2, . . . , min{M, n},
(8.1)
i.e. the probability of the co-occurrence of two words is given by the hypergeometric probability distribution. An association of two words is present only if the two words occur together more often than expected, the expectation being E(X) = Mn/S. If this condition is fulfilled, one computes the probability

P(X ≥ x_c) = ∑_{x=x_c}^{min{M,n}} \binom{M}{x}\binom{S−M}{n−x} / \binom{S}{n}
(8.2)
where x_c is the number of observed co-occurrences of words A and B. If the computed value (8.2) is less than 0.05, we consider the association significant. For illustration let us compute the associativity of "Kind" and "Arm". We have 32 sentences (lines), in which "Kind" occurs M = 5 times and "Arm" occurs n = 3 times. They occur together in x_c = 2 lines. Hence we compute according to (8.2)

P(X ≥ 2) = ∑_{x=2}^{3} \binom{5}{x}\binom{27}{3−x} / \binom{32}{3} = \binom{5}{2}\binom{27}{1}/\binom{32}{3} + \binom{5}{3}\binom{27}{0}/\binom{32}{3} = 0.0544 + 0.0020 = 0.0564.
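A minimal sketch of how (8.1)–(8.2) can be evaluated; it reproduces the value 0.0564 computed above. The function name is, of course, only illustrative.

```python
# Sketch of (8.1)-(8.2): probability of the observed or a more extreme co-occurrence.
from math import comb

def cooccurrence_p(S, M, n, xc):
    """S sentences; word A occurs in M sentences, word B in n, together in xc."""
    denom = comb(S, n)
    return sum(comb(M, x) * comb(S - M, n - x) for x in range(xc, min(M, n) + 1)) / denom

# "Kind" (M = 5) and "Arm" (n = 3) in S = 32 lines, together in xc = 2 lines:
p = cooccurrence_p(32, 5, 3, 2)     # approx. 0.0564, hence not significant at 0.05
```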
Since this result is greater than 0.05, there is no significant association between “Kind” and “Arm”. This is because in the second line there is “Knabe” instead of “Kind”. Though they refer to the same person we ignore this. In order to illustrate the output, Table 8.3 shows the order in which the lemmas occur, then the lemmas themselves and the number of sentences (lines) in which the given lemma occurs (S). Table 8.3: Lemmas of Erlkönig and the number of lines in which they occur No. Word
S No. Word
S No. Word
S No. Word
S
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
2 1 2 6 5 2 3 1 1 2 3 1 1 3 5 1 1
1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
reiten Nacht Wind Vater Kind Knabe Arm fassen warm halten Sohn bergen Gesicht sehen Erlkönig Krone Schweif
18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
Nebelstreif lieb kommen gehen schön Spiel spielen bunt Blume Strand Mutter gülden gewand hören versprechen ruhig bleiben
35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
dürr Blatt säuseln fein Tochter warten führen nächtlich Reihen einwiegen eintanzen einsingen düster Ort scheinen alt Weide
52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
grau lieben reizen Gestalt willig brauchen Gewalt anfassen Leids tun grausen ächzen erreichen Hof Mühe Not tot
Table 8.4 is organized as follows: in the head-line the rank number is given, with the word in brackets. There are four columns for each word (lemma): the first contains the rank number of the associated word; the second the number of times the word in the head-line co-occurs with the word given in the line (shown in brackets in the fourth column); the probability of the given or a more extreme co-occurrence (association) is given in the third column. For example: the word "reiten" (occurring twice, see Table 8.3) co-occurs with the word "Nacht" once; the probability of this co-occurrence, computed according to (8.2), is 0.06250. As can be seen, many words occurring exactly once have a significant co-occurrence with words also occurring only once if they co-occur in the same line. Hence one could construct alternative graphs by:

1. taking into account only words occurring 2 or more times,
2. taking a significance level different from 0.05,
3. changing these strategies according to the length of the text,
4. taking such a level that no lemma is isolated,
5. taking the lowest level at which all lemmas are joined,
6. considering the graph with binary weights (joined/not joined), or considering the weights given by the respective probabilities, i.e. treating the result as a weighted (fuzzy) graph.

The results are so manifold that we can only give some preliminary hints and show the results using one single text. It is to be noted that the same problem exists with the denotative analysis, which yields "denser" graphs because there not only the lemmas but also their references are taken into account (cf. Ziegler & Altmann 2002).

Table 8.4: The probabilities (8.2) of co-occurrences of autosemantics in Erlkönig (output)

1 (reiten) with: 2 3 4 62
1 1 1 1
0.0625 0.12298 0.34476 0.0625
4 (Vater) with
2 (Nacht) with:
3 (Wind) with
(Nacht) 1 1 0.0625 (reiten) 1 1 (Wind) 3 1 0.0625 (Wind) 2 1 (Vater) 35 1 (grausen) 36 1 5 (Kind) with
0.12298 0.0625 0.0625 0.0625
(reiten) (Nacht) (dürr) (Blatt)
6 (Knabe) with (continued on next page)
Table 8.4 (continued from previous page)
1 5 14 15 31 59 62
1 1 2 1 1 1 1
0.34476 0.67335 0.08266 0.67335 0.18759 0.1875 0.1875
(reiten) (Kind) (sehen (Erlkönig) (hören) (anfassen) (grausen)
7 (Arm) with 5 6 10 63 68
2 1 1 1 1
1 2 1 1 1 1 1 1 1 1
0.67333 0.05645 0.29234 0.15625 0.15625 0.29234 0.15625 0.15625 0.15625 0.15625
(Vater (Arm) (halten) (lieb) (kommen (gehen) (ruhig) (bleiben) (ächzen) (tot)
8 (fassen) with
9 (warm) with 8 1 0.03125 (fassen) 10 1 0.0625 (halten)
10 (halten) with:
11 11 (Sohn) with:
12 (bergen) with:
5 7 8 9 63
12 13 14 18
0.29233 0.18145 0.0625 0.0625 0.0625
(Kind) (Knabe) (halten) (ächzen) (tot)
7 1 0.18145 (Arm) 21 1 0.12298 (gehen) 38 1 0.0625 (fein)
9 1 0.03125 (warm) 10 1 0.0625 (halten)
1 1 1 1 1
0.05645 9.18145 0.18145 0.09375 0.09375
4 7 10 19 20 21 33 34 63 68
(Kind) (Arm) (fassen) (warm) (ächzen
1 1 1 1
0.09375 0.09375 0.26331 0.09375
(bergen 11 1 0.09375 (Sohn) (Gesicht) 13 1 0.03125 (Gesicht) (sehen) (Nebelstreif)
13 (Gesicht) with:
14 (sehen) with:
15 (Erlkönig) with:
11 1 0.09375 (Sohn) 12 1 0.03125 (bergen)
4 2 0.08266 (Vater) 11 1 0.26331 (Sohn) 15 1 0.41027 (Erlkönig)
4 14 16 17 32 39 47 48 60 61
1 0.67333 (Vater) 1 0.41028 (sehen) 1 0.15625 (Krone) 1 0.15625 (Schweif) 1 0.15825 (verprechen) 1 0.41028 (Tochter 1 0.15625 (düster) 1 0.15625 (Ort) 1 0.15625 (Leids) 1 0.15625 (tun) (continued on next page)
Table 8.4 (continued from previous page) 16 (Krone) with:
17 (Schweif) with:
18 (Nebelstreif) with:
15 1 0.15625 (Erlkönig) 15 1 0.15625 (Erlkönig) 11 1 0.09375 (Sohn) 17 1 0.03125 (Schweif) 16 1 0.03125 (Krone) 19 (lieb) with:
20 (kommen) with:
21 (gehen) with:
5 1 0.15625 (Kind) 5 1 0.15625 (Kind) 20 1 0.03125 (kommen) 19 1 0.03125 (lieb) 21 1 0.0625 (gehen) 21 1 0.0625 (gehen)
5 6 19 20 38
22 (schön) with:
23 (Spiel) with
24 (spielen) with:
23 24 53 54 55
22 1 0.0625 (schön) 24 1 0.03125 (spielen)
22 1 0.0625 (schön) 23 1 0.03125 (Spiel)
25 (bunt) with:
26 (Blume) with:
27 (Strand) with:
26 1 0.03125 (Blume) 27 1 (Strand)
25 1 0.03125 (bunt) 27 1 0.01325 (Strand)
25 1 0.03125 (bunt) 26 1 0.03125 (Blume)
28 (Mutter) with:
29 (gülden) with
30 (Gewand) with
1 1 1 1 1
0.0625 0.0625 0.0625 0.0625 0.0625
(Spiel) (spielen) (lieben) (reizen) (Gestalt)
1 1 1 1 1
0.29233 0.12298 0.0625 0.0625 0.0625
(Kind) (Knabe) (lieb) (kommen) (fein)
29 1 0.03125 (gülden) 28 1 0.03125 (Mutter) 28 1 0.03125 (Mutter) 30 1 0.03125 (Gewand) 30 1 0.03125 (Gewand) 29 1 0.03125 (gülden) 31 (hören) with: 4 1 0.1875
(Vater)
32 (versprechen) with:
33 (ruhig) with:
15 1 0.15625 (Erlkönig)
34 (bleiben) with:
35 (dürr) with
5 1 0.15625 (Kind) 34 1 0.03125 (bleiben) 36 (Blatt) with:
5 1 0.15625 (Kind) 33 1 0.03125 (ruhig)
3 1 0.0625 (Wind) 36 1 0.03125 (Blatt) 37 1 0.03125 (säuseln)
3 1 0.0625 (Wind) 35 1 0.03125 (dürr) 37 1 0.03125 (säuseln)
37 (säuseln) with:
38 (fein) with
39 (Tochter) with: (continued on next page)
Table 8.4 (continued from previous page) 3 1 0.0625 (Wind) 35 1 0.03125 (dürr) 36 1 0.03125 (Blatt)
6 1 0.0625 (Knabe) 21 1 0.0625 (gehen)
15 40 41 42 43 47 48
1 1 1 1 1 1 1
0.41027 0.09375 0.09375 0.09375 0.09375 0.09375 0.09375
(Erlkönig) (warten) (führen) (nächtlich) (Reihen) (düster) (Ort)
40 (warten) with:
41 (führen) with:
42 (nächtlich) with;
39 1 0.09375 (Tochter)
39 1 0.09375 (Tochter) 39 1 0.09375 (Tochter) 42 1 0.03125 (nächtlich) 41 1 0.03125 (führen) 43 1 0.03125 (Reihen) 43 1 0.03125 (Reihen)
43 (Reihen) with:
44 (einwiegen) with:
45 (eintanzen) with:
39 1 0.09375 (Tochter) 45 1 0.03125 (eintanzen) 44 1 0.03125 (einwiegen) 41 1 0.03125 (führen) 46 1 0.03125 (einsingen) 46 1 0.03125 (einsingen) 42 1 0.03125 (nächtlich) 46 (einsingen) with:
47 (düster) with:
48 (Ort) with:
44 1 0.03125 (einwiegen) 15 1 0.15625 (Erlkönig) 15 1 0.15625 (Erlkönig) 45 1 0.03125 (eintanzen) 39 1 0.09375 (Tochter) 39 1 0.09375 (Tochter) 48 1 0.03125 (Ort) 47 1 0.03125 (düster) 49 (scheinen) with:
50 (alt) with:
51 (Weide) with:
50 1 0.03125 (alt) 51 1 0.03125 (Weide) 52 1 0.03125 (grau)
49 1 0.03125 (scheinen) 49 1 0.03125 (scheinen) 51 1 0.03125 (Weide) 50 1 0.03125 (alt) 52 1 0.03125 (grau) 52 1 0.03125 (grau)
52 (grau) with:
53 (lieben) with:
54 (reizen) with
49 1 0.03125 (scheinen) 50 1 0.03125 (alt) 51 1 0.03125 (Weide)
22 1 0.0625 (schön) 54 1 0.03125 (reizen) 55 1 0.03125 (Gestalt)
22 1 0.0635 (schön) 53 1 0.03125 (lieben) 55 1 0.03125 (Gestalt)
55 (Gestalt) with:
56 (willig) with:
57 (brauchen) with:
22 1 0.0625 (schön) 53 1 0.03125 (lieben) 54 1 0.03125 (reizen)
57 1 0.03125 (brauchen) 56 1 0.03125 (willig) 58 1 0.03125 (Gewalt) 58 1 0.03125 (Gewalt) (continued on next page)
Table 8.4 (continued from previous page) 58 (Gewalt) with: 56 1 0.03125 (willig) 57 1 0.03125 (brauchen) 61 (tun) with: 15 1 0.15625 (Erlkönig) 60 1 0.03125 (Leids)
59 (anfassen) with: 4 1 0.1875
(Vater)
62 (grausen) with: 1 1 0.0625 4 1 0.1875
(reiten) (Vater)
60 (Leids) with: 15 1 0.15625 (Erlkönig) 61 1 0.03125 (tun) 63 (ächzen) with: 5 1 0.15625 (Kind) 7 1 0.09375 (Arm 10 1 0.0625 (halten)
64 (erreichen) with:
65 (Hof) with
65 1 0.03125 (Hof) 66 1 0.03125 (Mühe) 67 1 0.03125 (Not)
64 1 0.03125 (erreichen) 64 1 0.03125 (erreichen) 66 1 0.03125 (Mühe) 65 1 0.03125 (Hof) 67 1 0.03125 (Not) 67 1 0.03125 (Not)
67 (Not) with:
68 (tot) with:
64 1 0.03125 (erreichen) 65 1 0.03125 (Hof) 66 1 0.03125 (Mühe)
66 (Mühe) with:
5 1 0.15625 (Kind) 7 1 0.09375 (Arm)

8.3 The construction of a graph
Our aim is very simple: we want to show that texts are autosemantically structured and that this structure has characteristic properties. The graph can be constructed very simply: take, step by step, all lemma pairs whose co-occurrence probability is smaller than 0.05 and join them with an edge. An alternative procedure is to set up a matrix with all lemmas having at least one significant association and to mark each significant association with 1, all others with 0. For short texts the first method is simpler; for long texts both methods lead to confusing structures. However, for any further computations the coincidence matrix is used, because it can be processed with any software. Let us begin with the level 0.05. Looking in Table 8.4 we find only a few associated words, which are displayed as sets in the following table:
{fassen, warm}, {lieb, kommen}, {bergen, Gesicht}, {Spiel, spielen}, {düster, Ort}, {ruhig, bleiben},
{Krone, Schweif}, {dürr, säuseln, Blatt}, {Mutter, gülden, Gewand}, {bunt, Blume, Strand}, {führen, nächtlich, Reihen}, {Leids, tun},
{lieben, reizen, Gestalt}, {willig, brauchen, Gewalt}, {erreichen, Hof, Mühe, Not}, {scheinen, alt, grau, Weide}, {einsingen, eintanzen, einwiegen}.
45 words are connected with some other word(s), 23 words are isolated. The sets whose elements are joined by edges represent the components of a graph, but all of them are separated; they do not show a compact structure. The respective matrix, on the other hand, is too large to be presented formally and is occupied by only a few 1 symbols. In longer texts the graph would be better connected; let us call this level the α-level. The next level, which we call the β-level, can be constructed in such a way that one determines a probability level at which there is no isolated vertex (word), i.e. each word is connected with at least one other. Looking in Table 8.4 we see that the β-level is β = 0.1875, or rather 0.19. If we connect all words whose co-occurrence probability is less than 0.19, we obtain the graph in Figure 8.1.

Figure 8.1: The β-graph of Erlkönig (β = 0.19)
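A sketch of the construction just described. The networkx library is used here merely for convenience and is not the program referred to in the text; the small pair_probs example contains three probabilities taken from Table 8.4.

```python
# Sketch: build the graph of significantly associated lemmas and inspect its
# components at a given probability level.
import networkx as nx

def association_graph(pair_probs, level):
    """pair_probs: dict {(lemma1, lemma2): P(X >= xc)}, e.g. from Table 8.4."""
    g = nx.Graph()
    for (u, v), p in pair_probs.items():
        if p < level:                 # keep only significant associations
            g.add_edge(u, v)
    return g

pair_probs = {("fassen", "warm"): 0.03125,     # example values from Table 8.4
              ("Krone", "Schweif"): 0.03125,
              ("Kind", "Arm"): 0.0564}
g_alpha = association_graph(pair_probs, 0.05)  # alpha-level graph
g_beta = association_graph(pair_probs, 0.19)   # beta-level graph
components_beta = list(nx.connected_components(g_beta))
```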
We see that there are seven components of the graph at this level. Two of them, namely no. 5 and no. 7, are cliques in which each element is connected with all the others; they are isolated from the rest of the graph. In order to get all vertices connected, one seeks in each component the lowest level at which some element is connected with an element of another component. One can easily find that component 4 can be connected with component 1 at β = 0.26331 (Sohn : sehen); components 3 and 4 at β = 0.41027 (Erlkönig : sehen); and components 1 and 2 at β = 0.67335 (Vater : Kind). The new situation, the γ-graph, is displayed in Figure 8.2. The graph remains disconnected because it is not possible to reach each vertex from every other vertex. This is a usual property of poetic texts, which abound in disconnected images and consist of short sentences. Connectivity has something in common with thematic concentration; however, here not only the frequency is important but also the embedding of words in the network of the text.
Figure 8.2: The disconnected γ-graph of Erlkönig
The edges of the graph can be unweighted, i.e. all having weight 1, or weighted by the co-occurrence probability or some function of it. In the latter case we obtain a fuzzy or weighted graph, whose evaluation is slightly different. The text graphs examined by us have three componential levels:

1. The level of the strongest associations (α-level), at which there is a number of isolated vertices and the rest are cliques; in Erlkönig it was the level 0.03125, but we used 0.05.
2. The β-level, at which there are no isolated vertices but several components; in Erlkönig we found 0.19.
3. The γ-level, at which all vertices are connected. However, in short texts it can happen that this level does not exist. In Erlkönig we found 0.67335, at which four components still existed. The number of components (related to the number of lemmas in the text) at this level shows the existence of isolated images in the text which are not central for the theme. It must be scrutinized how this phenomenon is realized in texts with longer sentences.

8.4 Degrees
Considering the graph in Figure 8.1, i.e. the β-graph, we ascribe to each word a degree according to the number of adjacent edges (coincident autosemantics). Since the edges are not oriented, the "outdegree" equals the "indegree". We obtain the result presented in Table 8.5, showing the entrenchment of the individual autosemantics in the poem.

Table 8.5: Degrees of autosemantics

Erlkönig 7, Wind 6, Tochter 6, Kind 5, Vater 4, Arm 4, halten 4, schön 4, säuseln 3, dürr 3, Blatt 3, Sohn 3, Gestalt 3, grausen 2, ruhig 2, bleiben 2, kommen 2, ächzen 2, Gesicht 2, bergen 2, Schweif 2, Krone 2, Ort 2, düster 2, Spiel 2, reizen 2, bunt 2, Blume 2, Strand 2, spielen 2, Mutter 2, gülden 2, Gewand 2, reiten 1, hören 1, sehen 1, warm 1, fassen 1, Knabe 1, Nebelstreif 1, lieben 1, Leids 1, versprechen 1, warten 1, führen 1, nächtlich 1, Reih 1
Usually the degrees in a network are distributed in some special way characterizing the state of the network. However, in our case the distribution is not sufficiently pronounced because of the small number of individual repetitions. It could be – as usual – captured by the right truncated zeta (Zipf) distribution (X² = 3.78, DF = 44, P ≈ 1.00, a = 0.5059) or by the geometric distribution (X² = 11.80, DF = 37, P ≈ 1.00, p = 0.0541), both having only one parameter. However, this would be premature; many other texts must be analysed in order to get a more adequate picture. The same holds for the frequency spectrum of the degrees. In this small sample it is preliminarily sufficient to say that the spectrum follows the Poisson distribution (X² = 5.90, DF = 3, P = 0.12, a = 1.3314), but further investigation will surely require a generalization (cf. Chapter 9). It is to be noted that pure autosemantics need not behave like hrebs, for which a modified binomial distribution has been set up (Ziegler & Altmann 2002: 106). Nevertheless, the degrees can be used for the characterization of the connotative concentration of the text. We suppose that words having a significant coincidence bear some mutual association, expressed by an edge between the two vertices. In our graph (Figure 8.1) the number of edges is m = 56. There are n = 47 vertices, hence the number of possible edges between them is n(n − 1)/2. Dividing m by this number we obtain the mean autosemantic degree of the text, which can be called the connotative concentration (CC). Here we obtain

d_rel(G) = 2m / (n(n − 1)) = CC,

or in our case

CC = 2(56) / (47·46) = 0.05.

This is a quite usual proportion which can be further processed statistically. The connectivity of a graph, which in our case can be interpreted as the autosemantic coherence of the text, can be evaluated by means of the number of components (k) of the graph. As can be seen in the β-graph (Figure 8.1), there are seven components which are not connected with one another; in the γ-graph (Figure 8.2) there are only four components. If a graph is completely disconnected, it has n components (= number of vertices); if it is completely connected, i.e. if there is a path from each vertex to every other, there is only one component. Hence the relative connectivity of the graph can be measured by

C(G) = (n − k) / (n − 1).

If the graph is completely connected (i.e. k = 1), then C(G) = 1; if it is completely disconnected, C(G) = 0, because n − n = 0. In our case we have for the β-graph C(G) = (47 − 7)/(47 − 1) = 0.87, and for the γ-graph (Figure 8.2) C(G) = 0.93. Various other characteristics from graph theory can be used to describe the network structure of a text using different units, but many of them are not associated with word frequency. We recommend having recourse to the respective literature (cf. West 2001; Skorochod'ko 1981; Ziegler & Altmann 2002).
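The two indices can be computed directly; the sketch below uses the Erlkönig values m = 56, n = 47 and the component counts given above.

```python
# Sketch of the two graph indices used above: connotative concentration (CC)
# and relative connectivity C(G).
def connotative_concentration(m, n):
    return 2 * m / (n * (n - 1))          # mean autosemantic degree

def relative_connectivity(n, k):
    return (n - k) / (n - 1)              # 1 = fully connected, 0 = fully disconnected

cc = connotative_concentration(56, 47)        # approx. 0.05 for Erlkönig
c_beta = relative_connectivity(47, 7)         # approx. 0.87 (seven components)
c_gamma = relative_connectivity(47, 4)        # approx. 0.93 (four components)
```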
9 Distribution models

9.1 General theory
The study of the frequency distribution of words began probably with Estoup (1916), but it was G.K. Zipf (1935, 1949) who not only helped it to world fame but also exported it from linguistics to other domains of science. Today there is no scientific discipline in which the famous Zipf law is not known. The fact that its simplest form is identical with the power law, which can be found under different names in biology, chaos theory, self-organized criticality, etc., supported its dissemination. Zipf assumed that frequency × rank = constant. He himself slightly modified the original result leading to the harmonic series, added a parameter and arrived at Riemann's zeta function, which can be normalized and, if necessary, truncated on the right, resulting in the zeta distribution (known also as the discrete Pareto distribution, Joos model, Riemann zeta distribution, Zipf-Estoup law, Zipf law) or the right truncated zeta distribution, which can be written as

P_x = x^{−a} / F(R),   x = 1, 2, . . . , R,     (9.1)

where a is a parameter and F(R) is the normalizing constant, F(R) = ∑_{i=1}^{R} i^{−a}.
If R → ∞ and a > 1, we obtain the Zeta distribution, the original Zipf law. The formula holds true in many cases but, as usual, with the progress of science researchers try to find the weak points of any approach. Perhaps the first who tried to find a more general approach was B. Mandelbrot who in the fifties of the past century brought a theoretical derivation of what is today known as Zipf-Mandelbrot law. Other researchers like G.A. Miller (1957) and W. Li (1992) showed that Mandelbrot’s generalization can be obtained from random events. Since the fifties, there was an enormous flourishing of approaches, assumptions, modifications, derivations from different conditions etc.; the references alone would fill a book. However, all researchers accepted the existence of a law behind this phenomenon adhering to M. Bunge’s dictum “Everything abides by laws” (1977: 16) or “Any fact fits a set of laws” (1967: 295, 372). The law itself has been vehemently criticized only by
G. Herdan (1956) who considered rank a secondary, artificial variable generated by frequencies themselves. He forgot that any variable in any science is an “artificial”, conceptual construct and its establishing takes place with the aim to capture a part of the reality by the limited net of our thought. Since the distribution of word frequencies is only one of the many aspects of word frequency, and since the results of previous research are so numerous, we restrict ourselves to a general, implicit evolutionary, approach which encompasses different other linguistic phenomena and other kinds of modelling. Consider the language at the beginning of its rise, in a state that can be approximated by the language of babies. The “words” are monosyllabic, the “sentences” are monolexemic and the same word or sentence is repeated until the next one is learned. Let us restrict ourselves to words. The first learned (created) word constitutes 100% of utterances until the next word is learned. During this period the probability P1 that the first word occurs in any utterance is 1. Then the second word takes part of the proportion of the first word, i.e. P1 diminishes and becomes less than 1. The probability P2 of the second word is proportional to that of the first, i.e. P2 = kP1 , where k is a proportionality constant. Then the third word is learned (created) and the probabilities must be slightly modified because the third word takes a portion from the two previous ones. In a very simple and idealized case we would get P3 = kP2 . Continuing in this way, we obtain in general Px = kPx−1 .
(9.2)
Solving this simple difference equation for x = 2, 3, . . . we get

P_x = k^{x−1} P_1.
(9.3)
The function P_x converges to 0 only if 0 < k < 1. Writing, as is usual, k = q and p = 1 − q, and summing so that ∑_{x=1}^{∞} P_x = 1, we finally get

P_x = p q^{x−1},   x = 1, 2, . . .
(9.4)
which represents the 1-displaced geometric distribution, used long ago by Sigurd (1968) for capturing the rank frequency distribution of phonemes. However, the situation is not that simple; the proportionality is not always constant, hence we set up

P_x = g(x) P_{x−1}
(9.5)
and are forced to find appropriate functions g(x) (cf. Altmann & Köhler 1996). For example, setting g(x) = a/x one obtains the Poisson distribution. But language continued to develop, and g(x) could take many different shapes. It would also be naive to expect that all word frequencies in all languages behave in the same way. Of course, we assume so, but since we do not know all the boundary conditions by which a distribution is shaped, we cannot expect that one unique model will be able to do justice to all of them. Our knowledge concerning boundary conditions is very limited and we shall never know them all. From empirical observations we already know the following universal phenomenon: the greater x (e.g. the rank), the smaller the difference between neighbouring probabilities.¹ Hence it is equally reasonable to consider the relative difference between two neighbouring probabilities and set

∆P_{x−1} / P_{x−1} = g(x)
(9.6)
which is analogous to the continuous case

dy/y = g(x) dx.
(9.7)
Now, in simple cases the function g(x) equals a/x. In this case, (9.7) results in the power law, to which a constant noise c can be added, and this results, for example, in Menzerath's law. If another independent variable is present whose effect disappears with increasing x, one can insert it, e.g. as b/x²; this results, for example, in the Naranan-Balasubrahmanyan law, etc. Thus, writing the parameters in a uniform way, we obtain the first generalization as

∆P_{x−1} / P_{x−1} = a0 + a1/x + a2/x².     (9.8)

This series can be continued, but Wimmer and Altmann (2005, 2006) proposed another generalization, based on the fact that x need not be scaled in the same way as y, and obtained an ample family of distributions defined by

∆P_{x−1} / P_{x−1} = a0 + a1/(x + b1)^{c1} + a2/(x + b2)^{c2},
(9.9)
1. In theory, this holds true automatically for non-truncated distributions because P(n) must converge to zero with increasing n.
or equivalently,

(1/y) dy/dx = a0 + a1/(x + b1)^{c1} + a2/(x + b2)^{c2}.
(9.10)
The solutions of (9.10) can be discretized appropriately, and they can even be transformed into one another by suitable transformations (cf. Mačutek 2006; Mačutek & Altmann 2007a,b). Up to now the full solution of (9.9) or (9.10) has not been necessary, since the majority of distributions used in linguistics are special cases of (9.9) and (9.10).

9.2 Special cases
Formula (9.9) can be written explicitly as

∆P_{x−1} / P_{x−1} = (P_x − P_{x−1}) / P_{x−1} = g(x)     (9.11)

and, after inserting g(x), as

P_x = (1 + a0 + a1/(x + b1)^{c1} + a2/(x + b2)^{c2}) P_{x−1}.     (9.12)
The solution of (9.12) yields a very complex formula, but the individual special cases are well known from distribution theory, word length theory and word frequency theory. Above we have already shown the geometric distribution, which follows from the substitution a0 = −p, a1 = a2 = 0, yielding P_x = (1 − p)P_{x−1}. Setting 1 − p = q we obtain the solution P_x = pq^x, x = 0, 1, 2, . . . or P_x = pq^{x−1}, x = 1, 2, 3, . . . . The Poisson distribution can be obtained from the substitution a0 = −1, a2 = b1 = 0, c1 = 1, etc. A survey of distributions used in our domain is given in Table 9.1, where always q = 1 − p and 0 < p < 1. If a2 = 0, the other parameters (b2 and c2) are not given. If the support is x = 0, 1, 2, . . ., then the distribution can either be displaced by one step to the right or truncated at x = 1, or solved directly for the given support.² The approach allows a much greater number of special cases, but up to now it has not been necessary to derive them. Another class of frequently employed distributions can be derived from (9.10); they are shown in Table 9.2. Here the equation is continuous, but the result is discretized for x = 1, 2, . . . and normalized.
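As a small illustration of how the recurrence (9.12) generates its special cases, the sketch below builds a distribution numerically from the recurrence; the results can be checked against the closed forms in Table 9.1. The finite support and the particular parameter values (p = 0.3, a = 2) are arbitrary choices for the example.

```python
# Numerical illustration of the recurrence (9.12):
#   P_x = (1 + a0 + a1/(x+b1)**c1 + a2/(x+b2)**c2) * P_{x-1}
def from_recurrence(a0, a1, b1, c1, a2=0.0, b2=0.0, c2=1.0, support=200):
    p = [1.0]                                          # unnormalized P_1
    for x in range(2, support + 1):
        g = 1.0 + a0 + a1 / (x + b1) ** c1 + a2 / (x + b2) ** c2
        p.append(p[-1] * g)
    total = sum(p)
    return [v / total for v in p]                      # normalized on x = 1..support

# Geometric: a0 = -p, a1 = a2 = 0, hence P_x proportional to q**(x-1) with q = 1-p
geo = from_recurrence(a0=-0.3, a1=0.0, b1=0.0, c1=1.0)
# Poisson truncated at x = 1: a0 = -1, a1 = a, b1 = 0, c1 = 1, hence P_x prop. to a**x / x!
poi = from_recurrence(a0=-1.0, a1=2.0, b1=0.0, c1=1.0)
```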
[D]: x = 1, 2, 3, . . .
Table 9.1: Survey of word frequency distributions resulting from (9.12) Geometric d.
pqx ax e−a x! n x n−x x p q aΓ(x − a) Γ(1 − a)Γ(x + 1) 1 [Ψ(R + 1) − Ψ(1)]x 6 π xx2 a P 0 (x!)b 1 x(x + 1) b (x + b − 1)(x + b) n(x) b b + n (b + n + 1)(x) b n(x) b + n (b + n + 1)(x) ax (x) b1 F1 (1; b; a) k+x−1 x q x m+x−1 2 F1 (k, 1; m; q) x M+x−1 K −M+n−x−1 x n− x K +n−1 n
Poisson d. Binomial d. Rouault d. Estoup d. Lotka d. Conway-MaxwellPoisson d. Simon d. Johnson-Kotz d. Waring d. Yule d. Hyperpoisson d.
Hyperpascal d.
Neg. hypergeom. d.
a0 [B]
a1
−p
[B] −1 p [A] −1 − q
b1
c1
0
a2
b2 c2 0
a>0 p (n + 1) q
0 1 0 1
0 0
[D]
0
−a − 1
0 1
0
[C]
0
−1
0 1
0
[D]
0
−2
0 1
1 0 2
[B]
−1
a
0 b
0
[D]
0
−2
1 1
0
[D]
0
−2
b 1
0
[B]
0
−b − 1
b+n 1
0
[B]
0
−b − 1
b+1 1
0
[B]
−1
a>0
b−1 1
0
[B]
−p
q(k − m)
m−1 1
0
−(K − M + n) 1
(M − 1)(n + 1) (K − M + n)
[A]
0
[(−K + M + 1)· (K + n − 1)] (−K + M − n)
0 1
131
Result Px
Special cases
Distribution
132
Distribution models
b^(x) = b(b + 1) · · · (b + x − 1)  (ascending factorial function)
1F1(1; b; a) = 1 + a/b + a²/(b(b + 1)) + · · ·  (confluent hypergeometric function)
2F1(k, 1; m; q) = 1 + k^(1) 1^(1) q / (m^(1) 1!) + k^(2) 1^(2) q² / (m^(2) 2!) + · · ·  (hypergeometric function)
Table 9.2: Some discretized distributions using (9.10) (support is x = 1, 2, 3, . . . in all cases)

Distribution                    Result P_x                a0     a1     b1   c1   a2        b2   c2
Lerch d.                        C p^x / (x + b)^c         ln p   -c     b    1    0
Good d.                         C p^x / x^a               ln p   -a     0    1    0
zeta d. (Zipf d.)               C / x^a                   0      -a     0    1    0
Zipf-Mandelbrot d.              C / (x + b)^a             0      -a     b    1    0
Orlov d.                        C x / (x - a)^(b+1)       0      1      0    1    -(b+1)    -a   1
Naranan-Balasubrahmanyan d.     C e^(ax) x^b e^(-c/x)     a      b      0    1    c         0    2
The normalizing constants C in Table 9.2 must be ascertained by discrete summation. Many other distributions – not used in word frequency research – derivable from this approach can be found e.g. in Johnson & Kotz (1970) or Wimmer & Altmann (1999a), and still other ones used in this research can be found in Baayen (2001). In the case of a purely inductive approach it is possible to find an appropriate model as follows. One considers the cumulative relative frequencies F(x) and finds a monotonously increasing (continuously differentiable) function adequately capturing the course of the data. Since

F(x) = ∫_{−∞}^{x} f(t) dt,

the derivative of F(x), F′(x) = f(x) – discretized and normalized – yields the first approximation to the rank frequency distribution or the frequency spectrum.
However, even the most "discrete" distributions can be derived from the general theory using a small modification, namely adding an exponent to the proportionality function in (9.12) and setting the "inner" exponents equal to 1, i.e.

P_x = (1 + a0 + a1/(x + b1) + a2/(x + b2))^c P_{x−1}.     (9.13)

For example, the famous Zipf-Mandelbrot distribution is a solution of (9.13) with a0 = a2 = 0 and a1 = −1. Since in almost all instances the zeta distribution (a0 = a2 = 0, a1 = −1, b1 = 0) is an adequate model, the formula need not be generalized any further. If one finds data that are not in agreement with the zeta distribution, one can choose an adequate model from the battery of distributions mentioned above. Our philosophy of distributions in linguistics necessarily differs from the monolithic view that we have been familiar with for many years (cf. e.g. Orlov, Boroda & Nadarejšvili 1982). Although we believe in the common background mechanism generating our data, we do not believe that the time will come when it will be possible to know all boundary conditions present at the moment of text creation. One will never be able to capture all texts with only one distribution, though the above mentioned ones are excellent candidates. Twenty languages are not enough to test such a hypothesis. Consequently, there is a field of attractors in which texts can be realized. Texts written by one author or in one language can wander from one attractor to another according to boundary conditions, furnishing not only different values of the given parameters but forcing us to apply different distributions. In general, such a field of attractors is represented by (9.9), (9.10) and (9.13), and the individual attractors are given in Tables 9.1 and 9.2. Not all of them are adequate to capture the rank-frequency or spectrum distributions of words; as a matter of fact, some of them do not even occur in this domain. We are, of course, far from being able to exactly predict the attractor given some knowledge of the language, the author, the historical epoch, the genre and the aim of the text. Nevertheless, the background model seems to be very stable, hence our confidence in the existence of laws in this domain is not unfounded.
9.3 Applications
In this section we shall check the productivity of the individual models applied to texts of different length in different languages. In Tables 9.3 to 9.5 we shall place the texts under the best fitting distribution and indicate some properties of the empirical data. Although in the majority of the above distributions the support is infinite, in some cases one obtains better fitting results if the distribution is truncated on the right side (cf. e.g. the Estoup distribution). For example, for the text R-01 the right truncated zeta resulted in X² = 69.06 and the zeta with infinite support in X² = 134.28, but the associated probability is P ≈ 1.0 for both. The difference of the chi-square values seems great, but because of the very high P it can be neglected. Since for some texts different distributions yield good results, we indicate other possibilities in the respective table (9.6 Variants). Many distributions in Table 9.1 have the support x = 0, 1, 2, . . . . The software at our disposal (Altmann-Fitter 1997) automatically performs a displacement of the theoretical distribution to x = 1, 2, 3, . . . , hence whenever these distributions are adequate, we have to do with 1-displaced cases. In Tables 9.3 to 9.6 we use the following abbreviations:

N – text length (number of words, or word forms)
V – vocabulary (number of different words, or word forms)
a, b, c, n, p, q, K, M – parameters
X² – the empirical chi-square value for the goodness-of-fit
DF – degrees of freedom
P – probability of error
also:
LT – left truncated
RT – right truncated
JoKo – Johnson-Kotz d.
NHG – Negative hypergeometric d.
ZiMa – Zipf-Mandelbrot d.
NB – Negative binomial d.
HyPo – Hyperpoisson d.
HyPa – Hyperpascal d.
Geo – Geometric d.
In practice, one is free to decide whether to choose the best fitting distribution or the distribution with the smallest number of parameters, adhering to Occam's razor. We decided on the former and show the other possibilities as long as their P > 0.05. Of course, there are a number of other distributions yielding a good fit, but we restrict the choice to those resulting from the general theory. The results allow one to proceed both to generalizations and to specifications concerning languages, genres, writers, etc.
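The following sketch illustrates the kind of fitting reported in the tables below: estimating the parameter a of the right truncated zeta distribution (9.1) by minimizing the chi-square statistic. It is only a minimal stand-in, not the Altmann-Fitter software actually used; the optimization bounds are assumptions of the example.

```python
# Sketch: fit the right truncated zeta distribution (9.1) to an observed
# rank-frequency vector by minimizing the chi-square statistic.
import numpy as np
from scipy.optimize import minimize_scalar

def rt_zeta(a, R):
    p = np.arange(1, R + 1, dtype=float) ** (-a)
    return p / p.sum()                           # (9.1) with F(R) = sum of i^-a

def chi_square(freqs, a):
    freqs = np.asarray(freqs, dtype=float)       # observed frequencies f(1), f(2), ...
    N, R = freqs.sum(), len(freqs)
    expected = N * rt_zeta(a, R)
    return ((freqs - expected) ** 2 / expected).sum()

def fit_rt_zeta(freqs):
    res = minimize_scalar(lambda a: chi_square(freqs, a), bounds=(0.01, 3.0), method="bounded")
    return res.x, chi_square(freqs, res.x)       # estimated a and its chi-square

# a_hat, x2 = fit_rt_zeta(observed_rank_frequencies)
```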
Table 9.3: Zeta (Zipf) distribution or Right truncated zeta distribution (T in the first column means truncation) Text ID B-01 (T) B-04 (T) B-05 (T) B-06 (T) B-07 (T) B-08 (T) B-09 (T) B-10 (T) E-01 (T) E-03 (T) E-04 (T) E-06 (T) E-07 (T) E-08 (T) E-10 (T) G-01 (T) G-02 (T) G-03 (T) G-04 (T) G-08 (T) G-09 (T) G-10 (T) G-11 (T) G-13 (T) G-15 (T) In-01 (T) In-02 (T) In-03 (T) In-04 (T) Kn-01 (T) Kn-02 (T) Kn 20 (T) Kn 22 (T) Kn 30 (T)
N
V
761 483 406 687 557 268 550 556 2330 3247 4622 4862 5044 5083 6246 1095 845 500 545 965 653 480 468 460 593 376 373 347 343 3713 1455 4556 4554 4499
400 286 238 388 324 179 313 317 939 1001 1232 1176 995 985 1333 539 361 281 269 509 379 301 297 253 378 221 209 194 213 1664 790 1755 1764 2005
a 0.6785 0.6202 0.6381 0.6765 0.6101 0.5438 0.6472 0.6733 0.801 0.8719 0.8762 0.908 0.9312 0.9316 0.9103 0.7449 0.7067 0.6726 0.6918 0.6781 0.6461 0.5948 0.5714 0.6197 0.5726 0.6005 0.6095 0.5764 0.5012 0.7009 0.636 0.7219 0.7361 0.6892
X2
DF
P
25.19 313 ≈1 20.04 221 ≈1 22.04 183 ≈1 43.94 293 ≈1 22.54 255 ≈1 12.7 134 ≈1 28.69 242 ≈1 37.04 239 ≈1 95.82 734 ≈1 90.35 815 ≈1 66.88 1060 ≈1 57.26 1014 ≈1 123.01 920 ≈1 125.51 919 ≈1 98.35 1211 ≈1 50.16 406 ≈1 24.58 310 ≈1 22.26 214 ≈1 14.03 217 ≈1 37.21 397 ≈1 31.74 289 ≈1 21.63 230 ≈1 17.37 229 ≈1 13.81 205 ≈1 28.31 290 ≈1 13.65 171 ≈1 11.14 169 ≈1 9.27 163 ≈1 8.22 179 ≈1 109.03 1364 ≈1 41.67 628 ≈1 133.94 1528 ≈1 150.48 1500 ≈1 113.92 1667 ≈1 (continued on next page)
Table 9.3 (continued from previous page) Text ID Lk-04 (T) Lt-03 (T) Mq-03 (T) Mr-15 (T) Mr-18 (T) Mr-22 (T) Mr-25 (T) Mr-26 (T) Mr-30 (T) Mr-31 (T) Mr-33 (T) Mr-40 (T) Mr-43 (T) R-01 (T) R-02 (T) R-04 (T) R-06 (T) Ru-01 (T) Ru-02 (T) Ru-03 (T) Ru-04 (T) Ru-05 (T) Sl-03 (T) Sl-04 (T) Sl-05 (T) T-01 (T) T-02 (T)
N
V
219 4931 1509 4693 4062 4099 4205 4146 5054 5105 4339 5218 3356 1738 2279 1284 695 2595 17205 3853 753 6025 1966 3491 5588 1551 1827
116 2703 301 1947 1788 1703 2070 2038 2911 2617 2217 2877 1962 843 1179 729 432 1240 6073 1792 422 2536 907 1102 2223 611 720
a 0.6791 0.6382 0.9671 0.7096 0.7035 0.7298 0.672 0.6723 0.653 0.6632 0.6339 0.6442 0.5979 0.7058 0.7214 0.6492 0.6188 0.7429 0.8305 0.7501 0.6382 0.7688 0.7508 0.8819 0.7877 0.8648 0.8708
X2 5.89 167.57 34.23 123.82 117.21 125.37 137.79 135.42 211.2 199.54 122.31 246.97 145.15 69.06 129.61 55.28 47.42 117.29 655.53 228.8 28.68 287.39 86.43 99.93 225.85 158.8 202.94
DF
P
91 2107 279 1646 1474 1405 1658 1634 2279 2064 1812 2221 1536 665 883 558 324 945 4690 1365 328 1974 701 875 1744 458 535
≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1
Table 9.4: Zipf-Mandelbrot distribution (ranks) Text ID
N
V
B-02 B-03 E-02 E-05 E-09 E-11 E-12 E-13 G-16 G-17 Hw-01 Hw-02 Hw-03 Hw-04 Hw-05 Hw-06 I-02 I-05 In-05 Kn-02 Kn-18 Kn-19 Kn-23 Kn-31 Lk-01 Lk-02 Lk-03 Lt-04 M-01 M-02 M-03 M-04 M-05 Mq-01
352 515 2971 4760 5701 8193 9088 11625 518 225 282 1829 3507 7892 7620 12356 6064 1129 414 4508 4483 1787 4685 4672 345 1633 809 4285 2062 1187 1436 1409 3635 2330
201 285 1017 1495 1574 1669 1825 1659 292 124 104 256 521 744 680 1039 2203 512 188 1738 1782 833 1738 1920 174 479 272 1910 386 281 273 302 515 289
a 0.7664 0.7607 0.9053 0.9145 0.9286 0.9674 0.9677 0.9726 0.7301 0.7954 0.9508 1.6367 1.1948 1.5492 1.6543 1.5666 0.8675 0.8162 0.8489 0.8013 0.7672 0.7593 0.7961 0.7761 0.8272 0.9757 0.964 0.7847 1.0407 1.0093 1.1475 1.0524 1.2582 1.169
b 2.8559 3.3912 1.022 0.9744 0.7167 0.8569 0.5641 0.603 3.9353 3.6729 2.1267 11.8211 2.8117 9.7918 12.7691 10.7302 0.918 1.9965 3.7684 3.3666 1.6357 3.8747 2.0064 2.6645 2.0147 0.8636 1.1923 2.6075 1.1505 1.0102 2.3193 1.1534 4.5245 2.4403
X2
DF
P
20.93 153 ≈1 24.86 221 ≈1 132.25 794 ≈1 146.95 1178 ≈1 163.08 1275 ≈1 105.03 1471 ≈1 192.48 1590 ≈1 266.13 1655 ≈1 19.95 239 ≈1 6.62 98 ≈1 6.42 87 ≈1 34.77 226 ≈1 64.53 467 ≈1 178.31 634 ≈1 126.2 580 ≈1 351.37 869 ≈1 247.38 1892 ≈1 41.72 401 ≈1 14.08 155 ≈1 11.4 1453 ≈1 132.04 1488 ≈1 43.84 680 ≈1 115.78 1465 ≈1 129.64 1577 ≈1 30.86 142 ≈1 59.62 383 ≈1 22.96 217 ≈1 145.41 1495 ≈1 64.63 362 ≈1 38.55 247 ≈1 29.86 244 ≈1 43.2 265 ≈1 59.23 464 ≈1 83.49 285 ≈1 (continued on next page)
Table 9.4 (continued from previous page) Text ID Mr-16 Mr-17 Mr-20 Mr-21 Mr-23 Mr-24 Mr-32 Mr-34 Rt-01 Rt-02 Rt-03 Rt-04 Rt-05 Sl-02 Sm-01 Sm-02 Sm-03 Sm-04 Sm-05 T-03
N
V
3642 4170 3943 3846 4142 4255 5195 3489 968 845 892 625 1059 1371 1487 1171 617 736 447 2054
1831 1853 1725 1793 1872 1731 2382 1865 223 214 207 181 197 603 266 219 140 153 124 645
a
b
0.7176 0.7647 0.7783 0.7276 0.7682 0.7861 0.7698 0.7203 1.0524 1.0532 1.0253 1.0613 1.0694 0.8176 1.1978 1.144 1.4602 1.2956 1.1857 0.9783
2.8805 4.7619 4.6771 3.3979 4.2023 4.0401 3.2183 7.138 0.7553 1.1918 1.7099 2.0025 2.4132 1.9223 2.0183 2.0546 6.2907 3.314 2.8994 0.7324
X2 118.15 98.19 89.08 90 111.06 77.57 188.53 138.39 20.59 25.36 18.33 12.5 25.47 48.22 20.86 12.89 15.63 13.12 11.52 159.81
DF
P
1440 1508 1400 1460 1499 1439 1863 1454 190 180 191 150 193 475 230 200 115 130 102 491
≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1
Table 9.5: Negative hypergeometric (ranks) Text ID G-06 G-07 G-12 G-14 H-01 H-02 H-03 H-04 H-05 I-03 Lt-01 Lt-02 Lt-05 Lt-06 Mq-02 R-03 Sl-01
N
V
545 263 251 184 2044 1288 403 936 413 854 3311 4010 1354 829 451 1264 756
326 169 169 129 1079 789 291 609 290 483 2211 2334 909 609 143 719 457
K 1.3435 1.3761 1.3428 1.3631 1.2468 1.0286 1.0041 1.0446 1.1165 1.2155 1.1768 1.1520 1.3065 1.3293 1.9172 1.2463 1.2316
M 0.4324 0.4707 0.4709 0.4958 0.3939 0.3295 0.3678 0.3508 0.4007 0.3658 0.4101 0.3534 0.4587 0.5017 0.4104 0.3775 0.3938
n 325 168 168 128 1077 788 290 608 289 482 2210 2333 908 608 152 718 456
X2 18.07 6.81 11.00 7.40 53.56 48.28 27.30 49.44 21.99 30.95 78.85 126.94 38.62 30.34 48.04 51.10 19.37
DF
P
258 132 128 96 847 579 198 439 207 375 1666 1793 704 459 120 562 353
≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1 ≈1
Table 9.6: Variants (ranks) Text ID
Other adequate distributions
Text ID
Other adequate distributions
ZiMa, NHG, Zeta, NB RT Zeta, NHG, Zeta, NB, Good, HyPo, Waring, Geo RT Zeta, NHG, Zeta, NB, Geo NHG, ZiMa, Zeta, NB ZiMa, NHG, Zeta, NB, Geo
Kn-31 Lk-01
RT Zeta, Zeta, NHG, NB RT Zeta, NHG, Zeta, NB, Waring, Good, Geo RT Zeta, Zeta
ZiMa, NHG, Zeta, NB ZiMa, NHG, Zeta, NB NHG, ZiMa, Zeta, NB, Geo, Waring, HyPo ZiMa, NHG, Zeta, NB NHG, ZiMa, Zeta, NB ZiMa, Zeta, NHG, NB RT Zeta, Zeta, NHG ZiMa, Zeta, NHG, NB ZiMa, Zeta, NHG, NB RT Zeta, Zeta, NHG ZiMa, Zeta, NHG, NB RT Zeta, Good, Zeta, NHG ZiMa, Zeta, NHG, NB
Lt-01 Lt-02 Lt-03
Mq-03 Mr-15
E-10 E-11
ZiMa, Zeta, NHG, NB RT Zeta, Zeta, NHG, HyPa, NB ZiMa, Zeta, NHG, NB RT Zeta, Zeta, NHG, NB
E-12 E-13
RT Zeta, Zeta, NHG RT Zeta, Zeta, NB, NHG
Mr-18 Mr-20
G-01 G-02
NHG, ZiMa, Zeta, NB ZiMa, NHG, Zeta, NB
Mr-21 Mr-22
B-01 B-02 B-03 B-04 B-05 B-06 B-07 B-08 B-09 B-10 E-01 E-02 E-03 E-04 E-05 E-06 E-07 E-08 E-08 E-09
Lk-02 Lk-03 Lk-04
Lt-04 Lt-05 Lt-06 M-01 M-02 M-03 M-04 M-05 Mq-01 Mq-02
Mr-16 Mr-17
RT Zeta, Zeta, NHG ZiMa, NHG, Good, Zeta, Waring, Geo, HyPo RT Zeta, Zeta, ZiMa RT Zeta, Zeta, ZiMa NHG, Zeta, ZiMa RT Zeta, Zeta, NHG RT Zeta, ZiMa, Zeta RT Zeta, ZiMa, Zeta, NB RT Zeta, Zeta RT Zeta, NHG, Zeta, NB RT Zeta, NB, NHG, Zeta RT Zeta, Zeta, NB, NHG Zeta, RT Zeta NB, NHG, RT Zeta Good, RT Zeta, ZiMa, NB, Zeta, HyPa, JoKo, Waring ZiMa, Zeta, NB ZiMa, Zeta, NHG, NB RT Zeta, Zeta, NHG, NB RT Zeta, Good, Zeta, NHG, NB ZiMa, Zeta, NHG, NB RT Zeta, Good, Zeta, NHG, NB RT Zeta, HNG, HyPa, NB ZiMa, Zeta, NHG, NB (continued on next page)
Table 9.6 (continued from previous page) Text ID
Other adequate distributions
Text ID
Other adequate distributions
Mr-23 Mr-24 Mr-26 Mr-30
RT Zeta, Zeta, NHG, NB RT Zeta, Zeta, NHG, NB ZiMa, Zeta, NHG, NB ZiMa, Zeta, NHG, NB
Mr-31 Mr-32 Mr-33 Mr-34 Mr-35
ZiMa, Zeta, NHG, NB RT Zeta, Zeta, NHG, NB ZiMa, Zeta, NHG, NB RT Zeta, Zeta, NHG ZiMa, Zeta, NHG, NB
Mr-40
NHG, Zeta, ZiMa, NB
Mr-43
NHG, ZiMa, Zeta, NB
R-01 R-02
ZiMa, NHG, Zeta NHG, Zeta, Good
R-03
RT Zeta, ZiMa, Zeta
R-04
NHG, ZiMa, Zeta, NB
R-06 Rt-01 Rt-02 Rt-03
Hw-03
NHG, ZiMa, Zeta, NB ZiMa, NHG, Zeta, NB RT Zeta, ZiMa, Zeta NB Good, RT Zeta, ZiMa, Zeta, NB, Geo, HyPo, Waring ZiMa, NHG, Zeta, Nb NHG, ZiMa, Zeta, NB NHG, ZiMa, Zeta, NB NHG, ZiMa, Zeta, NB RT Zeta, ZiMa, Zeta, NB, Geo, Waring, HyPo ZiMa, NHG, Zeta, NB, HyPo, Waring, Geo RT Zeta, ZiMa, Good, HyPa, NB, Zeta, Geo, HyPo, Waring ZiMa, NHG, Zeta, NB, HyPo (RT) Zeta, NHG, NB, Geo, Good, HyPo, Waring, JoKo RT Zeta, NHG, Zeta, NB, Waring, Geo, HyPo RT Zeta, NHG, NB, Good, Geo, HyPo, Waring, Zeta, JoKo RT Zeta, Zeta, ZiMa Rt Zeta, ZiMa, Zeta, Good Zeta, RT Zeta, ZiMa RT Zeta, Zeta, JoKo, ZiMa, NB RT Zeta, NHG, Good, NB, JoKo, Geo, Zeta RT Zeta, Zeta, NB, NHG
Hw-05 I-02 I-03 I-05
Waring RT Zeta, Zeta, NHG RT Zeta, ZiMa, Zeta, NB RT Zeta, NHG, Zeta, NB
Ru-01 Ru-02 Ru-03 Ru-04
NHG, ZiMa, Zeta RT Zeta, Zeta, NHG, NB RT Zeta, NHG, Zeta, JoKo RT Zeta, NHG, NB, Zeta, JoKo, Waring RT Zeta, NHG, NB, Zeta, JoKo NB, NHG, RT Zeta, Waring, JoKo, Zeta NHG, ZiMa, Zeta, NB Zeta, ZiMa, NHG ZiMa, Zeta, NHG ZiMa, Zeta, NHG, NB (continued on next page)
G-03 G-04 G-06 G-07 G-08 G-09 G-10 G-11 G-12 G-13 G-14
G-15 G-16 G-16 G-17
H-02 H-03 H-04 H-05 Hw-01
Rt-04 Rt-05
Table 9.6 (continued from previous page) Text ID
Other adequate distributions
Text ID
Other adequate distributions
Ru-05
ZiMa, Zeta, NHG, NB
Sl-01
RT Zeta, ZiMa, Zeta, NB
Sl-02
RT Zeta, NHG, Zeta
Sl-03
NHG, Zeta, Zima, NB
Sl-04
ZiMa, Zeta, NHG
Kn-01 Kn-02 Kn-18 Kn-19
ZiMa, Zeta, NB, Geo, Waring, HyPo ZiMa, NHG, Zeta, NB, Waring, Geo, HyPo ZiMa, NHG, NB, Zeta, Geo, HyPo, Waring, JoKo ZiMa, Zeta, NB, Geo, HyPo, Waring RT Zeta, NHG, NB, Waring, Zeta, HyPo, Good, JoKo NHG, Zeta, ZiMa, NB RT Zeta, Zeta, NHG, NB RT Zeta, Zeta, NHG, NB RT Zeta, NHG, Zeta, NB
Sl-05 Sm-01 Sm-02 Sm-03
Kn-20
ZiMa, HyPa, Zeta, NHG, NB
Sm-04
Kn-21
NHG, ZiMa, Zeta, HyPa, NB
Sm-05
Kn-22 Kn-23 Kn-30
ZiMa, Zeta, NHG, NB RT Zeta, NHG, NB NHG, Zeta, ZiMa, NB
T-01 T-02 T-03
ZiMa, Zeta, NHG RT Zeta, Zeta, NB, NHG RT Zeta, NB, NHG, Zeta Good, Waring, RT Zeta, NB, JoKo, NHG RT Zeta, Good, NB, NHG, JoKo, Zeta, Waring RT Zeta, NHG, NB, JoKo, Waring, Zeta ZiMa ZiMa, Zeta, Good, NHG RT Zeta, Zeta, Good, NHG
In-01 In-02 In-03 In-04 In-05
As can be seen from these tables, the fitting always yields a probability P ≈ 1, strongly corroborating the theory. Even the distributions considered as secondary variants yield mostly the same results. For each text they are listed according to the value of X², but they are equivalent to the distribution included in the main table. Overall, the (truncated) Zeta distribution would do in almost all cases. It is adequate for the selected texts and has only one parameter. There are, of course, a number of other distributions that are adequate in many cases, some of them special cases of those in Tables 9.1 or 9.2. On the other hand, some distributions were never adequate, e.g. Poisson, Binomial, Conway-Maxwell-Poisson etc., and distributions without parameters were not even tested. In principle, one should try to capture all data with only one distribution – which would have been possible – but we wanted to show all possibilities given by the theory.
9.4 The spectrum
Zipf’s consideration of rank frequency distributions is restricted; it does not concern the frequency spectrum. We can transform the rank frequency distribution into a spectrum by adding together all ranked frequencies whose value is x, so that

x \le f_r < x + 1.   (9.14)

This transformation does not give all distributions from Tables 9.1 and 9.2. Besides, the transformation is not always easy. Consider e.g. the Zeta distribution with f_r = C/r^a. Inserting it in (9.14) we obtain

x \le \frac{C}{r^a} < x + 1.   (9.15)

Solving the inequalities for r we obtain (with c = C^{1/a})

\frac{c}{(x+1)^{1/a}} < r \le \frac{c}{x^{1/a}},

hence

f_x = c\left(\frac{1}{x^{1/a}} - \frac{1}{(x+1)^{1/a}}\right), \quad x = 1, 2, 3, \ldots   (9.16)

called the extended Zipf-Mandelbrot distribution (cf. Baayen 1989: 177; Chitashvili, Baayen 1993; Mandelbrot 1961). Another transformation has been shown by Haight (1969). Starting from

x - \frac{1}{2} < f(r) < x + \frac{1}{2},   (9.17)

we get, e.g. for the Zeta distribution,

P_x = \frac{1}{(2x-1)^{b}} - \frac{1}{(2x+1)^{b}}, \quad x = 1, 2, 3, \ldots   (9.18)

where b = 1/a. This distribution is called the Haight-zeta distribution (cf. Wimmer & Altmann 1999a). Although the two cases presented above give satisfactory results for many frequency spectra, it can be shown empirically that this is not the case in many other instances and that these transformations sometimes lead to very complex formulas. Nevertheless, the above theory holds true, and in this domain the conjecture of the existence of an attractor field seems to be more likely than with rank frequencies.
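For completeness, the empirical counterpart of this transformation is trivial (a sketch of our own, not the authors’ code): since observed frequencies are integers, grouping by x ≤ f_r < x + 1 amounts to counting how many words occur exactly x times.

```python
# A minimal sketch: from an observed rank-frequency list obtain the spectrum
# g(x) = number of words occurring exactly x times.
from collections import Counter

def spectrum(rank_frequencies):
    """rank_frequencies: list of frequencies f_1 >= f_2 >= ... (one per word)."""
    return dict(sorted(Counter(rank_frequencies).items()))

# toy data, not one of the analysed texts
print(spectrum([7, 5, 3, 2, 2, 1, 1, 1, 1, 1]))   # {1: 5, 2: 2, 3: 1, 5: 1, 7: 1}
```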
The results of fitting to the frequency spectra of the same texts as in Tables 9.3 to 9.6 are shown in Tables 9.7 to 9.16, with additional variants from the general family. Here, too, the “best” fit can be found in one of Tables 9.7 to 9.15; the other “good” ones are shown in Table 9.16.
Table 9.7: Frequency spectra: Johnson-Kotz distribution
Text B-09 E-01 E-05 E-07 E-08 E-10 E-11 E-13 G-03 G-09 H-01 H-03 I-01 I-02 Kn-31 Mq-03 Mr-22 R-06 Rt-02 Sl-02 Sm-05 T-02
N
V
a
X2
DF
P
550 2330 4760 5004 5083 6246 8139 11265 500 653 2044 403 6064 6064 4672 1509 4099 695 845 1371 447 1827
313 939 1495 1597 985 1333 1669 1659 281 379 1079 291 2203 2203 1920 301 1703 432 214 603 124 720
0.3064 0.4125 0.5639 0.5136 1.0942 0.5888 0.9167 1.1709 0.2921 0.2708 0.2825 0.1345 0.3694 0.3694 0.4721 1.2429 0.4440 0.2687 0.7085 0.4271 0.9441 0.3375
17.19 28.30 54.88 48.53 46.35 59.65 48.69 77.41 8.62 10.95 26.56 11.13 36.98 36.98 42.35 21.89 38.52 22.20 8.80 14.80 14.32 26.70
11 31 49 46 56 52 68 77 13 14 29 9 48 48 47 31 44 15 19 24 15 25
0.10 0.61 0.26 0.45 0.82 0.22 0.98 0.47 0.80 0.69 0.60 0.27 0.88 0.88 0.67 0.89 0.71 0.10 0.98 0.93 0.50 0.37
Table 9.8: Frequency spectra: Negative binomial distribution Text B-02 B-07 B-08 G-07 G-12 G-14 G-16 G-16 G-17 In-01 Lk-01 Sm-03
N
V
k
p
X2
DF
P
352 557 268 263 251 184 225 518 225 376 375 617
201 324 179 169 169 129 124 292 124 221 174 140
0.1426 0.1452 0.1374 0.1490 0.1466 0.1113 0.2792 0.1774 0.2792 0.1644 0.1469 0.2232
0.1441 0.1688 0.1933 0.2154 0.1630 0.1697 0.2488 0.1723 0.2488 0.1887 0.1244 0.0616
9.40 6.80 4.08 3.05 4.83 1.28 2.41 6.03 2.41 12.56 10.01 16.09
8 10 6 6 6 5 6 10 6 8 10 19
0.31 0.74 0.67 0.80 0.57 0.94 0.88 0.81 0.88 0.13 0.44 0.65
X2
DF
P
Table 9.9: Frequency spectra: Waring Text B-01 B-10 B-04 B-05 B-06 E-02 G-01 G-04 G-08 G-10 G-15 H-02 H-04 H-05 Hw-02 Kn-30 Lt-02
N
V
b
n
761 556 983 406 687 2971 1095 545 965 480 593 1288 936 413 1829 4499 4010
400 317 286 238 388 1017 530 269 509 301 378 789 609 290 256 2005 2334
1.4952 1.6923 1.5960 1.1240 1.1746 0.8492 1.3240 1.4961 1.3931 1.9651 1.3787 1.7292 1.6900 1.7767 1.1420 1.2926 1.5091
0.5051 0.5063 0.4511 0.2988 0.3051 0.3242 0.4480 0.6676 0.4729 0.5202 0.3300 0.4059 0.3281 0.2761 1.9353 0.5977 0.3654
4.39 12 0.98 7.50 9 0.59 5.47 8 0.71 6.58 8 0.58 8.88 11 0.63 37.77 36 0.39 24.56 15 0.06 6.24 11 0.86 9.96 14 0.76 24.50 7 0.93 3.81 8 0.87 8.84 12 0.72 6.06 9 0.73 3.23 5 0.66 25.73 29 0.64 31.54 37 0.72 26.70 23 0.27 (continued on next page)
Table 9.9 (continued from previous page) Text Lt-03 Lt-04 Lt-05 Lt-06 Mr-26 Mr-30 Mr-31 Mr-32 Mr-40 Mr-43 R-02 R-03 R-04 Ru-01 Ru-02 Ru-03 Ru-05 Sl-01 Sl-03 Sl-04 Sl-05 Sm-01 Sm-02 T-01 T-03
N
V
b
n
X2
DF
P
4931 4285 1354 829 4146 5054 5105 5195 5218 3356 2279 1264 1284 2595 17205 3853 6025 756 1966 3491 5588 1487 1171 1551 2054
2703 1910 909 609 2038 2911 2617 2382 2877 1962 1179 719 729 1240 6073 1792 2536 457 907 1102 2223 266 219 611 645
1.4375 1.2115 1.7619 1.9730 1.4461 1.2928 1.3962 1.1992 1.3674 1.5086 1.2498 1.2845 1.2613 1.2851 1.1628 1.1590 1.1653 1.6708 1.4205 1.1618 1.1477 1.1108 1.2225 0.9118 1.0626
0.4579 0.4899 0.4581 0.3288 0.5358 0.4139 0.4921 0.4586 0.4135 0.4542 0.3712 0.3415 0.3406 0.3976 0.4436 9.3614 0.4312 0.4211 0.5556 0.6620 0.4529 1.3721 1.6387 0.2837 0.4673
19.79 14.87 11.46 12.82 18.26 38.48 35.94 40.17 32.37 28.22 27.84 9.13 11.07 14.01 44.92 35.98 33.67 5.90 12.64 32.89 30.33 16.53 19.48 14.82 25.26
30 35 13 7 28 34 32 38 32 22 23 16 16 24 68 32 42 9 21 36 41 26 23 22 25
0.92 ≈1 0.57 0.08 0.92 0.27 0.29 0.37 0.45 0.17 0.22 0.91 0.81 0.95 0.99 0.29 0.82 0.75 0.92 0.62 0.89 0.92 0.99 0.88 0.45
DF
P
Table 9.10: Frequency spectra: Zeta Text E-03 E-09 E-12(T) G-06-(T) G-10-(T) G-13Hw-01(T) Hw-04 Hw-05 I-04 I-05 In-02 Kn-01 Kn-02 Kn-18 Kn-21 Kn-22 Kn-23 Lk-02(T) Lk-03 Lk-04 Lt-01 M-01 M-03(T) M-04 M-05 Mq-02 Mr-15 Mr-16 Mr-18 Mr-20(T) Mr-23 Mr-24(T) Mr-33
N
V
a
3247 5701 9088 545 468 460 282 7892 7620 3258 1129 373 3713 4508 4483 1455 4554 4685 1633 809 219 3311 2062 1436 1409 3635 451 4693 3642 4062 3943 4142 4255 4339
1001 1574 1825 326 297 253 103 744 680 1237 512 209 1664 1738 1782 790 1764 1738 479 272 116 2211 396 273 302 515 143 1947 1831 1788 1725 1872 1731 2217
2.0540 2.0351 1.9167 2.5423 0.5948 2.3157 1.6762 1.6327 1.5944 2.2127 2.2349 2.2802 2.2287 2.0811 2.1233 2.4230 2.1502 2.0835 2.0594 2.0403 2.3397 2.8021 1.7600 1.6367 1.8599 1.5716 1.9687 2.1852 2.3850 2.2704 2.1862 2.2560 2.0935 2.3396
X2
37.04 38 0.51 56.44 49 0.22 48.38 66 0.95 5.33 8 0.72 21.83 230 ≈1 9.72 11 0.56 7.64 11 0.75 57.04 73 0.92 63.60 73 0.78 27.43 31 0.65 12.87 18 0.80 7.34 8 0.50 25.60 36 0.90 49.78 43 0.22 50.85 43 0.19 15.48 18 0.63 34.36 42 0.79 43.03 45 0.56 17.19 23 0.80 8.77 18 0.96 1.32 7 0.99 19.95 18 0.34 24.10 36 0.94 30.43 33 0.60 29.40 27 0.34 56.52 59 0.57 6.09 14 0.55 20.91 40 0.85 19.82 28 0.87 30.01 34 0.66 22.84 32 0.88 27.09 33 0.76 29.84 39 0.85 28.58 32 0.84 (continued on next page)
Table 9.10 (continued from previous page) N
V
a
X2
DF
P
3489 4205 1738 1032 968 625 753 736
1865 2070 843 567 223 181 422 153
2.3856 2.4637 2.3073 2.5027 1.8662 1.8623 2.4453 1.7142
23.96 9.72 16.83 15.88 12.27 10.31 10.44 15.33
23 16 22 15 22 17 13 21
0.41 0.88 0.77 0.39 0.95 0.89 0.66 0.81
Text Mr-34(T) Mr-35(T) R-01 R-05 Rt-01 Rt-04(T) Ru-04 Sm-04
Table 9.11: Frequency spectra: Good distribution N
V
a
p
X2
DF
P
3507 12356 854 347 1787 4556 1187 4170 3846 892
521 1039 483 194 833 1755 281 1853 1793 207
1.6140 1.6262 1.8418 1.8418 1.9918 2.0158 1.6020 2.1171 2.1791 1.4699
0.9935 0.9990 0.8775 0.8775 0.9566 0.9799 0.9800 0.9853 0.9866 0.9800
31.68 58.87 2.91 2.91 12.95 22.13 23.14 27.48 17.98 30.04
44 84 7 7 19 31 25 31 28 25
0.92 0.98 0.89 0.89 0.84 0.88 0.57 0.87 0.93 0.22
Text Hw-03 Hw-06 I-03 In-03 Kn-19 Kn-020 M-02 Mr-17 Mr-21 Rt-03
Table 9.12: (Positive) Yule distribution
Text    N     V     b        X²     DF   P
G-02    845   361   1.5424   12.9   17   0.74
In-04   343   213   4.4563   1.67   5    0.89
Table 9.13: Negative hypergeometric distribution
Text    N     V     K        M        n    X²       DF   P
In-05   414   188   2.4591   0.1970   15   155.12   8    0.74
Table 9.14: Hyperpascal
Text   N     V     k        m        q        X²     DF   P
B-03   515   285   0.3541   1.9136   0.9231   8.01   9    0.53
Table 9.15: Zipf-Mandelbrot distribution
Text    N      V      a        b        X²      DF   P
E-04    4622   1232   2.3233   0.8572   36.66   38   0.53
E-06    4862   1176   1.9796   0.1515   46.18   50   0.60
Hw-03   3473   370    1.7673   0.9746   50.40   49   0.42
Mq-01   2330   289    1.8479   1.1256   38.23   40   0.55
Rt-05   1059   197    1.7012   0.5287   18.77   28   0.91
Table 9.16: Variants (spectrum) Text ID B-01 B-02 B-03 B-04 B-05 B-06 B-07 B-08
Other adequate distributions
Text ID
Other adequate distributions
ZiMa, NHG, Zeta, NB RT Zeta, NHG, Zeta, NB, Good, HyPo, Waring, Geo RT Zeta, NHG, Zeta, NB, Geo NHG, ZiMa, Zeta, NB ZiMa, NHG, Zeta, NB, Geo
Kn-31 Lk-01
RT Zeta, Zeta, NHG, NB RT Zeta, NHG, Zeta, NB, Waring, Good, Geo RT Zeta, Zeta
ZiMa, NHG, Zeta, NB ZiMa, NHG, Zeta, NB NHG, ZiMa, Zeta, NB, Geo, Waring, HyPo
Lt-01 Lt-02 Lt-03
Lk-02 Lk-03 Lk-04
RT Zeta, Zeta, NHG ZiMa, NHG, Good, Zeta, Waring, Geo, HyPo RT Zeta, Zeta, ZiMa RT Zeta, Zeta, ZiMa NHG, Zeta, ZiMa (continued on next page)
Table 9.16 (continued from previous page) Text ID
Other adequate distributions
Text ID
Other adequate distributions RT Zeta, Zeta, NHG RT Zeta, ZiMa, Zeta RT Zeta, ZiMa, Zeta, NB RT Zeta, Zeta RT Zeta, NHG, Zeta, NB RT Zeta, NB, NHG, Zeta RT Zeta, Zeta, NB, NHG Zeta, RT Zeta NB, NHG, RT Zeta Good, RT Zeta, ZiMa, NB, Zeta, HyPa, JoKo, Waring ZiMa, Zeta, NB ZiMa, Zeta, NHG, NB
B-09 B-10 E-01 E-02 E-03 E-04 E-05 E-06 E-07 E-08
ZiMa, NHG, Zeta, NB NHG, ZiMa, Zeta, NB ZiMa, Zeta, NHG, NB RT Zeta, Zeta, NHG ZiMa, Zeta, NHG, NB ZiMa, Zeta, NHG, NB RT Zeta, Zeta, NHG ZiMa, Zeta, NHG, NB RT Zeta, Good, Zeta, NHG ZiMa, Zeta, NHG, NB
Lt-04 Lt-05 Lt-06 M-01 M-02 M-03 M-04 M-05 Mq-01 Mq-02
E-08 E-09
Mq-03 Mr-15
E-10 E-11
ZiMa, Zeta, NHG, NB RT Zeta, Zeta, NHG, HyPa, NB ZiMa, Zeta, NHG, NB RT Zeta, Zeta, NHG, NB
E-12 E-13
RT Zeta, Zeta, NHG RT Zeta, Zeta, NB, NHG
Mr-18 Mr-20
G-01 G-02 G-03 G-04 G-06 G-07
NHG, ZiMa, Zeta, NB ZiMa, NHG, Zeta, NB NHG, ZiMa, Zeta, NB ZiMa, NHG, Zeta, NB RT Zeta, ZiMa, Zeta NB Good, RT Zeta, ZiMa, Zeta, NB, Geo, HyPo, Waring ZiMa, NHG, Zeta, Nb NHG, ZiMa, Zeta, NB NHG, ZiMa, Zeta, NB NHG, ZiMa, Zeta, NB RT Zeta, ZiMa, Zeta, NB, Geo, Waring, HyPo ZiMa, NHG, Zeta, NB, HyPo, Waring, Geo RT Zeta, ZiMa, Good, HyPa, NB, Zeta, Geo, HyPo, Waring
Mr-21 Mr-22 Mr-23 Mr-24 Mr-26 Mr-30
RT Zeta, Zeta, NHG, NB RT Zeta, Good, Zeta, NHG, NB ZiMa, Zeta, NHG, NB RT Zeta, Good, Zeta, NHG, NB RT Zeta, HNG, HyPa, NB ZiMa, Zeta, NHG, NB RT Zeta, Zeta, NHG, NB RT Zeta, Zeta, NHG, NB ZiMa, Zeta, NHG, NB ZiMa, Zeta, NHG, NB
Mr-31 Mr-32 Mr-33 Mr-34 Mr-35
ZiMa, Zeta, NHG, NB RT Zeta, Zeta, NHG, NB ZiMa, Zeta, NHG, NB RT Zeta, Zeta, NHG ZiMa, Zeta, NHG, NB
Mr-40
NHG, Zeta, ZiMa, NB
Mr-43
NHG, ZiMa, Zeta, NB
G-08 G-09 G-10 G-11 G-12 G-13 G-14
Mr-16 Mr-17
(continued on next page)
Table 9.16 (continued from previous page) Text ID
Text ID
Other adequate distributions
ZiMa, NHG, Zeta, NB, HyPo (RT) Zeta, NHG, NB, Geo, Good, HyPo, Waring, JoKo RT Zeta, NHG, Zeta, NB, Waring, Geo, HyPo RT Zeta, NHG, NB, Good, Geo, HyPo, Waring, Zeta, JoKo RT Zeta, Zeta, ZiMa Rt Zeta, ZiMa, Zeta, Good Zeta, RT Zeta, ZiMa RT Zeta, Zeta, JoKo, ZiMa, NB RT Zeta, NHG, Good, NB, JoKo, Geo, Zeta RT Zeta, Zeta, NB, NHG
R-01 R-02
ZiMa, NHG, Zeta NHG, Zeta, Good
R-03
RT Zeta, ZiMa, Zeta
R-04
NHG, ZiMa, Zeta, NB
R-06 Rt-01 Rt-02 Rt-03
Ru-01 Ru-02 Ru-03 Ru-04 Ru-05 Sl-01
RT Zeta, ZiMa, Zeta, NB
Sl-02
RT Zeta, NHG, Zeta
Sl-03
NHG, Zeta, Zima, NB
Sl-04
ZiMa, Zeta, NHG
Kn-01 Kn-02 Kn-18 Kn-19
Waring RT Zeta, Zeta, NHG RT Zeta, ZiMa, Zeta, NB RT Zeta, NHG, Zeta, NB ZiMa, Zeta, NB, Geo, Waring, HyPo ZiMa, NHG, Zeta, NB, Waring, Geo, HyPo ZiMa, NHG, NB, Zeta, Geo, HyPo, Waring, JoKo ZiMa, Zeta, NB, Geo, HyPo, Waring RT Zeta, NHG, NB, Waring, Zeta, HyPo, Good, JoKo NHG, Zeta, ZiMa, NB RT Zeta, Zeta, NHG, NB RT Zeta, Zeta, NHG, NB RT Zeta, NHG, Zeta, NB
NHG, ZiMa, Zeta RT Zeta, Zeta, NHG, NB RT Zeta, NHG, Zeta, JoKo RT Zeta, NHG, NB, Zeta, JoKo, Waring RT Zeta, NHG, NB, Zeta, JoKo NB, NHG, RT Zeta, Waring, JoKo, Zeta NHG, ZiMa, Zeta, NB Zeta, ZiMa, NHG ZiMa, Zeta, NHG ZiMa, Zeta, NHG, NB ZiMa, Zeta, NHG, NB
Sl-05 Sm-01 Sm-02 Sm-03
Kn-20
ZiMa, HyPa, Zeta, NHG, NB
Sm-04
ZiMa, Zeta, NHG RT Zeta, Zeta, NB, NHG RT Zeta, NB, NHG, Zeta Good, Waring, RT Zeta, NB, JoKo, NHG RT Zeta, Good, NB, NHG, JoKo, Zeta, Waring (continued on next page)
G-15 G-16 G-16 G-17
H-02 H-03 H-04 H-05 Hw-01 Hw-03 Hw-05 I-02 I-03 I-05 In-01 In-02 In-03 In-04 In-05
Other adequate distributions
Rt-04 Rt-05
Table 9.16 (continued from previous page) Text ID
Other adequate distributions
Text ID
Other adequate distributions
Kn-21
NHG, ZiMa, Zeta, HyPa, NB
Sm-05
Kn-22 Kn-23 Kn-30
ZiMa, Zeta, NHG, NB RT Zeta, NHG, NB NHG, Zeta, ZiMa, NB
T-01 T-02 T-03
RT Zeta, NHG, NB, JoKo, Waring, Zeta ZiMa ZiMa, Zeta, Good, NHG RT Zeta, Zeta, Good, NHG
9.5 Evaluations
Let us consider first the rank-frequency models. As can be seen in Tables 9.3 to 9.6 the Zeta distribution or its right truncated (RT) variant is sufficient as a model in all cases. The probability of a good fit is in all cases P ≈ 1. Even in tables where the Zeta distribution was not the “best” one, its probability was usually ≈ 1 but its chi-square value was slightly greater than that of the tabulated distribution. Not all distributions of the general theory could be applied. Some of them do not even appear in the table of possibilities. The three main cases, namely Zeta, Zipf-Mandelbrot and negative hypergeometric, are sufficient to model any (homogeneous) rank frequency data of word forms. On economy grounds it is recommended to use Zeta as much as possible and to switch to more complex distributions (having more parameters) only if Zeta is not satisfactory. Let us define the function

\Phi(a, b, c) = \sum_{x=1}^{\infty} \frac{a^x}{(x+b)^c}   (9.19)

which is analogous to the Lerch function (Erdélyi et al. 1955). Hence the Zeta distribution can be written as

P_x = \frac{1}{x^a\,\Phi(1, 0, a)}, \quad x = 1, 2, 3, \ldots,   (9.20)

and the Zipf-Mandelbrot distribution as

P_x = \frac{1}{(x+b)^a\,\Phi(1, b, a)}, \quad x = 1, 2, 3, \ldots   (9.21)

In case of truncation on the right hand side we have

\sum_{x=1}^{n} \frac{a^x}{(x+b)^c} = \sum_{x=1}^{\infty} \frac{a^x}{(x+b)^c} - \sum_{x=n+1}^{\infty} \frac{a^x}{(x+b)^c} = \sum_{x=1}^{\infty} \frac{a^x}{(x+b)^c} - \sum_{x=1}^{\infty} \frac{a^{x+n}}{(x+n+b)^c} = \Phi(a, b, c) - a^n \Phi(a, b+n, c).   (9.22)

Hence the right truncated Zeta distribution can be written as

P_x = \frac{1}{x^a\,[\Phi(1, 0, a) - \Phi(1, n, a)]}, \quad x = 1, 2, 3, \ldots, n,   (9.23)

and the right truncated Zipf-Mandelbrot distribution as

P_x = \frac{1}{(x+b)^a\,[\Phi(1, b, a) - \Phi(1, b+n, a)]}, \quad x = 1, 2, 3, \ldots, n.   (9.24)
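A small computational sketch (ours; the parameter value and the choice of the truncation point below are illustrative assumptions, not taken from the original procedure) evaluates (9.23) and (9.24) directly, using the fact that the bracketed differences of Φ values collapse to finite sums:

```python
# A minimal sketch of (9.23) and (9.24). The normalizing differences reduce
# to finite sums, e.g. Phi(1,0,a) - Phi(1,n,a) = sum_{x=1..n} x^(-a),
# so no infinite series has to be evaluated.
def rt_zeta(a, n):
    norm = sum(x ** (-a) for x in range(1, n + 1))           # Phi(1,0,a) - Phi(1,n,a)
    return [x ** (-a) / norm for x in range(1, n + 1)]

def rt_zipf_mandelbrot(a, b, n):
    norm = sum((x + b) ** (-a) for x in range(1, n + 1))      # Phi(1,b,a) - Phi(1,b+n,a)
    return [(x + b) ** (-a) / norm for x in range(1, n + 1)]

# e.g. with the Table 9.3 parameter for text R-01 (a = 0.7058), assuming here
# that the truncation point equals the vocabulary size V = 843
p = rt_zeta(0.7058, 843)
print(sum(p))          # 1.0: the probabilities are normalized by construction
```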
The negative hypergeometric distribution has a “naturally” limited support. Although in many cases there are deviations from these distributions, especially at the low ranks, they are sufficient for any purpose. Backed by the above results, we can say that in order to capture the rank frequency distribution of a homogeneous, not mixed, text a field of three attractors is sufficient. The spectrum is more variable and the attractor field is greater. But even here the Zeta distribution alone would be sufficient, though it does not always represent the best of the actual fits. We can conclude that Zipf was right, and all other distributions, especially the many modifications of the Zipf-Mandelbrot distribution, arose because one either used mixed texts (e.g. complete corpora) or was forced to consider some boundary conditions which were not present in other texts. Herdan’s proposal to ignore the rank frequency and consider only the spectrum, for which the Waring distribution is the true model, must be considered somewhat premature, even if this distribution is adequate in most cases. In general we can state that there is no “true” model, as there are many possibilities of text formation. However, we can say that there is a background mechanism based on self-regulation forcing texts to remain grosso modo in a prescribed form, violating it only to a certain extent and not surpassing the limit of comprehensibility for the reader.
9.6 Ord’s criterion
There are different possibilities to characterize a distribution. One of them is the use of functions of moments, enabling us to place the texts in Cartesian coordinates and get a visual impression of their patterning. Here we shall use Ord’s criterion (Ord 1972), defined for empirical data as

I = \frac{m_2}{m'_1} \quad \text{and} \quad S = \frac{m_3}{m_2}   (9.25)

where m'_1 is the mean of the variable, and m_2 and m_3 are the second and third central moments of the variable. In general one defines them as

m'_1 = \frac{1}{N} \sum_{x=1}^{K} x f_x \quad \text{and} \quad m_z = \frac{1}{N} \sum_{x=1}^{K} (x - m'_1)^z f_x,

where x is the rank and f_x is the corresponding frequency for rank frequency distributions; for spectra, x is the frequency and g(x) is the number of words occurring x times. N is the text length for rank frequency (or the vocabulary V for spectra), K is the number of classes (that is, K = V for rank frequency but K = W for spectra), and z is the order of the moment. In empirical cases one sometimes divides by (N − 1) instead of N in order to obtain a better estimator, but in our case this is irrelevant since we are usually concerned with very large N values. For general orientation we present the adapted Ord’s scheme in Figure 9.1; the binomial line (CB) crossing the Poisson point (B) and continuing in the negative binomial line (BD) is the straight line S = 2I − 1.
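The computation behind the (I, S) values reported below can be sketched as follows (a minimal illustration of (9.25) of our own, not the script actually used):

```python
# A minimal sketch of (9.25): compute Ord's (I, S) point from an empirical
# frequency list. For rank-frequency data x is the rank and f_x its frequency
# (the divisor is then N); for spectra x is the occurrence number and
# f_x = g(x) (the divisor is then V).
def ord_point(freqs):
    """freqs: list of f_x for x = 1, 2, ..., K."""
    n = sum(freqs)
    m1 = sum(x * f for x, f in enumerate(freqs, start=1)) / n
    m2 = sum((x - m1) ** 2 * f for x, f in enumerate(freqs, start=1)) / n
    m3 = sum((x - m1) ** 3 * f for x, f in enumerate(freqs, start=1)) / n
    return m2 / m1, m3 / m2          # (I, S)

# toy rank-frequency data, not one of the analysed texts
I, S = ord_point([10, 6, 3, 2, 1, 1, 1])
print(I, S, S < 2 * I - 1)           # does the point lie below Ord's line S = 2I - 1?
```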
Figure 9.1: Ord’s (I, S) chart (Ord 1967)
Table 9.17 represents the (I, S) values obtained for the English texts.

Table 9.17: The (I, S) domain of English texts
Text   N       V      I        S
E-01   2330    939    347.59   333.58
E-02   2971    1017   297.73   415.74
E-03   3247    1001   377.96   408.92
E-04   4622    1232   449.07   517.48
E-05   4760    1495   576.15   617.19
E-06   4862    1176   434.76   510.69
E-07   5004    1597   630.13   665.26
E-08   5083    985    347.63   438.42
E-09   5701    1574   618.19   696.17
E-10   6246    1333   481.67   595.87
E-11   8193    1669   606.86   759.03
E-12   9088    1825   699.67   882.64
E-13   11625   1659   570.52   773.01
Consider first the pairs (I, S) for the English texts: evidently, they lie on a straight line S = 3.9868 + 1.1785I (with R² = 0.89), as is shown in Figure 9.2.
Figure 9.2: Ord’s criterion for English texts
As can easily be seen, all English texts lie below the S = 2I − 1 line, i.e. in the domain of the negative hypergeometric distribution. In order to see how rank frequency distributions are patterned, we present the complete table of (I, S) values (cf. Table 9.18).
Table 9.18: Ord’s scheme for rank frequency distributions of 145 texts in 20 languages
Text ID B-01 B-02 B-03 B-04 B-05 B-06 B-07 B-08 B-09 B-10 Cz-01 Cz-02 Cz-03 Cz-04 Cz-05 Cz-06 Cz-07 Cz-08 Cz-09 Cz-10 E-01 E-02 E-03 E-04 E-05 E-06 E-07 E-08 E-09
N
V
I
S
Text ID
N
761 352 515 483 406 687 557 268 550 556 1044 984 2858 522 999 1612 2014 677 460 1156 2330 2971 3247 4622 4760 4862 5004 5083 5701
400 201 285 286 238 388 324 179 313 317 638 543 1274 323 556 840 862 389 259 638 939 1017 1001 1232 1495 1176 1597 985 1574
130.18 61.68 88.02 88.04 74.32 127.46 99.15 50.65 98.94 101.26 202.00 176.48 477.32 93.45 187.88 295.61 315.33 119.47 79.27 214.75 347.59 297.73 377.96 449.07 576.15 434.76 630.13 347.63 618.19
106.94 47.91 70.82 63.88 54.47 94.64 74.43 31.06 74.87 75.13 134.25 135.43 421.71 64.93 138.96 231.24 292.08 89.22 61.35 161.96 333.58 415.74 408.92 517.48 617.19 510.69 665.26 438.42 696.17
Kn-18 Kn-19 Kn-20 Kn-21 Kn-22 Kn-23 Kn-30 Kn-31 Lk-01 Lk-02 Lk-03 Lk-04 Lt-01 Lt-02 Lt-03 Lt-04 Lt-05 Lt-06 M-01 M-02 M-03 M-04 M-05 Mq-01 Mq-02 Mq-03 Mr-15 Mr-16 Mr-17
4483 1787 4556 1455 4554 4685 4499 4672 345 1633 809 219 3311 4010 4931 4285 1354 829 2062 1187 1437 1409 3635 2330 451 1509 4693 3642 4170
V
I
S
1782 602.80 611.83 833 262.27 243.35 1755 594.81 613.10 790 248.93 198.50 1764 616.41 628.27 1738 595.65 631.28 2005 669.57 628.93 1920 658.87 656.52 174 55.34 49.05 479 180.17 204.18 272 95.81 106.26 116 35.05 29.79 2211 651.17 375.00 2334 771.62 538.25 2703 884.67 679.58 1910 668.66 613.81 909 263.56 153.23 609 165.91 77.42 396 128.53 175.38 281 91.60 118.17 273 85.42 118.01 302 99.98 135.71 515 161.89 242.83 289 81.45 117.69 143 45.91 54.23 301 99.12 125.91 1947 667.55 656.20 1831 604.45 512.69 1853 609.55 577.33 (continued on next page)
Table 9.18 (continued from previous page) Text ID
N
V
I
S
Text ID
N
V
I
S
E-10 6246 1333 481.67 595.87 Mr-18 4062 1788 614.74 576.60 E-11 8193 1669 606.86 759.03 Mr-20 3943 1725 574.54 548.87 E-12 9088 1825 699.67 882.64 Mr-21 3846 1793 585.38 536.01 E-13 11625 1659 570.52 773.01 Mr-22 4099 1703 600.03 582.92 G-01 1095 539 184.88 158.53 Mr-23 4142 1872 626.79 580.68 G-02 845 361 111.39 110.76 Mr-24 4255 1731 571.85 573.14 G-03 500 281 90.56 68.96 Mr-26 4146 2038 688.14 590.29 G-04 545 269 85.29 75.52 Mr-30 5054 2911 973.46 773.66 G-05 559 332 104.66 73.45 Mr-31 5105 2617 878.52 720.08 G-06 545 326 99.03 70.68 Mr-32 5195 2382 832.17 748.48 G-07 263 169 48.40 31.89 Mr-33 4339 2217 715.87 603.95 G-08 965 509 167.21 135.40 Mr-34 3489 1865 599.92 482.41 G-09 653 379 121.10 88.31 Mr-40 5218 2877 950.94 719.60 G-10 480 301 90.17 59.85 Mr-43 3356 1962 616.01 444.12 1738 843 286.54 247.10 G-11 468 297 86.93 57.50 R-01 G-12 251 169 48.65 28.69 R-02 2279 1179 411.83 325.63 1264 719 235.94 172.52 G-13 460 253 75.60 61.31 R-03 G-14 184 129 35.65 19.50 R-04 1284 729 236.94 175.60 G-15 593 378 112.69 73.02 R-05 1032 567 184.11 141.55 G-16 518 292 87.13 70.22 R-06 695 432 134.57 88.15 968 223 74.65 97.78 G-17 225 124 34.78 29.75 Rt-01 H-01 2044 1079 339.16 238.99 Rt-02 845 214 71.61 93.40 H-02 1288 789 252.99 163.90 Rt-03 892 207 61.16 80.91 H-03 403 291 84.11 38.29 Rt-04 625 181 58.03 71.62 H-04 936 609 189.70 110.45 Rt-05 1059 197 53.48 74.61 H-05 413 290 84.70 42.42 Ru-01 2595 1240 446.37 379.64 Hw-01 282 104 28.81 33.25 Ru-02 17205 6073 2449.60 2432.38 Hw-02 1829 257 62.81 72.38 Ru-03 3853 1792 657.99 566.26 Hw-03 3507 521 170.55 247.35 Ru-04 753 422 133.38 102.11 6025 2536 945.41 875.42 Hw-04 7892 744 71.69 107.17 Ru-05 Hw-05 7620 680 206.36 342.10 Sl-01 756 457 143.80 98.00 Hw-06 12356 1039 337.57 548.71 Sl-02 1371 603 206.91 195.94 I-01 11760 3667 1468.52 1559.31 Sl-03 1966 907 318.06 281.75 I-02 6064 2203 862.78 862.44 Sl-04 3491 1102 421.06 449.07 I-03 854 483 159.22 117.15 Sl-05 5588 2223 838.40 803.32 I-04 3258 1237 456.67 455.77 Sm-01 1487 266 87.88 118.32 I-05 1129 512 172.76 159.81 Sm-02 1171 219 67.10 90.38 (continued on next page)
Table 9.18 (continued from previous page) Text ID
N
V
I
S
Text ID
N
V
I
In-01 In-02 In-03 In-04 In-05 Kn-01 Kn-02
376 373 347 343 414 3713 4508
221 209 194 213 188 1664 1738
657.26 611.41 54.38 559.42 54.18 561.00 582.68
496.88 493.30 449.06 409.04 561.12 522.19 605.49
Sm-03 Sm-04 Sm-05 T-01 T-02 T-03
617 736 447 1551 1827 2054
140 153 124 611 720 645
S
41.17 59.01 46.11 64.58 37.86 49.32 240.25 228.73 285.01 268.62 259.13 275.85
All data points represented in Table 9.18 are graphically illustrated in the plot of Figure 9.3. The common straight line for the data in Figure 9.3 is S = −3.3944 + 0.9591I with R2 = 0.95. This means that the (I, S)-relation of rank frequencies of words is a straight line.
Figure 9.3: (I, S) points for rank frequency distributions from Table 9.18, together with Ord’s line S = 2I − 1
For every language we separately compute the linear regression S = a + bI which seems to be a correct relation if we take into account the result represented in Figure 9.3. For individual languages we obtain the parameters and the determination coefficient R2 as given in Table 9.19.
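A minimal sketch of this per-language regression (ours; numpy’s least-squares fit stands in for whatever software was actually used), applied to the English values of Table 9.17:

```python
# A minimal sketch of the regression S = a + bI and its determination
# coefficient R^2, using the English (I, S) points from Table 9.17.
import numpy as np

def ord_line(I, S):
    b, a = np.polyfit(I, S, 1)                  # slope, intercept
    residuals = S - (a + b * I)
    r2 = 1 - residuals.var() / S.var()
    return a, b, r2

I = np.array([347.59, 297.73, 377.96, 449.07, 576.15, 434.76, 630.13,
              347.63, 618.19, 481.67, 606.86, 699.67, 570.52])
S = np.array([333.58, 415.74, 408.92, 517.48, 617.19, 510.69, 665.26,
              438.42, 696.17, 595.87, 759.03, 882.64, 773.01])
print(ord_line(I, S))    # roughly a ≈ 4.0, b ≈ 1.18, R² ≈ 0.89, as reported above
```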
Table 9.19: Ord’s lines for ranks in 20 languages (ranking by slope b)
Language     a           b        R²
Marathi      184.5944    0.6954   0.76
Hungarian    −27.5158    0.7696   ≈1
German       0.3981      0.7791   0.89
Tagalog      43.0187     0.8212   0.53
Latin        −62.2284    0.8253   0.91
Bulgarian    −8.4465     0.8466   0.97
Romanian     −23.2738    0.8659   0.98
Indonesian   −4.2114     0.9112   0.97
Czech        −28.052     0.9306   0.98
Slovene      −20.7989    1.0020   0.99
Russian      −70.7551    1.0166   ≈1
Italian      −45.4271    1.0839   ≈1
Kannada      −53.8527    1.0899   0.98
English      3.9868      1.1785   0.89
Lakota       −14.2285    1.2179   ≈1
Rarotongan   5.3924      1.2271   0.93
Samoan       2.3813      1.3198   0.99
Marquesan    −7.1924     1.4103   0.95
Maori        −30.143     1.6581   0.99
Hawaii       −22.3445    1.6918   ≈1
Almost all (I, S) points (cf. Table 9.18) are located below Ord’s line S = 2I − 1. As can be seen, except for Indonesian, all languages display a very regular behaviour of (I, S). In Indonesian there was an outlier disturbing the relationship (yielding R2 = 0.01). We assumed that five texts were not enough and that analyses of further texts would eliminate the influence of the outlier point. To this end we took three additional texts from Internet (www.TokohIndonesia.com), namely biographies of writers Mochtar Lubis (N = 1432, V = 787, I = 242.57, S = 191.41), Ajip Rosidi (N = 1683, V = 669, I = 222.61, S = 226.85) and P.A. Toer (N = 1543, V = 730, I = 239.44, S = 214.68), and obtained a straight line S = −4.2114 + 0.9112I yielding R2 = 0.97. The weight of the outlier disappeared, corroborating the assumption of linear dependence. A similar case can be found in Tagalog, for which we had only 3 texts. The individual lines are shown in Figure 9.4. Since different line types had not distinguished the languages sufficiently, we gave in Figure 9.4 the names of languages in the order of lines. For the
sake of simplicity, we presented the straight lines as S − a = 2I. As can again be seen, by and large the analytic languages have a steeper slope than the more synthetic ones. However, this conclusion must be tested on many more texts and languages. Besides, all lines lie under Ord’s line. This means that rank frequency distributions lie in the area of the negative hypergeometric distribution in a very orderly way. As to Ord’s criterion, if a theoretical distribution is of type X then it lies in the area of X, but if an empirical distribution lies in the area of X, it need not necessarily be of type X because the areas (we did not draw all areas) overlap.
Figure 9.4: Ord’s lines for rank frequencies in 20 languages
For spectra we present the results in Table 9.20; they are quite different but agree with our assumption.
Table 9.20: Ord’s scheme for spectra of 145 texts in 20 languages Text ID B-01 B-02 B-03 B-04 B-05 B-06 B-07 B-08 B-09 B-10 Cz-01 Cz-02 Cz-03 Cz-04 Cz-05 Cz-06 Cz-07 Cz-08 Cz-09 Cz-10 E-01 E-02 E-03 E-04 E-05 E-06 E-07 E-08 E-09 E-10 E-11 E-12 E-13 G-01
N
V
I
S
Text ID
761 352 515 483 406 687 557 268 550 556 1044 984 2858 522 999 1612 2014 677 460 1156 2330 2971 3247 4622 4760 4862 5004 5083 5701 6246 8193 9088 11265 1095
400 201 285 286 238 388 324 179 313 317 638 543 1274 323 556 840 862 389 259 638 939 1017 1001 1232 1495 1176 1597 985 1574 1333 1669 1825 1659 530
5.12 2.41 2.61 2.75 2.83 4.75 2.56 1.32 3.34 4.51 5.71 5.92 22.38 2.36 10.20 14.44 18.06 3.68 3.72 6.21 20.69 26.31 41.24 60.40 46.03 79.64 45.14 92.52 54.36 94.61 101.08 104.08 127.20 10.75
23.14 7.38 8.15 11.40 10.21 19.24 9.94 5.14 12.33 18.38 37.88 33.80 113.35 10.78 60.95 72.53 83.51 18.61 18.86 27.84 81.43 92.68 149.90 255.26 183.94 324.05 155.47 341.65 202.44 385.56 402.53 387.00 466.87 53.86
Kn-8 Kn-9 Kn-20 Kn-21 Kn-22 Kn-23 Kn-30 Kn-31 Lk-01 Lk-02 Lk-03 Lk-04 Lt-01 Lt-02 Lt-03 Lt-04 Lt-05 Lt-06 M-01 M-02 M-03 M-04 M-05 Mq-01 Mq-02 Mq-03 Mr-15 Mr-16 Mr-17 Mr-18 Mr-20 Mr-21 Mr-22 Mr-23
N
V
I
S
4483 1782 13.26 80.17 1787 833 5.12 23.46 4556 1755 14.91 92.44 1455 790 5.06 28.82 4554 1764 16.46 103.59 4685 1738 13.98 70.33 4499 2005 12.34 96.34 4672 1920 10.56 53.61 345 174 3.56 10.73 1633 479 26.21 72.67 809 272 13.58 37.14 219 116 2.82 10.03 3311 2211 7.52 97.95 4010 2334 13.59 132.74 4931 2703 7.63 51.43 4285 1910 12.37 57.25 1354 909 2.50 17.74 829 609 1.41 9.38 2062 396 38.07 99.76 1187 281 24.47 78.11 1436 273 31.10 81.62 1409 302 28.55 79.34 3635 515 59.27 150.25 2330 289 48.93 146.68 451 143 9.90 25.10 1509 301 52.40 161.34 4693 1947 12.71 71.87 3642 1831 6.90 33.96 4170 1853 7.71 35.89 4062 1788 11.44 65.73 3943 1725 7.83 32.27 3846 1793 6.22 26.88 4099 1703 14.07 73.39 4142 1872 8.36 37.06 (continued on next page)
Table 9.20 (continued from previous page) Text ID
N
V
I
S
Text ID
G-02 G-03 G-04 G-05 G-06 G-07 G-08 G-09 G-10 G-11 G-12 G-13 G-14 G-15 G-16 G-17 H-01 H-02 H-03 H-04 H-05 Hw-01 Hw-02 Hw-03 Hw-04 Hw-05 Hw-06 I-01 I-02 I-03 I-04 I-05 In-01 In-02 In-03 In-04 In-05
845 500 545 559 545 263 965 653 480 468 251 460 184 593 518 225 2944 1288 403 936 413 282 1829 3507 7892 7620 12356 11760 6064 854 3258 1129 376 373 347 343 414
361 281 269 332 326 169 509 379 301 297 169 253 129 378 292 124 1079 789 291 609 290 103 256 521 744 680 1039 3667 2203 483 1237 512 221 209 194 213 188
6.78 4.34 4.68 4.09 3.09 1.82 5.50 3.85 2.43 2.05 1.65 2.56 1.21 2.10 2.07 1.64 29.85 15.48 6.21 9.37 3.94 4.13 8.84 67.48 32.84 129.75 226.18 61.92 38.67 1.70 19.88 7.25 2.11 2.25 1.70 1.05 2.58
30.61 19.51 18.97 18.71 17.12 9.82 23.72 16.67 11.03 10.15 7.56 9.88 5.53 9.02 6.92 5.17 189.82 109.96 41.68 59.28 24.26 10.12 21.42 183.48 73.73 275.16 545.29 239.65 154.92 6.40 73.81 25.29 8.19 9.11 6.40 4.87 6.36
Mr-24 Mr-26 Mr-30 Mr-31 Mr-32 Mr-33 Mr-34 Mr-40 Mr-43 R-01 R-02 R-03 R-04 R-05 R-06 Rt-01 Rt-02 Rt-03 Rt-04 Rt-05 Ru-01 Ru-02 Ru-03 Ru-04 Ru-05 Sl-01 Sl-02 Sl-03 Sl-04 Sl-05 Sm-01 Sm-02 Sm-03 Sm-04 Sm-05 T-01 T-02
N
V
I
S
4255 1731 9.28 40.48 4146 2038 8.37 42.24 5054 2911 8.05 44.18 5105 2617 8.43 44.07 5195 2382 10.56 46.91 4339 2217 6.09 34.84 3489 1865 4.64 19.98 5218 2877 7.44 41.26 3356 1962 4.04 20.48 1738 846 8.36 36.62 2279 1179 13.09 67.38 1264 719 6.51 38.80 1284 729 5.33 28.01 1032 567 5.39 25.48 695 432 3.36 15.80 968 223 28.47 77.52 845 214 17.79 41.07 892 207 14.99 39.30 625 181 12.17 31.23 1059 197 16.09 39.29 2595 1240 15.95 84.47 17205 6073 81.71 423.59 3853 1792 17.24 77.60 753 422 41.92 19.04 6025 2536 24.43 121.12 756 457 5.04 30.84 1371 603 8.48 35.18 1966 907 14.85 69.12 3491 1102 55.99 230.53 5588 2223 27.59 121.07 1487 266 40.48 100.63 1171 219 26.76 65.97 617 140 13.10 24.17 736 153 20.33 47.62 447 124 9.82 20.96 1551 611 23.01 65.24 1827 720 28.15 83.68 (continued on next page)
Table 9.20 (continued from previous page) Text ID
N
V
I
S
Text ID
Kn-01 Kn-02
3713 4508
1664 1738
13.25 11.71
94.53 56.63
T-03
N
V
I
S
2054
645
33.92
86.72
The overall formula to cover the tendency as expressed by the data from Table 9.20 is still a straight line S = 15.5268 + 3.0453I with R2 = 0.88 as shown in Figure 9.5 (see p. 163) but there are greater divergences between languages.
Figure 9.5: The (I, S) points for spectra from Table 9.20, together with Ord’s line S = 2I − 1
Again, we obtain linear dependences for individual languages; the relevant data are represented in Table 9.21.
Table 9.21: Ord’s lines for spectra in 20 languages (ranking by slope b)
Language     a          b        R²
Indonesian   3.5051     1.7961   0.40
Tagalog      23.5422    1.9395   0.83
Maori        17.9157    2.2016   0.98
Hawaii       0.4188     2.3586   0.99
Samoan       −7.5433    2.6886   1.00
Lakota       1.5138     2.6969   1.00
Rarotonga    −5.8530    2.8787   0.98
Marquesan    −6.4194    3.1680   1.00
English      0.6461     3.8289   0.99
Italian      −1.6771    3.9287   ≈1
Slovene      7.7837     4.0002   1.00
Bulgarian    −2.7309    4.7397   0.96
Czech        2.8556     4.8506   0.98
German       −1.4108    4.9205   0.98
Romanian     −0.3358    5.0929   0.95
Russian      −5.1935    5.2382   1.00
Marathi      −5.3390    5.6441   0.91
Hungarian    1.8490     6.4110   0.99
Kannada      −9.6387    6.8265   0.84
Latin        3.1605     7.7194   0.66
The graphs of these lines are shown in Figure 9.6 (see p. 165). In this figure we see again that by and large the steeper the slope, the more synthetic the language. Adding further texts, the slopes will surely change, but we can conclude that Ord’s criterion can be used also for spectra as a typological characteristic. If one compares Figure 9.6 with Table 9.21, one sees that with three exceptions all lines lie above Ord’s line.³ Both dependencies (for ranks and spectra) show a clear patterning of frequency distributions. If one plots the (I, S)-points for all languages in a graph, one can easily see that they all lie on a straight line with a very small dispersion. The rank frequency data lie much below Ord’s straight line (S = 2I − 1), the spectrum data lie high above this line. But the distributions are controlled in such a way that, having the rank frequency (I, Sr)-point, one can easily predict the Ss of the spectrum: if Sr = a + bI and Ss = c + dI, then Ss = c + d(Sr − a)/b. This works, of course, only because the relations are “very” linear.

³ Two of them, Indonesian and Tagalog, have very bad determination coefficients. But using the additional Indonesian data for spectra (Mochtar Lubis: N = 1432, V = 787, I = 4.43, S = 26.34; Ajip Rosidi: N = 1683, V = 669, I = 8.40, S = 28.28; P.A. Toer: N = 1543, V = 730, I = 7.40, S = 33.63) we obtain the dependence S = 0.4546 + 3.9954I with R² = 0.86, showing that, again, we had to do with an outlier whose weight disappears when further data are added. There is, of course, a possibility that different genres produce clouds of points, a problem we leave to the reader. Experiments with Tagalog are also left to the reader. Maori lies exactly on Ord’s line.

Figure 9.6: Ord’s lines for spectra in 20 languages
9.7 Repeat rate and entropy
There are different possibilities of characterizing the diversity of words in a text. In texts we speak about richness, whereas in other sciences one speaks of diversity, dispersion, uncertainty of choice, evenness, homogeneity, abundance, complexity, disorder, etc. (cf. e.g. Beisel & Usseglio-Polatera & Bachmann & Moretau 2003). In linguistics, two indices have found popularity, viz. the repeat rate and Shannon’s entropy.
The repeat rate is defined as

RR = \sum_{i=1}^{K} p_i^2   (9.26)

where p_i are the individual probabilities. If we estimate them by means of relative frequencies, then p_i = f_i/N, where f_i are the absolute frequencies and N is the sample size; in practice

RR_r = \frac{1}{N^2} \sum_{i=1}^{V} f_i^2   (9.27)

and

RR_s = \frac{1}{V^2} \sum_{i=1}^{W} g_i^2,   (9.28)

where g_i are the absolute frequencies of the spectrum. The repeat rate can be interpreted in different ways. Geometrically, if we consider p_i as a coordinate of the text in a V-dimensional space, or g_i in a W-dimensional space respectively, then RR is the squared distance of the text from the origin. Mostly it is interpreted as a measure of concentration. The more a text is concentrated in one class, e.g. class 1 in the spectrum, the closer RR is to 1. Usually the rank frequency distribution is not as concentrated as the spectrum, hence RR_r < RR_s. A full concentration of the spectrum would mean maximal vocabulary richness (each word would occur exactly the same number of times, usually once). A full concentration of the rank frequency distribution would mean that the text has no richness: one single word is repeated incessantly. If all frequencies are distributed equally, we have the opposite situation: the rank frequency distribution displays maximal vocabulary richness, the spectrum displays minimal vocabulary richness. The maximum in both variants of the repeat rate is 1, with opposite interpretations. The minimum for the rank frequency case is

RR_{r,min} = \frac{1}{N^2} \sum_{i=1}^{V} \left(\frac{N}{V}\right)^2 = \frac{1}{V}   (9.29)

and the minimum for the spectrum case is

RR_{s,min} = \frac{1}{V^2} \sum_{i=1}^{W} \left(\frac{V}{W}\right)^2 = \frac{1}{W},   (9.30)
hence RR_r ∈ (1/V, 1) and RR_s ∈ (1/W, 1). Since in word frequency distributions the values of RR are very small, one sometimes takes relative values obtained as

RR_{r,rel} = \frac{1 - RR_r}{1 - 1/V}   (9.31)

and

RR_{s,rel} = \frac{1 - RR_s}{1 - 1/W},   (9.32)

but McIntosh (1967) proposed another form, namely

RR_{r,rel} = \frac{1 - \sqrt{RR_r}}{1 - 1/\sqrt{V}}   (9.33)

and, analogously,

RR_{s,rel} = \frac{1 - \sqrt{RR_s}}{1 - 1/\sqrt{W}}.   (9.34)

Let us consider an example. For the Czech text Cz-10 we obtain for the rank frequency distribution (N = 1156, V = 638) RR_r = 0.0693. The relative repeat rate (9.31) yields

RR_{r,rel} = (1 − 0.0693)/(1 − 1/638) = 0.9322.

The relative repeat rate of McIntosh (9.33) yields

RR_{r,rel} = (1 − \sqrt{0.0693})/(1 − 1/\sqrt{638}) = 0.7368.

For the spectrum we have (V = 638, W = 19) RR_s = 0.6508, and the relative repeat rate (9.32) yields

RR_{s,rel} = (1 − 0.6508)/(1 − 1/19) = 0.3686,

while the McIntosh variant yields

RR_{s,rel} = (1 − \sqrt{0.6508})/(1 − 1/\sqrt{19}) = 0.2508.

It is a matter of convenience which formula one takes.
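The quantities used in this example can be computed as follows (a minimal sketch of (9.27), (9.28), (9.31) and (9.33); the function is our own illustration, not the script actually used):

```python
# A minimal sketch: repeat rates and relative variants from a rank-frequency
# list; the spectrum is derived from the same list.
from collections import Counter
from math import sqrt

def repeat_rates(freqs):
    """freqs: list of word frequencies f_1, ..., f_V (rank-frequency data)."""
    N, V = sum(freqs), len(freqs)
    g = Counter(freqs)                                  # spectrum: g[x] = number of words with frequency x
    rr_r = sum(f * f for f in freqs) / N ** 2           # (9.27)
    rr_s = sum(c * c for c in g.values()) / V ** 2      # (9.28)
    rel_r = (1 - rr_r) / (1 - 1 / V)                    # (9.31)
    mcintosh_r = (1 - sqrt(rr_r)) / (1 - 1 / sqrt(V))   # (9.33)
    return rr_r, rr_s, rel_r, mcintosh_r

# applied to the frequency list of Cz-10 this should return RR_r = 0.0693 and RR_s = 0.6508
```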
For comparing two repeat rates we set up an asymptotic test using the variance of the repeat rate. The simplest way to derive the variance is to use a Taylor expansion (cf. Kendall & Stuart 1969, I, 232) and set (RR* and p̂ are the empirical values)

Var(RR^*) = Var\Big(\sum_i \hat p_i^2\Big) = \sum_i \Big(\frac{\partial RR^*}{\partial \hat p_i}\Big)^2_{p_i} Var(\hat p_i) + \sum_{i \ne j} \Big(\frac{\partial RR^*}{\partial \hat p_i}\Big)_{p_i}\Big(\frac{\partial RR^*}{\partial \hat p_j}\Big)_{p_j} Cov(\hat p_i, \hat p_j).   (9.35)

Since

Var(\hat p_i) = \frac{p_i(1 - p_i)}{N}, \qquad Cov(\hat p_i, \hat p_j) = -\frac{p_i p_j}{N}   (9.36)

and

\frac{\partial RR^*}{\partial \hat p_i} = 2\hat p_i,

inserting these expressions in (9.35) we finally obtain

Var(RR^*) = \frac{4}{N}\left(\sum_i p_i^3 - RR^2\right).   (9.37)
The theoretical values are estimated by means of the observed ones. If we compute the variance of the repeat rate of the spectrum, we must replace N in (9.37) by V. Let us illustrate the use of Var(RR) by comparing the repeat rates of the spectra of texts R-01 (Eminescu, Luceafarul) and R-02 (Eminescu, Scrisoarea III). We find the values given in Table 9.22.

Table 9.22: Repeat rates of spectra of Eminescu’s texts R-01 and R-02
Text   V      RR       ∑ p_i^3   Var(RR)
R-01   843    0.5433   0.3766    0.000386
R-02   1179   0.6125   0.4604    0.000289

Inserting these values in the asymptotic normal criterion

z = \frac{RR_1 - RR_2}{\sqrt{Var(RR_1) + Var(RR_2)}}
we obtain

z = \frac{0.5433 - 0.6125}{\sqrt{0.000386 + 0.000289}} = -2.66,

i.e. the repeat rates of these two texts differ significantly; the richness in text R-02 is greater than in text R-01. In the same way one can show that the difference between the repeat rates of the rank frequency distributions of the two texts is not significant (z = −0.9). In Table 9.23 we present the repeat rates of the rank frequency distributions of all texts.
Table 9.23: Repeat rates for rank frequency distributions of texts
Text ID B-01 B-02 B-03 B-04 B-05 B-06 B-07 B-08 B-09 B-10 Cz-01 Cz-02 Cz-03 Cz-04 Cz-05 Cz-06 Cz-07 Cz-08 Cz-09 Cz-10 E-01 E-02 E-03 E-04
N
V
761 352 515 483 406 687 557 268 550 556 1044 984 2858 522 999 1612 2014 677 460 1156 2330 2971 3247 4622
400 201 285 286 238 388 324 179 313 317 638 543 1274 323 556 840 862 389 259 638 939 1017 1001 1232
RR ranks 0.0092 0.0012 0.0086 0.0092 0.0112 0.0095 0.0076 0.0105 0.0093 0.0113 0.0070 0.0078 0.0086 0.0076 0.0120 0.0101 0.0101 0.0080 0.0192 0.0069 0.0099 0.0098 0.0137 0.0139
Text ID Kn-18 Kn-19 Kn-20 Kn-21 Kn-22 Kn-23 Kn-30 Kn-31 Lk-01 Lk-02 Lk-03 Lk-04 Lt-01 Lt-02 Lt-03 Lt-04 Lt-05 Lt-06 M-01 M-02 M-03 M-04 M-05 Mq-01
N
V
RR ranks
4485 1782 0.0035 1787 833 0.0041 4556 1755 0.0038 1455 790 0.0047 4554 1794 0.0042 4685 1738 0.0036 4499 2005 0.0032 4672 1920 0.0028 345 174 0.0160 1633 479 0.0181 809 272 0.0204 219 116 0.0214 3311 2211 0.0027 4010 2334 0.0038 4931 2703 0.0019 4285 1910 0.0034 1354 909 0.0030 829 609 0.0033 2062 396 0.0209 1187 281 0.0241 1436 273 0.0252 1409 302 0.0235 3635 515 0.0182 2330 289 0.0244 (continued on next page)
Table 9.23 (continued from previous page) Text ID
N
V
E-05 E-06 E-07 E-08 E-09 E-10 E-11 E-12 E-13 G-01 G-02 G-03 G-04 G-05 G-06 G-07 G-08 G-09 G-10 G-11 G-12 G-13 G-14 G-15 G-16 G-17 H-01 H-02 H-03 H-04 H-05 Hw-01 Hw-02 Hw-03 Hw-04 Hw-05 Hw-06
4760 4862 5004 5083 5701 6246 8193 9088 11625 1095 845 500 545 559 545 263 965 653 480 468 251 460 184 593 518 225 2044 1288 403 936 413 282 1829 3507 7892 7620 12356
1495 1176 1597 985 1574 1333 1669 1825 1659 539 361 281 269 332 326 169 509 379 301 297 169 253 129 378 292 124 1079 789 291 609 290 104 257 521 744 680 1039
RR ranks 0.0103 0.0172 0.0096 0.0192 0.0102 0.0159 0.0129 0.0120 0.0119 0.0117 0.0108 0.0122 0.0123 0.0103 0.0087 0.0128 0.0077 0.0085 0.0021 0.0078 0.0125 0.0095 0.0144 0.0062 0.0074 0.0153 0.0155 0.0133 0.0188 0.0117 0.0130 0.0243 0.0206 0.0211 0.0218 0.0185 0.0193
Text ID Mq-02 Mq-03 Mr-15 Mr-16 Mr-17 Mr-18 Mr-20 Mr-21 Mr-22 Mr-23 Mr-24 Mr-26 Mr-30 Mr-31 Mr-32 Mr-33 Mr-34 Mr-40 Mr-43 R-01 R-02 R-03 R-04R-05 R-06 Rt-01 Rt-02 Rt-03 Rt-04 Rt-05 Ru-01 Ru-02 Ru-03 Ru-04 Ru-05 Sl-01 Sl-02
N
V
RR ranks
451 143 0.0288 1509 301 0.0379 4693 1947 0.0032 3642 1831 0.0024 4170 1853 0.0024 4062 1788 0.0034 3943 1725 0.0026 3846 1793 0.0022 4099 1703 0.0040 4142 1872 0.0026 4255 1731 0.0028 4146 2038 0.0025 5054 2911 0.0018 5105 2617 0.0020 5195 2382 0.0024 4339 2217 0.0019 3489 1865 0.0019 5218 2877 0.0018 3356 1962 0.0017 1738 843 0.0060 2279 1179 0.0066 1264 719 0.0065 1284 729 0.0055 1032 567 0.0070 695 432 0.0072 968 223 0.0338 845 214 0.0256 892 207 0.0216 625 181 0.0249 1059 197 0.0202 2595 1240 0.0069 17205 6073 0.0049 3853 1792 0.0050 753 422 0.0079 6025 2536 0.0044 756 457 0.0088 1371 603 0.0078 (continued on next page)
Table 9.23 (continued from previous page) Text ID
N
V
I-01 I-02 I-03 I-04 I-05 In-01 In-02 In-03 In-04 In-05 Kn-01 Kn-02
11760 6064 854 3258 1129 376 373 347 343 414 3713 4508
3667 2203 483 1237 512 221 209 194 213 188 1664 1738
RR ranks 0.0055 0.0068 0.0106 0.0069 0.0084 0.0101 0.0108 0.0100 0.0077 0.0115 0.0042 0.0032
Text ID
N
V
Sl-03 Sl-04 Sl-05 Sm-01 Sm-02 Sm-03 Sm-04 Sm-05 T-01 T-02 T-03
1966 3491 5588 1487 1171 617 736 447 1551 1827 2054
907 1102 2223 266 219 140 153 124 611 720 645
RR ranks 0.0086 0.0169 0.0054 0.0309 0.0273 0.0282 0.0340 0.0299 0.0165 0.0167 0.0180
Table 9.24 presents the repeat rates of all spectra. Table 9.24: Repeat rates for spectra of all texts Text ID B-01 B-02 B-03 B-04 B-05 B-06 B-07 B-08 B-09 B-10 Cz-01 Cz-02 Cz-03 Cz-04 Cz-05 Cz-06 Cz-07
N
V
761 352 515 983 406 687 557 268 550 556 1044 984 2858 522 999 1612 2014
400 201 285 286 238 388 324 179 313 317 638 543 1274 323 556 840 862
RR spectra 0.5758 0.6018 0.5760 0.6221 0.6367 0.6441 0.6014 0.6670 0.6137 0.6101 0.6703 0.5971 0.5919 0.5828 0.6549 0.6349 0.5469
Text ID Kn-18 Kn-19 Kn-20 Kn-21 Kn-22 Kn-23 Kn-30 Kn-31 Lk-01 Lk-02 Lk-03 Lk-04 Lt-01 Lt-02 Lt-03 Lt-04 Lt-05
N
V
RR spectra
4483 1782 0.4551 1787 833 0.4639 4556 1755 0.4424 1455 790 0.5547 4554 1764 0.4731 4685 1738 0.4536 4499 2005 0.4916 4672 1920 0.4944 345 174 0.5564 1633 479 0.4358 809 272 0.4426 219 116 0.5158 3311 2211 0.6259 4010 2334 0.6608 4931 2703 0.5937 4285 1910 0.5287 1354 909 0.6727 (continued on next page)
Table 9.24 (continued from previous page) Text ID Cz-08 Cz-09 Cz-10 E-01 E-02 E-03 E-04 E-05 E-06 E-07 E-08 E-09 E-10 E-11 E-12 E-13 G-01 G-02 G-03 G-04 G-05 G-06 G-07 G-08 G-09 G-10 G-11 G-12 G-13 G-14 G-15 G-16 G-17 H-01 H-02 H-03 H-04
N
V
677 460 1156 2330 2971 3247 4622 4760 4862 5004 5083 5701 6246 8139 9088 11265 1095 845 500 545 559 545 263 965 653 480 468 251 460 184 593 518 225 2044 1288 403 936
389 259 638 939 1017 1001 1232 1495 1176 1597 985 1574 1333 1669 1825 1659 530 361 281 269 332 326 169 509 379 301 297 169 253 129 378 292 124 1079 789 291 609
RR spectra 0.5671 0.5553 0.6508 0.5243 0.5418 0.4224 0.3603 0.4529 0.3418 0.4814 0.2824 0.4386 0.3181 0.3176 0.3735 0.2450 0.5787 0.4166 0.6301 0.5058 0.6330 0.6029 0.6322 0.5786 0.6491 0.6398 0.6302 0.7121 0.5224 0.7084 0.6621 0.5646 0.4974 0.6279 0.6698 0.8025 0.7129
Text ID Lt-06 M-01 M-02 M-03 M-04 M-05 Mq-01 Mq-02 Mq-03 Mr-15 Mr-16 Mr-17 Mr-18 Mr-20 Mr-21 Mr-22 Mr-23 Mr-24 Mr-26 Mr-30 Mr-31 Mr-32 Mr-33 Mr-34 Mr-40 Mr-43 R-01 R-02 R-03 R-04 R-05 R-06 Rt-01 Rt-02 Rt-03 Rt-04 Rt-05
N
V
RR spectra
829 609 0.7417 2062 396 0.2916 1187 281 0.3198 1436 273 0.2680 1409 303 0.3483 3635 515 0.2618 2330 289 0.1592 451 143 0.3686 1509 301 0.2589 4693 1947 0.4871 3642 1831 0.5459 4170 1853 0.4761 4062 1788 0.5111 3943 1725 0.4804 3846 1793 0.4937 4099 1703 0.5124 4142 1872 0.4964 4255 1731 0.4396 4146 2038 0.5526 5054 2911 0.5912 5105 2617 0.5674 5195 2382 0.5445 4339 2217 0.5359 3489 1865 0.5616 5218 2877 0.6072 3356 1962 0.6085 1738 843 0.5433 2279 1179 0.6125 1264 719 0.6383 1284 729 0.6332 1032 567 0.5856 695 432 0.6827 968 223 0.3538 845 214 0.3869 892 207 0.2707 625 181 0.3580 1059 197 0.1974 (continued on next page)
Table 9.24 (continued from previous page) Text ID
N
V
H-05 Hw-01 Hw-02 Hw-03 Hw-04 Hw-05 Hw-06 I-01 I-02 I-03 I-04 I-05 In-01 In-02 In-03 In-04 In-05 Kn-01 Kn-02
413 282 1829 3507 7892 7620 12356 11760 6064 854 3258 1129 376 373 347 343 414 3713 4508
290 104 257 521 744 680 1039 3667 2203 483 1237 512 221 209 194 213 188 1664 1738
RR spectra 0.7571 0.3428 0.3996 0.2752 0.1885 0.2338 0.2630 0.4935 0.5467 0.6415 0.4963 0.5093 0.5890 0.5266 0.4892 0.5080 0.4449 0.4995 0.4425
Text ID
N
V
Ru-01 Ru-02 Ru-03 Ru-04 Ru-05 Sl-01 Sl-02 Sl-03 Sl-04 Sl-05 Sm-01 Sm-02 Sm-03 Sm-04 Sm-05 T-01 T-02 T-03
2595 17205 3853 753 6025 756 1371 1966 3491 5588 1487 1171 617 736 447 1551 1827 2054
1240 6073 1792 422 2536 457 603 907 1102 2223 266 219 140 153 124 611 720 645
RR spectra 0.5992 0.5435 0.5975 0.5835 0.5522 0.6504 0.5182 0.5400 0.4397 0.5355 0.2565 0.2318 0.3210 0.2875 0.3398 0.5964 0.5868 0.5575
There are many definitions of entropy (cf. e.g. Esteban & Morales 1995), and in every science it means something different. Usually one uses it as a measure of diversity, uncertainty etc. All versions can be brought to a common function of which the individual formulas are special cases. Here we shall use Shannon’s entropy, defined as

H = -\sum_{i=1}^{K} p_i \,\mathrm{ld}\, p_i   (9.38)

where p_i are the individual probabilities (estimated by relative frequencies), K is the inventory size (V with rank frequency and W with spectrum) and ld is the logarithm to base 2. Frequently one finds the natural logarithm ln, which can be converted using ld x = ln x / ln 2. In word frequency studies one considers H as a measure of diversity, inhomogeneity, uncertainty etc. In rank frequency cases, the smaller H, the more the vocabulary is concentrated in a few words, resulting in small vocabulary richness; the greater it is, the more even is the distribution of words (and the more of them occur only once). If
174
Distribution models
all frequencies are concentrated in one word, then 1
H = − ∑ 1 ld 1 = 0 i=1
. If all words have the same frequency, viz. 1/V , then we have V
1 1 ld = ldV V i=1 V
H = −∑ .
Hence we get for rank frequencies H ∈ (0, ld V) and for the spectrum H ∈ (0, ld W). Frequently, especially when comparing samples with very different inventory sizes, one uses the relative entropy H_{r,rel} = H/ld V for rank frequencies and H_{s,rel} = H/ld W for the spectrum. For the spectrum the interpretation is the other way round: the smaller the entropy, the greater the richness (because most words occur only once), and the greater H, the smaller the richness. The computation of entropies for the rank frequency distributions is presented in Table 9.25, for the spectra in Table 9.26 (pp. 177ff.). Table 9.25: Entropy for rank frequency distributions Text ID B-01 B-02 B-03 B-04 B-05 B-06 B-07 B-08 B-09 B-10 Cz-01
N
V
H ranks
Text ID
761 352 515 483 406 687 557 268 550 556 1044
400 201 285 286 238 388 324 179 313 317 638
7.8973 7.0994 7.5827 7.5980 7.3055 7.8501 7.7944 7.1070 7.6576 7.6055 8.6163
Kn-18 Kn-19 Kn-20 Kn-21 Kn-22 Kn-23 Kn-30 Kn-31 Lk-01 Lk-02 Lk-03
N
V
H ranks
4485 1782 9.7515 1787 833 8.9712 4556 1755 9.6909 1455 790 8.938 4554 1794 9.6289 4685 1738 9.6444 4499 2005 10.0072 4672 1920 9.8862 345 174 6.7685 1633 479 7.3035 809 272 6.8508 (continued on next page)
Table 9.25 (continued from previous page) Text ID Cz-02 Cz-03 Cz-04 Cz-05 Cz-06 Cz-07 Cz-08 Cz-09 Cz-10 E-01 E-02 E-03 E-04 E-05 E-06 E-07 E-08 E-09 E-10 E-11 E-12 E-13 G-01 G-02 G-03 G-04 G-05 G-06 G-07 G-08 G-09 G-10 G-11 G-12 G-13 G-14 G-15
N
V
H ranks
Text ID
984 2858 522 999 1612 2014 677 460 1156 2330 2971 3247 4622 4760 4862 5004 5083 5701 6246 8193 9088 11625 1095 845 500 545 559 545 263 965 653 480 468 251 460 184 593
543 1274 323 556 840 862 389 259 638 939 1017 1001 1232 1495 1176 1597 985 1574 1333 1669 1825 1659 539 361 281 269 332 326 169 509 379 301 297 169 253 129 378
8.3282 8.9529 7.8770 8.1959 8.6111 8.4876 7.9987 7.4120 8.4876 8.5197 8.3972 8.2471 8.4634 8.7676 8.2191 8.8057 7.9010 8.6865 8.3391 8.5906 8.5717 8.4674 8.0326 7.7006 7.4369 7.3530 7.7183 7.7918 6.9781 8.2157 7.9035 7.7245 7.7563 6.9814 7.4490 6.6629 8.0810
Lk-04 Lt-01 Lt-02 Lt-03 Lt-04 Lt-05 Lt-06 M-01 M-02 M-03 M-04 M-05 Mq-01 Mq-02 Mq-03 Mr-15 Mr-16 Mr-17 Mr-18 Mr-20 Mr-21 Mr-22 Mr-23 Mr-24 Mr-26 Mr-30 Mr-31 Mr-32 Mr-33 Mr-34 Mr-40 Mr-43 R-01 R-02 R-03 R-04 R-05
N
V
H ranks
219 116 6.2882 3311 2211 10.5032 4010 2334 10.2814 4931 2703 10.5934 4285 1910 9.8252 1354 909 9.3625 829 609 8.4581 2062 396 6.9856 1187 281 6.7198 1436 273 6.5851 1409 302 6.6909 3635 515 7.1346 2330 289 6.6095 451 143 6.1063 1509 301 6.5012 4693 1947 9.8764 3642 1831 10.012 4170 1853 9.9799 4062 1788 9.7898 3943 1725 9.8472 3846 1793 9.9948 4099 1703 9.6097 4142 1872 9.9538 4255 1731 9.8062 4146 2038 10.0913 5054 2911 10.6433 5105 2617 10.4632 5195 2382 10.1882 4339 2217 10.3521 3489 1865 10.1542 5218 2877 10.6589 3356 1962 10.2964 1738 843 8.7903 2279 1179 9.1346 1264 719 8.7035 1284 729 8.7736 1032 567 8.3954 (continued on next page)
Table 9.25 (continued from previous page) Text ID
N
V
H ranks
Text ID
N
V
H ranks
G-16 G-17 H-01 H-02 H-03 H-04 H-05 Hw-01 Hw-02 Hw-03 Hw-04 Hw-05 Hw-06 I-01 I-02 I-03 I-04 I-05 In-01 In-02 In-03 In-04 In-05 Kn-01 Kn-02
518 225 2044 1288 403 936 413 282 1829 3507 7892 7620 12356 11760 6064 854 3258 1129 376 373 347 343 414 3713 4508
292 124 1079 789 291 609 290 104 257 521 744 680 1039 3667 2203 483 1237 512 221 209 194 213 188 1664 1738
7.6923 6.5269 8.8380 8.6954 7.5293 8.4426 7.6043 6.0083 6.5548 7.0628 6.5388 7.0618 7.2720 9.8671 9.4130 8.1008 8.9123 8.0893 7.2975 7.2140 7.1780 7.4299 6.9893 9.7114 9.7285
R-06 Rt-01 Rt-02 Rt-03 Rt-04 Rt-05 Ru-01 Ru-02 Ru-03 Ru-04 Ru-05 Sl-01 Sl-02 Sl-03 Sl-04 Sl-05 Sm-01 Sm-02 Sm-03 Sm-04 Sm-05 T-01 T-02 T-03
695 968 845 892 625 1059 2595 17205 3853 753 6025 756 1371 1966 3491 5588 1487 1171 617 736 447 1551 1827 2054
432 223 214 207 181 197 1240 6073 1792 422 2536 457 603 907 1102 2223 266 219 140 153 124 611 720 645
8.1436 6.2661 6.3747 6.542 6.3644 6.5085 9.1104 10.5714 9.5531 8.0561 9.9181 8.1613 8.2723 8.7048 8.2855 9.6509 6.3481 6.3632 5.9515 5.9275 5.8972 7.6919 7.8474 7.5103
Table 9.26: Entropy for text spectra Text ID B-01 B-02 B-03 B-04 B-05 B-06 B-07 B-08 B-09 B10 Cz-01 Cz-02 Cz-03 Cz-04 Cz-05 Cz-06 Cz-07 Cz-08 Cz-09 Cz-10 E-01 E-02 E-03 E-04 E-05 E-06 E-07 E-08 E-09 E-10 E-11 E-12 E-13 G-01
N
V
761 352 515 983 406 687 557 268 550 556 1044 984 2858 522 999 1612 2014 677 460 1156 2330 2971 3247 4622 4760 4862 5004 5083 5701 6246 8139 9088 11265 1095
400 201 285 286 238 388 324 179 313 317 638 543 1274 323 556 840 862 389 259 638 939 1017 1001 1232 1495 1176 1597 985 1574 1333 1669 1825 1659 530
H spectra 1.4725 1.3763 1.4807 1.3051 1.2729 1.2496 1.3810 1.1324 1.3081 1.3003 1.1517 1.3683 1.4559 1.3298 1.2232 1.2712 1.6084 1.3897 1.4539 1.2578 1.6853 1.7416 2.0406 2.3397 1.9698 2.4092 1.8753 2.7468 2.0699 2.5950 2.6323 2.4596 3.0631 1.4542
Text ID Kn-18 Kn-19 Kn-20 Kn-21 Kn-22 Kn-23 Kn-30 Kn-31 Lk-01 Lk-02 Lk-03 Lk-04 Lt-01 Lt-02 Lt-03 Lt-04 Lt-05 Lt-06 M-01 M-02 M-03 M-04 M-05 Mq-01 Mq-02 Mq-03 Mr-15 Mr-16 Mr-17 Mr-18 Mr-20 Mr-21 Mr-22 Mr-23
N
V
H spectra
4483 1782 1.9559 1787 833 1.8273 4556 1755 2.0101 1455 790 1.4948 4554 1764 1.9230 4685 1738 2.0266 4499 2005 1.7912 4672 1920 1.8661 345 174 1.5426 1633 479 2.0506 809 272 2.0157 219 116 1.5920 3311 2211 1.0661 4010 2334 1.1888 4931 2703 1.4039 4285 1910 1.6817 1354 909 1.0968 829 609 0.8842 2062 396 2.8033 1187 281 2.5620 1436 273 2.8554 1409 302 2.5675 3635 515 3.0597 2330 289 3.5209 451 143 2.2475 1509 301 2.7786 4693 1947 1.8592 3642 1831 1.5810 4170 1853 1.8370 4062 1788 1.7482 3943 1725 1.8362 3846 1793 1.7674 4099 1703 1.7798 4142 1872 1.7656 (continued on next page)
Table 9.26 (continued from previous page) Text ID
N
V
G-02 G-03 G-04 G-05 G-06 G-07 G-08 G-09 G-10 G-11 G-12 G-13 G-14 G-15 G-16 G-17 H-01 H-02 H-03 H-04 H-05 Hw-01 Hw-02 Hw-03 Hw-04 Hw-05 Hw-06 I-01 I-02 I-03 I-04 I-05 In-01 In-02 In-03 In-04 In-05
845 500 545 559 545 263 965 653 468 480 251 460 184 593 518 225 2044 1288 403 936 413 282 1829 3507 7892 7620 12356 11760 6064 854 3258 1129 376 373 347 343 414
361 281 269 332 326 169 509 379 297 301 169 253 129 378 292 124 1079 789 291 609 290 104 257 521 744 680 1039 3667 2203 483 1237 512 221 209 194 213 188
H spectra 1.9695 1.2979 1.6561 1.2309 0.6824 1.2203 1.4472 1.2539 1.2160 1.2164 1.0321 1.0787 1.0092 1.1787 1.4983 1.6234 1.2741 1.0801 0.6781 0.9396 0.8167 2.2560 2.1569 2.9744 3.3509 3.3693 3.2598 1.8826 1.7029 1.2557 1.8032 1.0179 1.3666 1.5313 1.6310 1.4629 1.9195
Text ID Mr-24 Mr-26 Mr-30 Mr-31 Mr-32 Mr-33 Mr-34 Mr-40 Mr-43 R-01 R-02 R-03 R-04 R-05 R-06 Rt-01 Rt-02 Rt-03 Rt-04 Rt-05 Ru-01 Ru-02 Ru-03 Ru-04 Ru-05 Sl-01 Sl-02 Sl-03 Sl-04 Sl-05 Sm-01 Sm-02 Sm-03 Sm-04 Sm-05 T-01 T-02
N
V
H spectra
4255 1731 1.9831 4146 2038 1.5766 5054 2911 1.4314 5105 2617 1.4988 5195 2382 1.6302 4339 2217 1.6015 3489 1865 1.5108 5218 2877 1.3578 3356 1962 1.3404 1738 843 1.5685 2279 1179 1.3330 1264 719 1.2613 1284 729 1.2975 1032 567 1.3811 695 432 1.1099 968 223 2.4921 845 214 2.3785 892 207 2.7430 625 181 2.3568 1059 197 3.1981 2595 1240 1.4394 17205 6073 1.6859 3853 1792 1.4363 753 422 1.3991 6025 2536 1.6118 756 457 1.1913 1371 603 1.7276 1966 907 1.5871 3491 1102 1.9883 5588 2223 1.6775 1487 266 2.8220 1171 219 2.9834 617 140 2.6250 736 153 2.7400 447 124 2.3792 1551 611 1.4762 1827 720 1.4742 (continued on next page)
Table 9.26 (continued from previous page) Text ID
N
V
Kn-01 Kn-02
3713 4508
1664 1738
H spectra 1.7621 2.0255
Text ID T-03
N
V
2054
645
H spectra 1.7660
In phonology it has been ascertained that repeat rate and entropy strongly depend on the inventory size of the phonemes (cf. Altmann & Lehfeldt 1980; Zörnig & Altmann 1983, 1984). Do they depend on the vocabulary V or on the number of non-zero classes W in the spectrum? Or do they depend on text length N? These questions cannot be answered definitively but we check here some of the possibilities. Consider first Figure 9.7 representing the (N, RR)-relation of rank frequencies for which no computation is necessary. There is no dependence of RR on N, the values of RR are irregularly dispersed and increasing the number of texts would result in a cloud.
Figure 9.7: The relation of RR for rank frequency to text length N
Hence we can conclude that N has no influence on the building of repeat rate. This can easily be explained by the fact that the proportions of words or word frequency classes remain constant with increasing N. The same holds for the relationship between text length N and RR for spectra (cf. Figure 9.8).
Figure 9.8: The relation of RR for spectrum to text length N
Consider now the relationship of RR for rank frequency to the vocabulary V as presented in Figure 9.9.
Figure 9.9: The relationship of RR for rank frequency to V
One can see a strong dependence: the greater the vocabulary, the smaller the repeat rate because the frequencies must be distributed over more words. However, the dispersion is so large that there is no simple curve capturing the course of the values. Another solution would be more advantageous: if one joins the extreme points (above and on the right side) of the field with an arc, one can argue that no RR will surpass this arc, hence RR rather fills an area than follows a curve. Of course, a much greater number of texts is necessary in order to venture such a statement. The relationship of RR for spectra to V is quite different (cf. Figure 9.10). Here we have rather a kind of a funnel but no clear dependence. It seems that the greater V , the closer RR to the middle of the field. It can of course be tested whether there is a dependence for individual authors or languages. Perhaps this area could also be investigated more conclusively if one had thousands of texts.
Figure 9.10: The relation of RR to V for spectra
Let us turn to the entropies and consider first the possible relationship H = f(N) for rank frequencies, as shown in Figure 9.11. Though in short texts (up to N = 2000) the dependence is very pronounced, the dispersion in the domain 4000 ± 1000 is so wide that it would be simpler to use several curves, if any. The trend does not seem to indicate convergence, but the number of data points is too small for more exact statements. Another possibility
Figure 9.11: The relations of entropy to N for rank frequencies
would be to separate individual languages, genres, styles, etc., examine the dependence for each separately, and hunt for boundary conditions. This is a large field for extensive research. We rather believe that N does not affect the entropy of rank frequencies. As to the relation of the entropy of spectra to N, here, too, no dependence is visible, as shown in Figure 9.12. Perhaps languages and
Figure 9.12: Relation of H to N for spectra
genres, even historical epochs, must be separated and the individual straight lines must be found if there are any. But we do not believe in a dependence of H on N either in rank frequency or in spectrum distributions. Nevertheless, even negative existential hypotheses should be studied. Let us consider the relation of H to V . Since V is, as a matter of fact, the inventory, some relationship should be found. We consider first the relation of Hr to V as shown in Figure 9.13.
Figure 9.13: Dependence of Hr on V (for ranks)
The dependence of H on V for ranks is much more pronounced, as expected. There are some outliers, perhaps belonging to one language, but the trend can be captured using H_r = aV^b. As a matter of fact, all texts together yield H_r = 3.0781 V^{0.1509} with R² = 0.85. Eliminating all English texts, we obtain a still better fit. Hence the English texts (the special sort of Nobel lectures) have a relatively smaller (rank frequency) entropy than the other texts. Thus it is possible that genres follow the same type of curve but with different parameters. As to the dependence of H_s on V (for spectra), we present the points in Figure 9.14. Evidently there is no dependence. The fact that rank frequencies yield a kind of relationship with the inventory shows the importance of this aspect. Since there is a clear dependence of H on V for rank frequencies, tests for the difference between two H_r's must take into account at least this
Figure 9.14: The relation of Hs to V for spectra
dependence and cannot consider the expectations equal. That is, the test must take the expectation explicitly into account:

z = \frac{(H_{r,1} - aV_1^b) - (H_{r,2} - aV_2^b)}{\sqrt{Var(H_{r,1}) + Var(H_{r,2})}}    (9.39)
where a and b must be determined from many more texts than we have here. In the spectra, there is no conspicuous dependence of H on V. In general, the problem of these dependencies is not yet solved and must be followed up both empirically and theoretically. The variance of the entropy can be approximated as follows. Let N be the general sample size; then we obtain by means of a Taylor expansion

Var(\hat{H}) = Var\Bigl(-\sum_i \hat{p}_i \,\mathrm{ld}\, \hat{p}_i\Bigr)
            = Var\Bigl(-\sum_i \hat{p}_i \frac{\ln \hat{p}_i}{\ln 2}\Bigr)
            = \frac{1}{\ln^2 2}\Bigl[\sum_i (1 + \ln p_i)^2\, Var(\hat{p}_i) + \sum_{i \neq j} (1 + \ln p_i)(1 + \ln p_j)\, Cov(\hat{p}_i, \hat{p}_j)\Bigr]
Inserting the above definitions (9.36) of Var(\hat{p}) and Cov(\hat{p}_i, \hat{p}_j), completing the second sum over all i and j and subtracting the added element from the overall sum, we finally obtain

Var(\hat{H}) = \frac{1}{N}\Bigl(\sum_i p_i \,\mathrm{ld}^2 p_i - H^2\Bigr).    (9.40)

For spectra the general sample size N must be replaced by V. As an illustration, let us take two short text spectra, viz. R-05 (Eminescu, Letter V) and E-03 (Marshall, Nobel Lecture). In Table 9.27 we present all necessary computations. The spectrum of R-05 has V = 567, the spectrum of E-03 has V = 1001; the other numbers are given in Table 9.27. From the last rows of Table 9.27 we can compute

Var(H_{s,R-05}) = (5.7890 - 1.3812^2)/567 = 0.0068
Var(H_{s,E-03}) = (8.9073 - 2.0406^2)/1001 = 0.0047

and the z-criterion yields as usual
z = \frac{1.3812 - 2.0406}{\sqrt{0.0068 + 0.0047}} = -6.13

indicating that the spectrum entropy in E-03 is significantly greater than in R-05. We left out the possible dependence of H_s on N since the form of this dependence is not yet clear. It is to be remarked that repeat rate and entropy are only special cases or variants of various other measures (cf. e.g. Esteban & Morales 1995).
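Such comparisons are easy to automate. The following Python sketch is ours and merely illustrates (9.38), (9.40) and the z-criterion; the function names are assumptions chosen for illustration, and the spectra are the g(x) values of R-05 and E-03 from Table 9.27.

```python
import math

def entropy_and_variance(class_sizes):
    """Shannon entropy H in bits, cf. (9.38), and its approximate variance,
    cf. (9.40). class_sizes: list of frequencies; their sum plays the role
    of the sample size (N for rank frequencies, V for spectra)."""
    total = sum(class_sizes)
    probs = [g / total for g in class_sizes]
    H = -sum(p * math.log2(p) for p in probs)
    second_moment = sum(p * math.log2(p) ** 2 for p in probs)  # sum p * ld^2 p
    return H, (second_moment - H ** 2) / total

def z_difference(spectrum_1, spectrum_2):
    """z-criterion for the difference between two spectrum entropies."""
    h1, v1 = entropy_and_variance(spectrum_1)
    h2, v2 = entropy_and_variance(spectrum_2)
    return (h1 - h2) / math.sqrt(v1 + v2)

# g(x) values of the spectra of R-05 and E-03 (Table 9.27)
r05 = [425, 84, 20, 11, 5, 3, 1, 3, 4, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1]
e03 = [621, 173, 76, 31, 20, 15, 11, 6, 4, 6, 3, 3, 1, 2, 3, 2, 5, 1, 1, 1,
       2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1]

print(z_difference(r05, e03))   # approximately -6.1, as computed above
```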
9.8
Word classes
In Sections 9.1–9.5 we considered all words of the text and found some text characteristics. The adequacy of the Zeta (or the truncated Zeta) distribution for the rank frequency distribution was a common characteristic of all texts, even though we sometimes selected some other special case of the general theory because of its smaller Chi-square value. However, the Zeta distribution always gave P ≈ 1. If we now consider the words of the text belonging to different classes, e.g. part-of-speech classes, we can scrutinize the distribution of these classes separately. Of course, everything depends on the way we divide the words
into classes. A part of speech is nothing categorical – though many linguists stick to this opinion – it is rather a fuzzy set composed of text elements. Some texts or even languages prefer the so-called nominal style, other ones the verbal style, etc. We restrict ourselves to the study of frequency spectra distributions of particular word classes. We start from the assumption that all abide by the same law and each frequency of the spectrum is composed of the respective proportions of all classes. In order to test this hypothesis, it is sufficient to reconstruct the distribution of individual classes ascribing them the original frequency and supposing that all abide by the same regularity. This means that from the class of words occurring once we separate all nouns, verbs, adjectives etc.; then we do the same with the class of words occurring twice, etc., in order to obtain a separate spectrum for all word classes. Taking for example E. Rutherford’s Nobel Lecture we obtain the spectrum for parts of speech as presented in Table 9.28. The majority of distributions used for frequency spectra can be fitted to these data. Usually, the smaller the sample, the more distributions are adequate but the adequacy is not quite persuasive. In order to avoid this multiplicity, we restrict ourselves to the original Zeta distribution fulfilling all requirements: it is part of the general theory; with only one parameter it is very economical; it is adequate in all cases. We shall perform two tests which are equivalent: we fit both the Zeta distribution and the Zeta function to the data in Table 9.28. By doing this we do not go against the etiquette in science; we only approximate the truth in two ways and give the opportunity of continuing this research to those who have only one kind of fitting software. Let us first fit the Zeta distribution. As can be seen, some frequencies are not shown because they are zero. But the probability must be computed also for those cases, because the sum of probabilities must be 1. Our software performs this automatically. The results of fitting the right truncated zeta distribution to the word classes are presented in Tables 9.29 to 9.32. First, Tables 9.29 and 9.30 present the results for nouns and verbs.
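The fitting procedure itself is not spelled out here, so the following Python sketch shows only one possible way to compute the expected values NP_x of the right truncated Zeta distribution and to choose the parameter a by minimizing the chi-square statistic. The grid search and the absence of class pooling are our simplifying assumptions, so the resulting X² need not coincide exactly with the values reported in Tables 9.29–9.32.

```python
import numpy as np

def expected_zeta(a, R, n):
    """Expected class sizes n*P_x of the right truncated Zeta distribution
    P_x = x^(-a) / sum_{j=1..R} j^(-a), x = 1, ..., R."""
    x = np.arange(1, R + 1)
    w = x ** (-a)
    return n * w / w.sum()

def chi_square(observed, a):
    """X^2 for an observed spectrum g(1), ..., g(R), zero classes included."""
    expected = expected_zeta(a, len(observed), observed.sum())
    return float(((observed - expected) ** 2 / expected).sum())

def estimate_a(observed, grid=np.arange(1.01, 3.0, 0.0001)):
    """Crude grid search for the parameter a minimizing X^2."""
    return float(min(grid, key=lambda a: chi_square(observed, a)))

# Spectrum of adverbs in E-08 (Table 9.28); classes 9 and 10 are empty
adverbs = np.zeros(12)
for x, g in [(1, 38), (2, 28), (3, 11), (4, 4), (5, 3), (6, 1),
             (7, 3), (8, 2), (11, 1), (12, 1)]:
    adverbs[x - 1] = g

a_hat = estimate_a(adverbs)
print(round(a_hat, 4), round(chi_square(adverbs, a_hat), 2))
```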
Table 9.27: Computation for the test of difference between two spectrum entropies x
M . Eminescu, R-05 g(x) −pld p pld 2 p
1 2 3 4 5 6 7 8 10 11 12 13 15 16 17 19 26 27 46
425 84 20 11 5 3 1 3 4 1 1 1 1 1 1 2 1 1 1
∑
0.3117 0.4081 0.1702 0.1103 0.0602 0.0400 0.0161 0.0400 0.0504 0.0161 0.0161 0.0161 0.0161 0.0161 0.0161 0.0287 0.0161 0.0161 0.0161
0.1296 1.1244 0.8212 0.6276 0.4108 0.3026 0.1476 0.3026 0.3604 0.1476 0.1476 0.1476 0.1476 0.1476 0.1476 0.2341 0.1476 0.1476 0.1476
1.3811
5.7890
x 1 2 3 4 5 6 7 8 9 10 12 13 14 15 17 18 19 20 21 25 27 33 34 35 42 46 72 77 81 98 106 175 229
G.C. Marshall, E-03 g(x) −pld p
pld 2 p
621 173 76 31 20 15 11 6 4 6 3 3 1 2 3 2 5 1 1 1 2 2 1 2 1 1 1 1 1 1 1 1 1
0.4273 0.4377 0.2824 0.1552 0.1128 0.0908 0.0715 0.0442 0.0318 0.0442 0.0251 0.0251 0.0100 0.0179 0.0251 0.0179 0.0382 0.0100 0.0100 0.0100 0.0179 0.0179 0.0100 0.0179 0.0100 0.0100 0.0100 0.0100 0.0100 0.0100 0.0100 0.0100 0.0100
0.2943 1.1085 1.0507 0.7783 0.6368 0.5504 0.4654 0.3267 0.2537 0.3267 0.2106 0.2106 0.0992 0.1607 0.2106 0.1607 0.2920 0.0992 0.0992 0.0992 0.1607 0.1607 0.0992 0.1607 0.0992 0.0992 0.0992 0.0992 0.0992 0.0992 0.0992 0.0992 0.0992
2.0406
8.9073
Table 9.28: Frequency spectra of word classes from E-08 (from Popescu, Best & Altmann 2007)

Nouns (x: g(x)):      1: 205, 2: 85, 3: 31, 4: 21, 5: 16, 6: 13, 7: 7, 8: 8, 9: 6, 10: 1, 11: 4, 12: 4, 13: 2, 14: 2, 15: 1, 19: 2, 28: 1, 38: 1, 44: 1, 48: 1, 51: 1, 60: 1
Verbs (x: g(x)):      1: 114, 2: 40, 3: 28, 4: 9, 5: 7, 6: 3, 7: 4, 8: 4, 9: 1, 10: 3, 12: 3, 13: 1, 14: 1, 15: 3, 17: 1, 40: 1, 42: 1, 51: 1, 85: 1
Adjectives (x: g(x)): 1: 105, 2: 30, 3: 13, 4: 7, 5: 11, 6: 2, 7: 2, 8: 4, 9: 1, 12: 1, 16: 1, 20: 1, 22: 1
Adverbs (x: g(x)):    1: 38, 2: 28, 3: 11, 4: 4, 5: 3, 6: 1, 7: 3, 8: 2, 11: 1, 12: 1
Table 9.29: Fitting the Zeta distribution to the spectrum of nouns in E-08 x
g(x)
NPx
x
g(x)
NPx
x
g(x)
NPx
x
g(x)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
205 85 31 21 16 13 7 8 6 1 4 4 2 2 1
218.55 64.76 31.79 19.19 12.97 9.42 7.19 5.69 4.62 3.84 3.25 2.79 2.43 2.13 1.89
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
0 0 0 2 0 0 0 0 0 0 0 0 1 0 0
1.68 1.51 1.37 1.25 1.14 1.05 0.96 0.89 0.83 0.77 0.72 0.67 0.63 0.59 0.56
31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
0.53 0.50 0.47 0.45 0.43 0.41 0.39 0.37 0.35 0.34 0.32 0.31 0.30 0.29 0.27
46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
0 0 1 0 0 1 0 0 0 0 0 0 0 0 1
NPx 0.26 0.25 0.25 0.24 0.23 0.22 0.21 0.21 0.20 0.19 0.19 0.18 0.18 0.17 0.17
a = 1.7548; R = 60; DF = 30; X 2 = 30.01; P = 0.47 Table 9.30: Fitting the Zeta distribution to the spectrum of verbs in E-08 x
g(x)
NPx
x
g(x)
NPx
x
g(x)
NPx
x
g(x)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
114 40 28 9 7 3 4 4 1 3 0 3 1 1 3 0 1 0 0 0 0 0
120.22 35.13 17.10 10.26 6.91 5.00 3.80 3.00 2.43 2.02 1.70 1.46 1.27 1.11 0.98 0.88 0.79 0.71 0.65 0.59 0.54 0.50
23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0
0.46 0.43 0.4 0.37 0.35 0.32 0.30 0.29 0.27 0.26 0.24 0.23 0.22 0.21 0.20 0.19 0.18 0.17 0.16 0.16 0.15 0.15
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0.14 0.13 0.13 0.12 0.12 0.12 0.11 0.11 0.10 0.10 0.10 0.09 0.09 0.09 0.09 0.08 0.08 0.08 0.08 0.07 0.07 0.07
67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
a = 1.7750; R = 85; DF = 23; X 2 = 23.35; P = 0.44
NPx 0.07 0.07 0.07 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
Next, Tables 9.31 and 9.32 present the results for adjectives and adverbs.

Table 9.31: Fitting the Zeta distribution to the spectrum of adjectives in E-08

x    g(x)   NPx
1    105    106.07
2     30     28.54
3     13     13.24
4      7      7.68
5     11      5.03
6      2      3.56
7      2      2.66
8      4      2.07
9      1      1.65
10     0      1.35
11     0      1.13
12     1      0.96
13     0      0.82
14     0      0.72
15     0      0.63
16     1      0.56
17     0      0.50
18     0      0.44
19     0      0.40
20     1      0.36
21     0      0.33
22     1      0.30

a = 1.8942; R = 22; DF = 12; X² = 14.74; P = 0.26
Table 9.32: Fitting the Zeta distribution to the spectrum of adverbs in E-08

x    g(x)   NPx
1     38    44.72
2     28    15.89
3     11     8.68
4      4     5.65
5      3     4.05
6      1     3.08
7      3     2.45
8      2     2.01
9      0     1.68
10     0     1.44
11     1     1.25
12     1     1.10

a = 1.4924; R = 12; DF = 9; X² = 16.32; P = 0.06
In Figure 9.15 the fitting of the right truncated Zeta distribution to the spectrum of nouns is shown. As can be seen, the computed and the observed values are almost identical.
Figure 9.15: Fitting the right truncated Zeta distribution to the spectrum of nouns in E-08
If we use the Zeta function, we have the advantage of ignoring all classes with zero frequency, and any regular function-fitting software is sufficient. In order to get better results we take into account the fact that, after eliminating the zeros, the smallest frequency is 1. Hence we shift the function one step up on the y-axis; otherwise the function would converge to zero. The parameter b in

y = bx^{-a} + 1    (9.41)

can be estimated using the first frequency. For the individual word classes we obtain the following results:

Nouns:       y = 206.6252 x^{-1.5743} + 1,   R² = 0.99
Verbs:       y = 113.6624 x^{-0.1563} + 1,   R² = 0.99
Adjectives:  y = 104.1267 x^{-1.8942} + 1,   R² = 0.996
Adverbs:     y = 39.6188 x^{-1.2761} + 1,    R² = 0.89

The values are presented in Table 9.33.
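As an illustration of this second approach, the shifted Zeta function (9.41) can be fitted with any nonlinear least-squares routine; scipy's curve_fit, used in the sketch below, is merely one possible choice. Since it estimates b freely rather than only from the first frequency class, the resulting parameters may deviate slightly from those given above.

```python
import numpy as np
from scipy.optimize import curve_fit

def shifted_zeta(x, b, a):
    """Shifted Zeta function y = b * x^(-a) + 1, cf. (9.41)."""
    return b * x ** (-a) + 1.0

# Non-zero spectrum classes of adverbs in E-08 (Table 9.28)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 11, 12], dtype=float)
y = np.array([38, 28, 11, 4, 3, 1, 3, 2, 1, 1], dtype=float)

# start the search at b = g(1) - 1, as suggested in the text
params, _ = curve_fit(shifted_zeta, x, y, p0=(y[0] - 1.0, 1.5))
b_hat, a_hat = params
residuals = y - shifted_zeta(x, *params)
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
print(round(b_hat, 4), round(a_hat, 4), round(r2, 2))
```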
Table 9.33: Fitting the shifted Zeta function to word classes in E-08 x 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 19 28 38 44 48 51 60
Nouns y ytheor 205 85 31 21 16 13 7 8 6 1 4 4 2 2 1 2 1 1 1 1 1 1
207.63 70.39 37.65 24.30 17.40 13.31 10.66 8.82 7.50 6.51 5.74 5.13 4.64 4.24 3.91 3.00 2.09 1.67 1.53 1.47 1.42 1.33
x 1 2 3 4 5 6 7 8 9 10 12 13 14 15 17 40 42 51 85
Verbs y ytheor 114 40 28 9 7 3 4 4 1 3 3 1 1 3 1 1 1 1 1
114.86 38.01 20.18 13.03 9.38 7.23 5.85 4.91 4.23 3.72 3.03 2.78 2.58 2.41 2.15 1.29 1.27 1.19 1.08
x 1 2 3 4 5 6 7 8 9 12 16 20 22
Adjectives y ytheor 105 30 13 7 11 2 2 4 1 1 1 1 1
105.13 29.01 14.00 8.54 5.94 4.50 3.61 3.03 2.62 1.94 1.55 1.36 1.30
x 1 2 3 4 5 6 7 8 11 12
Adverbs y ytheor 38 28 11 4 3 1 3 2 1 1
40.62 17.36 10.75 7.75 6.08 5.07 4.31 3.79 2.86 2.66
As can be seen, there is almost perfect agreement. It can be noted that for adjectives both the distribution and the function have the same parameter a = 1.8942. The results are displayed graphically for nouns in Figure 9.16. From this result we can draw the following conclusion: If a frequency spectrum has distribution X, then all “correctly” defined classes of words in it, taken separately, follow the same distribution. The term “correctly” means here that the class should not be too fuzzy. Hence this criterion can help to delimit a “class” of words. We can add words to a class as long as the frequencies follow the original distribution. Of course, the hypothesis must be thoroughly tested on many texts and many classes.
Figure 9.16: Fitting the Zeta function to noun spectrum in E-08
10
The relation of frequency to other word properties
We usually say that properties are attributes of objects. There is no objection, because objects do not exist without properties. To grasp objects conceptually is not a matter of the object but a process performed by the recognizing subject. Our conceptual grasping of properties is never finished since it is a cognitive process. Earlier, we were interested in just a few properties of the word, and today we investigate a considerably increased number of properties. This fact forces us to assume that the number of properties of an object is practically infinite and that the question as to which of them can already be established depends just on the state of research. Hence properties are real entities, but it depends on us what they look like. It is probably the word for which most properties have been established. This is because the word – whatever its definition – is the basic unit of the dictionary, of syntax, semantics, psycholinguistics, language evolution etc. However, not all properties of words have been studied with the same intensity, and rather a small number of them have been investigated quantitatively. Many “measurements” have been performed on the nominal scale by simply attributing the words to some classes which seemed to be relevant for a problem. In principle, all properties are quantifiable and it is always possible to devise an operationalization leading to a quantitative expression of the property. Let us, for the sake of illustration, enumerate some well-known properties of the word.

1. Length, which can be measured in terms of the number of phonemes, letters, syllables, morae or morphemes. Sometimes it is called material complexity.
2. Frequency: number of occurrences in texts.
3. Polysemy: number of meanings in a dictionary.
4. Morphological status: simple word, reduplicated word, derived word, compound.
5. Class membership: the number of word classes to which it belongs, e.g. by conversion (the hand, to hand).
6. Polytextuality: the number of texts in which it occurs, or the number of contexts (direct neighbours in text, collocation).
7. Productivity: the number of derivatives, compounds, reduplications that can be built with the word. One finds sufficient data on the Internet.
8. Age: the number of years or centuries from the first appearance of the word in texts.
9. Provenience: the number of languages over which it came into the language under investigation.
10. Valency: the number of obligatory actants or circumstants.
11. The number of its grammatical categories: conjugation, declension, comparison, time, mode etc., or the number of affixes the word can be combined with (e.g. not every verb can be combined with all prefixes).
12. Degree of emotionality vs. notionality. Compare, for example, the emotionality of the words “mother” and “pencil”.
13. Pollyanna: the degree of the word on the “good – bad” scale.
14. Degree of abstractness vs. concreteness of the word, e.g. “beauty” vs. “pencil”.
15. Specificity vs. generality, e.g. “pencil” vs. “instrument”.
16. Degree of dogmatism, e.g. “can” vs. “must”, “all” vs. “some”, “always” vs. “sometimes”.
17. Number of associations (= connotative potential) that arise when hearing a word. There are comprehensive dictionaries of word associations.
18. Synonymy: number of synonyms in a dictionary.
19. Number of different functions in the sentence, e.g. a word can be subject, object, predicate etc.
20. Diatopic variation: in how many areas of the dialect atlas can the word be found?
21. Dialectal competition: how many competitors of the word are there in a dialect atlas?
22. Discourse properties: to what degree can the word be attributed to a social group?
23. The degree of standardization: standard language, sociolect, argot etc.
24. Diversity: how many word classes can the word enter by derivation, e.g. German “Bild” (N) → “bildhaft” (Adj), “bilden” (V), “bildlich” (Adj, Adv).
25. Originality: genuine, borrowing, calque, folk etymology, substrate etc.
26. Phraseology: in how many idioms can the word be found?
27. Degree of verb activity, e.g. “sleep” vs. “run”.
28. Degree of expression of a property by an adjective, e.g. “nice”, “pretty”, “beautiful”.
The list could be enlarged by studying language descriptions of different kinds. Even the classes of properties can be enlarged. Above, we mentioned material, morphological, syntactic, semantic, psycholinguistic, historical-etymological, combinatorial, phraseological, and diversity classes, and all of them can be subdivided further and enriched. All these properties are connected with at least one other property; none of them is isolated. They build a self-regulation cycle whose full capturing is (preliminarily) impossible. Nevertheless, we can at least show the bases of this self-regulation. We restrict ourselves to word frequency and some of its well-known connections with other properties (cf. Köhler 1986). An overwhelming part of this research can be performed following the method shown below. Consider the length of a word form and its frequency in a text. Since Zipf (1935), it has been known that length is a function of frequency. The opposite direction of dependence is not quite reasonable. In speech, frequent forms tend to be shorter, a phenomenon that can be found e.g. in abbreviations, in greetings, in words like German “Uni” for “Universität” used among students but seldom outside. But out of two synonyms one does not always choose the shorter one: it depends on the situation, e.g. honorific pronominal appellatives are always longer than their colloquial counterparts, courtesy affixes are longer than their plain counterparts (e.g. Japanese –masu vs. –ru with verbs), writers rather use a rare but stylistically more impressive synonym; in Japanese, the use of complex signs is a matter of erudition, etc. The phenomenon can be evidenced in any language. The properties of the chosen element are always in accordance with some system requirements; these can be very general but also ad hoc. Köhler (1986, 2005) integrated them into a complex system of self-regulation and established the following requirements, which are sufficient for the majority of relationships:

Coding
De-specification
Transmission security
Minimisation of production effort
Minimisation of decoding effort
Minimisation of memory effort
Invariance of the expression-meaning relation
Context specificity
Maximisation of complexity
Limitation of embedding depth
Adaptation
Specification
Application
Economy
Minimisation of encoding effort
Minimisation of inventories
Context economy
Flexibility of the expression-meaning relation
Efficiency of coding
Preference of uniform branching direction
Minimisation of structural information
Stability
Each of these requirements is active on different occasions, sometimes several of them at the same time. Their relationships can be studied in isolation or in the framework of a broader system. In an isolated investigation, one of the properties is considered an independent variable (e.g. frequency), another one a dependent variable (e.g. length). Every other variable is considered constant (= ceteris paribus) or a function of the independent variable; with regard to linguistic issues, the latter usually becomes smaller as the independent variable grows. The variables in Köhler's system are always considered in their logarithmic form, i.e. L = f(F) means that the logarithm of length is a function of the logarithm of frequency, ln L = f(ln F). We shall use the latter notation. A dependence can be displayed graphically as a scheme with an arrow leading from the independent to the dependent variable:

frequency ──→ length

Since length changes (decreases) proportionally with increasing frequency, we insert a proportionality constant −b into the relation and obtain

frequency ──(−b)──→ length

Such a scheme is read following the arrows, and we obtain ln(length) = −b · ln(frequency), or shorter ln L = −b · ln F. Since every relationship in language is impaired by noise originating from measurement error, random drift, or even from weak influences of other factors, one usually adds an “unknown” factor C. Further, in synergetic linguistics, one tries to interpret the parameter b by one of the requirements from Köhler's inventory, obtaining in this case minP (= minimisation of production effort). This yields ln L = C − b · ln F and, after solving for L, L = e^C F^{-b} = aF^{-b}.
Consider now the general theory in which the dependent variable changes not only proportionally to the independent variable but also to its own size. In that case we can write

\frac{dL}{L} = -\frac{b}{F}\,dF    (10.1)

yielding the same result. Though the results of both approaches are identical, the systems-theoretical approach is more lucid and mathematically simpler, especially in the case of several independent variables, which would otherwise force us to use partial differential equations. This relationship originates from Zipf (1935) but has been studied later by several researchers, e.g. Guiter (1974), Köhler (1986), etc. The empirical relationship can be established in the following way: the length of word forms is measured in terms of syllable numbers.¹ The length of words occurring in the same frequency class is computed by taking the average of the class, e.g. the length of words occurring once is given by the mean syllable length of these words. For the sake of illustration let us take a longer text of at least 1000 words, e.g. H-04. Computing the mean lengths of words of the given frequency classes we obtain the results in the first two columns of Table 10.1. Since 5 or fewer cases are not quite reliable, we must also average some frequency classes. Thus, we took averages of classes 5 to 6 and 8 to 76. The resulting curve for the Hungarian text H-04 is

L = 3.3596 F^{-0.4425}
yielding a determination coefficient R² = 0.88 and highly significant t- and F-values.

¹ Counting phonemes, graphemes or letters means skipping one linguistic level (the morph or syllable level), so that the smooth course of the curves can be disturbed. In this case one may obtain a “good” result just by chance, or one would need differential equations of second order.

Table 10.1: Fitting the means of word lengths in dependence on (the mean) frequency class in the Hungarian text H-04

Frequency class F    Average length L    Computed L
 1.00                3.35                3.36
 2.00                2.44                2.47
 3.00                2.56                2.07
 4.00                1.43                1.82
 5.50                1.33                1.58
27.28                1.00                0.78
The relationship always holds if the text is long enough and the frequency classes are reliably occupied. Nevertheless, there will be differences both between texts and between languages. Looking at the results in Table 10.1 we see that the theoretical curve goes below 1. Since in practice this is not possible – because we consider only classes having frequency greater than zero – the general theory is usually presented rather as

\frac{dy}{y - d} = g(x)\,dx    (10.2)

where d represents the asymptote of decreasing curves, in our case

\frac{dL}{L - d} = -\frac{b}{F}\,dF    (10.3)

yielding

L = aF^{-b} + d    (10.4)

where d = 1. Fitting this curve to our data we obtain the results in Table 10.2. The curve is now L = 2.4476 F^{-0.8356} + 1 with R² = 0.86, i.e. empirically slightly worse but theoretically more adequate. In the same way all other properties can be placed into a system of dependencies, a task that will not be accomplished here because it forms a very complex discipline in its own right. Not all of the above-mentioned properties are related to frequency, but for many of them at least a distant relationship can be shown. Furthermore, not all properties of words can be ascertained
Table 10.2: Fitting (10.4) to the Hungarian data H-04

Frequency class F    Average length L    Computed L (10.4)
 1.00                3.35                3.45
 2.00                2.44                2.37
 3.00                2.56                1.98
 4.00                1.43                1.77
 5.50                1.33                1.59
27.28                1.00                1.15
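Both curves can be reproduced from the data of Table 10.1 with a few lines of Python; the sketch below is ours, curve_fit is only one of several possible fitting tools, and differences in the fitting criterion may shift the parameters slightly against those reported above.

```python
import numpy as np
from scipy.optimize import curve_fit

# (Mean) frequency class and mean word length in H-04, cf. Table 10.1
F = np.array([1.00, 2.00, 3.00, 4.00, 5.50, 27.28])
L = np.array([3.35, 2.44, 2.56, 1.43, 1.33, 1.00])

def simple_power(F, a, b):
    """L = a * F^(-b), the curve fitted before Table 10.1."""
    return a * F ** (-b)

def power_with_asymptote(F, a, b):
    """L = a * F^(-b) + 1, i.e. (10.4) with d = 1."""
    return a * F ** (-b) + 1.0

for model in (simple_power, power_with_asymptote):
    params, _ = curve_fit(model, F, L, p0=(3.0, 0.5))
    pred = model(F, *params)
    r2 = 1 - np.sum((L - pred) ** 2) / np.sum((L - L.mean()) ** 2)
    print(model.__name__, np.round(params, 4), round(r2, 2))
```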
from a given text, for many of them a dictionary, a corpus, or test persons are necessary. We shall not pursue this direction here.
11 Word frequency and position in sentence

11.1 Introduction
Word forms represent the elementary concept-forming level (the empirical concept), which can be left behind in various more abstract directions. One can replace a word form by a lemma, by its part of speech, by its denotation and references, by its different kinds of length-related or other properties, and finally by its frequency. Whatever the replacement, the original empirical “word” will be partially or completely bereft of its original form. We obtain a series of symbols, which can be considered as coded text representing a time series. As soon as one has such a series, one asks whether there are some tendencies, whether the series abides by laws, what the properties of such a series are, etc., i.e. hypotheses can be set up and tested. There are two possibilities of analysing such sequences:

1. One takes into account the complete text, not caring about its partitioning into sentences, chapters or other units. The independent variable is the position in the text.
2. One takes into account units like sentences, and the variable is the position in the given sentence, i.e. the counting of positions begins anew after every sentence. In this case the sequence is not a genuine time series but rather something like a motif, not unlike beats in music or verses in poetry.

Such a sentence motif can, of course, be partitioned into submotifs – see Köhler (2006, 2007) for length motifs, Köhler & Naumann (2008) for linguistic motifs, and Boroda (1973, 1982, 1984, 1986, 1991, 1992) for musical motifs; see also Meyer-Eppler (1969). If the word forms are replaced by numbers (representing properties), then the sequence can contain a kind of elementary rhythm. If they are replaced by qualitative symbols, e.g. accent and no accent, usually runs or classical meters arise. One can set up the working hypothesis that, whatever the replacement of word forms by other abstract symbols, the sequence displays at least some elementary rules, tendencies, fractals, rhythms, motifs, etc., i.e. the sequence is structured in some way. The task (in the far future) is to state whether all of these phenomena abide by laws. In this chapter we shall scrutinize the replacement of word forms by their frequencies in the same text. Uhlířová (2007) speaks of frequency rhythm.
Our problem is to capture it formally. The construction of the data is simple: take a text and compute the frequencies of all word forms. Then replace each word by its absolute frequency. Consider the case prepared by Uhlířová (2007) for the Bulgarian text B-10, in which the first sentence is as follows:

Благодаря ти за писмото ти от седми ноември.
Transcription: Blagodarja ti za pismoto ti ot sedmi noemvri.
Literal translation: I thank you for letter-the your of seventh november.

word position in sentence:   1   2   3   4   5   6   7   8
word frequency in the text:  1   4   18  1   4   5   1   4
Instead of words we have a sequence of numbers representing the frequencies of the words. For the sake of lucidity, we order the sentences according to the number of words in them and obtain the result in Table 11.1, where A denotes frequent entities and B non-frequent entities. The frequencies in a line can thus be divided into two groups: frequent (A) and non-frequent (B). For each line in Table 11.1 we have the mean of the frequencies in the last column. In the last columns of Table 11.1 we present the following numbers:

n   = number of elements in the sequence (n = n_A + n_B),
n_A = number of A's in the row,
n_B = number of B's in the row,
r_A = number of runs formed by A's,
r_B = number of runs formed by B's,
r   = r_A + r_B.
Table 11.1: Replacing words by frequencies in the Bulgarian text B-10
n
nA
nB
rA
rB
r
– – BABA 4 BAAA 4 AABABA 6 BABAAA 6 AABBAA 6 AAABBAA 7 BAAAABAA 8 AABAAAAA 8 AABAAAABA 9 BAAABBAAAB 10 BAAABAABABAA 12 AAAABAAAABBA 12 AABABAAAAABA 12 AAAAABAAAABAA 13 AABABAAAABAAA 13 AABAABABAAAAAA 14 ABAABAABAABABA 14 BABAABAABABAAAA 15 AABBAAAABAABBBAA 16 BAAAABAAABABAABA 16 AAAAAAABAAAAABAA 16 ABAABABAAAAAABAA 16 ABAAABBAAAABAAAA 16 BAAAAAABBAAABABA 16 ABABBAAAAABAAABAA 17 ABAAABABAAAAAABAA 17 BAAABAABABABAABAA 17 AABABAABAAAAAAAAA 17 AAABAAABAAABAABAAA 18 ABAAABBAAABAABBABA 18 AABAABBAAABABABBAAAA 20 ABAABABAAAAAAAABAABAA 21 AAAAAABABAAAAABAAAABABA 23 AABAABAAAABAABBAAABAABAA 24 AABBAABABBABBBAAABAAABAA 24 ABBABAAAAAAAABBABBAAAAAAABAAAAAAA 33 AABAABAABABAAAAAAABAABABAAAAAAAABABA 36
– 2 3 4 4 4 5 6 7 7 6 8 9 9 11 10 11 9 10 10 11 14 12 12 11 12 13 14 14 14 11 13 16 18 17 14 25 27
– 2 1 2 2 2 2 2 1 2 4 4 3 3 2 3 3 5 5 6 5 2 4 4 5 5 4 3 3 4 7 7 5 5 7 10 8 9
– 2 1 3 2 2 2 2 2 3 2 4 3 4 3 4 4 6 5 4 5 3 5 4 4 5 5 6 4 5 6 6 6 6 7 7 6 10
– 2 1 2 2 1 1 2 1 2 3 4 2 3 2 3 3 5 5 3 5 2 4 3 4 4 4 6 3 4 5 5 5 5 6 6 5 9
– 4 2 5 4 3 3 4 3 5 5 8 5 7 5 7 7 11 10 7 10 5 9 7 8 9 9 12 7 9 11 11 11 12 13 13 11 19
11.2
Runs of binary data
A run is a sequence of equal letters, e.g. in A A B A B B A there are 3 runs of A and 2 runs of B. In the first line of Table 11.2 we have 1,1 (both numbers equal to the mean) hence no test can be performed. If a frequency is smaller than or equal to the mean, it can be symbolized as A, if it is greater, it is B. Hence Table 11.1 can be presented in symbolic form in Table 11.2. Let us restrict our investigation first to sentences. If there is a “frequency rhythm”, how do we state it? Sentences of different lengths are not comparable but we do not have a sufficient number of sentences of the same length to venture even hypothetical statements. Hence, we must first perform some tests for individual sentences. If there is a rhythm, then we must find an order which cannot be accounted for by chance. Let us begin with nonparametric statistics. The probability that the letters are placed randomly can be computed using standard formulas or ready-made tables. In our case, the presence of a rhythm is evidenced if the number of runs in the sequence is either too great or too small. If the number of runs is too great, then there is a certain regularity of placing frequent and infrequent words; if it is too small, then there is an “aimed” placement of infrequent words (usually autosemantics) in a long sequence. This is only the first (reduced) approximation to frequency rhythm, but it can be used for different units. Given nA ,nB ,rA , rB , the probability of exactly r runs is
P(X = r) = \frac{2\binom{n_A - 1}{r/2 - 1}\binom{n_B - 1}{r/2 - 1}}{\binom{n}{n_A}}    if r is even

P(X = r) = \frac{\binom{n_A - 1}{(r-1)/2}\binom{n_B - 1}{(r-3)/2} + \binom{n_A - 1}{(r-3)/2}\binom{n_B - 1}{(r-1)/2}}{\binom{n}{n_A}}    if r is odd    (11.1)
Table 11.2: Nominal presentation of frequencies No.
Sentences presented as sequences of word frequencies
Mean
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
1, 1 2, 1, 3, 1 2, 1, 1, 1 2, 3, 25, 1, 18, 1 25, 6, 18, 1, 6, 2 9, 4, 25, 18, 1, 1 2, 1, 2, 9, 18, 1, 2 9, 1, 1, 1, 2, 25, 2, 2 1, 4, 18, 1, 4, 5, 1, 1 3, 1, 25, 4, 1, 2, 1, 5, 1 2, 1, 1, 1, 2, 4, 1, 1, 1, 2 9, 1, 1, 18, 1, 1, 7, 1, 26, 2, 2 2, 1, 4, 1, 25, 1, 3, 3, 1, 6, 8, 3 1, 1, 26, 2, 23, 1, 4, 3, 1, 1, 23, 1 1, 1, 2, 4, 2, 23, 1, 1, 3, 2, 25, 4, 2 1, 2, 26, 1, 18, 2, 6, 1, 3, 26, 4, 1, 1 4, 1, 23, 1, 2, 25, 1, 26, 1, 1, 1, 1, 1, 1 2, 7, 1, 1, 26, 1, 1, 8, 3, 1, 23, 1, 18, 1 18, 1, 26, 3, 1, 25, 2, 1, 23, 1, 18, 4, 1, 5, 2 1, 3, 5, 6, 1, 1, 1, 1, 4, 2, 1, 26, 4, 4, 1, 1 9, 4, 1, 2, 5, 23, 1, 4, 6, 8, 1, 18, 1, 1, 23, 1 1, 1, 1, 3, 1, 4, 1, 23, 1, 1, 1, 2, 1, 26, 1, 1 2, 6, 1, 1, 23, 1, 18, 1, 2, 2, 3, 1, 1, 23, 1, 2 3, 18, 2, 2, 1, 6, 8, 2, 1, 4, 23, 2, 3, 1, 3 23, 3, 5, 2, 1, 6, 4, 25, 18, 1, 7, 1, 16, 1, 26, 2 1, 18, 1, 6, 18, 2, 1, 4, 1, 1, 6, 1, 1, 1, 23, 4, 3 5, 18, 1, 4, 1, 18, 1, 26, 2, 1, 1, 2, 5, 1, 26, 1, 1 9, 1, 3, 1, 25, 2, 1, 7, 1, 7, 1, 26, 1, 1, 18, 1, 1 3, 2, 7, 1, 26, 1, 1, 4, 1, 1, 1, 3, 2, 1, 1, 1, 1 1, 1, 1, 26, 3, 1, 2, 26, 1, 3, 1, 6, 1, 1, 18, 4, 1, 1 1, 18, 1, 2, 1, 25, 18, 1, 2, 8, 18, 1, 1, 23, 18, 1, 18, 1 6, 1, 26, 2, 2, 9, 18, 1, 1, 2, 23, 2, 26, 2, 9, 18, 2, 3, 8, 1 1, 18, 1, 2, 26, 4, 6, 1, 1, 1, 1, 3, 1, 1, 1, 6, 1, 1, 8, 3, 2 2, 4, 2, 4, 1, 2, 18, 1, 25, 1, 1, 1, 2, 1, 25, 1, 2, 1, 2, 5, 2, 5, 1 2, 6, 8, 1, 2, 25, 1, 1, 1, 1, 26, 1, 1, 18, 25, 1, 1, 1, 18, 1, 1, 23, 4, 3 1, 1, 25, 18, 1, 1, 26, 1, 25, 18, 1, 23, 25, 18, 1, 1, 4, 25, 4, 2, 1, 23, 4, 3 1, 8, 25, 1, 26, 1, 1, 1, 1, 2, 1, 1, 1, 23, 26, 1, 23, 26, 1, 1, 2, 2, 2, 1, 1, 25, 4, 2, 1, 4, 1, 4, 1 4, 1, 26, 2, 3, 9, 4, 1, 25, 1, 23, 1, 3, 2, 6, 1, 1, 1, 7, 3, 2, 23, 1, 18, 3, 1, 6, 1, 1, 2, 1, 3, 18, 1, 26, 2
1.00 1.75 1.25 8.33 9.67 9.67 5.00 5.40 5.83 4.78 1.60 6.27 4.83 7.25 5.46 7.08 6.36 6.71 8.73 3.88 6.75 4.31 5.50 5.00 8.81 5.41 6.71 6.24 3.35 5.44 8.78 8.10 4.71 4.74 7.17 10.50 6.70
38
6.25
There are three possible alternative hypotheses to test the null hypothesis of randomness: 1. there are too many runs (a one-sided hypothesis), 2. there are too few runs (also one-sided), or 3. the number of runs is not random (two-sided).
In order to see the direction of the rhythm we shall use the one-sided hypotheses and the available tables – see e.g. Bortz, Lienert, and Boehnke (1990: 760ff.) – in which one can find the critical number of runs. The tables are made for n_A, n_B ≤ 20; for higher numbers one can use the normal approximation

z = \frac{|r - E(r)| - 0.5}{\sigma_r}    (11.2)

where

E(r) = 1 + \frac{2 n_A n_B}{n}    (11.3)

and

\sigma_r = \sqrt{\frac{2 n_A n_B (2 n_A n_B - n)}{n^2 (n - 1)}}.    (11.4)
The correction for continuity (the 0.5 in (11.2)) is used for n_A, n_B < 30. Let us perform the computation for the last sentence, which has n = 36, n_A = 27, n_B = 9, r_A = 10, r_B = 9, r = 19. For the normal test we obtain from (11.3)

E(r) = 1 + \frac{2(27)(9)}{36} = 14.5

and from (11.4)

\sigma_r = \sqrt{\frac{2(27)(9)\,[2(27)(9) - 36]}{36^2 (36 - 1)}} = 2.1958.

Inserting these values in (11.2) we obtain

z = \frac{|19 - 14.5| - 0.5}{2.1958} = 1.82.

In one-sided testing we reject the null hypothesis of randomness at the α = 0.95 level (for which z = 1.64) and accept the alternative hypothesis that the sentence has too many runs, i.e. there is a frequency rhythm. The computation using the exact formula (11.1) is very lengthy, especially with large numbers, because all permutations composing r must be taken into account; hence we use the z-criterion, even though in some cases it may yield unsatisfactory results. It is to be remarked that this “binary” approach is only a first step.
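The whole procedure – symbolization by the sentence mean and the normal approximation (11.2)–(11.4) – is easy to automate. The following Python sketch is ours and reproduces the computation for the last sentence of B-10 discussed above.

```python
import math

def symbolize(frequencies):
    """A = frequency smaller than or equal to the sentence mean, B = greater."""
    mean = sum(frequencies) / len(frequencies)
    return ['A' if f <= mean else 'B' for f in frequencies]

def runs_test(symbols):
    """One-sided runs test with continuity correction, cf. (11.2)-(11.4)."""
    n_a = symbols.count('A')
    n_b = symbols.count('B')
    n = n_a + n_b
    r = 1 + sum(1 for s, t in zip(symbols, symbols[1:]) if s != t)
    expectation = 1 + 2 * n_a * n_b / n                              # (11.3)
    sigma = math.sqrt(2 * n_a * n_b * (2 * n_a * n_b - n)
                      / (n ** 2 * (n - 1)))                          # (11.4)
    z = (abs(r - expectation) - 0.5) / sigma                         # (11.2)
    return r, expectation, z

# Last sentence of B-10 (n = 36) as a sequence of word frequencies
sentence = [4, 1, 26, 2, 3, 9, 4, 1, 25, 1, 23, 1, 3, 2, 6, 1, 1, 1, 7, 3,
            2, 23, 1, 18, 3, 1, 6, 1, 1, 2, 1, 3, 18, 1, 26, 2]
print(runs_test(symbolize(sentence)))   # r = 19, E(r) = 14.5, z = 1.82
```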
Testing the individual lines in Table 11.2 using the z-criterion, we can state that there are only 4 significant cases in which there are more runs than expected, namely lines 18, 28, 34, 38 (with z = 1.87, 5.04, 1.71, 1.82, respectively). Evidently, frequency rhythm organizes itself in rather long sentences. Though the next statement may be premature – one needs a number of analysed texts to corroborate it – we can conjecture that frequency rhythm is a matter of self-organization and that the longer the sentence, the stronger it is. Again, the “type” of language may play a decisive role.

11.3 Runs of multiple data
The frequencies of words can be codified in different ways, taking different frequency intervals, e.g. 1–2, 3–4, 5–6, . . . , and assigning letters to them: A, B, C, . . . . But they can also be left as they are, and the numbers themselves can be considered as symbols. In that case we have, e.g. for sentence 37 in Table 11.1,

1, 8, 25, 1, 26, 1, 1, 1, 1, 2, 1, 1, 1, 23, 26, 1, 23, 26, 1, 1, 2, 2, 2, 1, 1, 25, 4, 2, 1, 4, 1, 4, 1

Here we first count how many times each different number occurs and denote each count by n_i. We obtain

n_1 = 17
n_2 = 5
n_3 = 3 (representing the number 4)
n_4 = 1 (representing the number 8)
n_5 = 2 (representing the number 23)
n_6 = 2 (representing the number 25)
n_7 = 3 (representing the number 26).

There are k = 7 different numbers used. We calculate n = \sum_{i=1}^{k} n_i = 33 and count r = 24 runs. Now we ask whether the number of runs significantly differs from the expected number, i.e. we test a two-sided hypothesis. To this end we use the normal criterion

z = \frac{r - E(r)}{\sqrt{Var(r)}}    (11.5)
where the expectation is

E(r) = n + 1 - \frac{\sum_{i=1}^{k} n_i^2}{n}    (11.6)
and the variance is

Var(r) = \frac{\sum_{i=1}^{k} n_i^2 \Bigl(\sum_{i=1}^{k} n_i^2 + n(n+1)\Bigr) - 2n \sum_{i=1}^{k} n_i^3 - n^3}{n^2 (n - 1)}.    (11.7)
Inserting the above numbers in (11.6), (11.7) and finally in (11.5) we obtain

E(r) = 33 + 1 - (17^2 + 5^2 + 3^2 + 1^2 + 2^2 + 2^2 + 3^2)/33 = 23.67,

Var(r) = \{341\,[341 + 33(34)] - 2(33)(5109) - 35937\} / [1089(32)] = 3.6086,

z = (24 - 23.67)/1.8996 = 0.17.

We can state that in this sentence the number of runs is random (the critical value is z = ±1.96). If r is significantly smaller than E(r), i.e. z < −1.96, equal frequencies have a tendency to crowd together; if r is significantly greater than E(r), i.e. z > 1.96, then there is a tendency to place equal frequencies at a certain distance from one another, i.e. to thin out the crowding. These tendencies can depend on the language (grammar), on sentence length, on genre, etc. Very thorough and wide-ranging research would be necessary to get reasonable results. Checking in this way all sentences of the given text that are equal to or longer than 14 words, we obtain the following result: there is only one sentence with significant crowding but four sentences with significant thinning out. The rest display a non-significant regime of run formation.
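The same check can be coded directly from (11.5)–(11.7); the sketch below, again ours, reproduces the figures for sentence 37.

```python
import math
from collections import Counter

def runs_test_multi(sequence):
    """Two-sided runs test for a sequence of arbitrary symbols, cf. (11.5)-(11.7)."""
    n = len(sequence)
    counts = Counter(sequence).values()              # the n_i
    s2 = sum(c ** 2 for c in counts)
    s3 = sum(c ** 3 for c in counts)
    r = 1 + sum(1 for a, b in zip(sequence, sequence[1:]) if a != b)
    expectation = n + 1 - s2 / n                                     # (11.6)
    variance = (s2 * (s2 + n * (n + 1)) - 2 * n * s3 - n ** 3) \
               / (n ** 2 * (n - 1))                                  # (11.7)
    z = (r - expectation) / math.sqrt(variance)                      # (11.5)
    return r, expectation, variance, z

sentence_37 = [1, 8, 25, 1, 26, 1, 1, 1, 1, 2, 1, 1, 1, 23, 26, 1, 23, 26,
               1, 1, 2, 2, 2, 1, 1, 25, 4, 2, 1, 4, 1, 4, 1]
print(runs_test_multi(sentence_37))   # r = 24, E(r) = 23.67, z = 0.17
```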
11.4
Absolute positions
There are two positions in a sentence which can be compared without any transformation, namely the initial and the final one. All other positions must be considered relatively, e.g. position 5 in a sentence consisting of 6 words is not the same as position 5 in a sentence consisting of 30 words. Let us first have a glance at the absolute positions. It depends strongly on individual languages whether the sentence begins with a frequent article or with a less frequent noun, it depends on genre,
whether the last word (say, of a verse) is a hapax legomenon or another, more frequent word. Let us consider these positions separately.
11.4.1 Word frequency in the final position

Using Uhlířová's (2007) analysis of ten Bulgarian texts (B-01 to B-10), we collect the frequencies of the words in the last position of each sentence and obtain the results presented in the upper part of Table 11.3. In this table the order of sentences does not play any role; they have been ordered according to their lengths. In the last column of the table, the average frequency of words in the final position is given for each text. This average falls within a very narrow frequency interval in all texts.

Table 11.3: Frequency of word in the final and initial positions in each sentence (in ten texts)

Final position
B-01 1 B-02 1 B-03 1 B-04 1 B-05 1 B-06 1 B-07 1 B-08 1 B-09 1 B-10 1
1 1 9 1 1 1 1 1 1 1
1 1 1 1 3 1 2 1 1 1
3 2 1 1 1 3 3 1 1 1
3 1 2 1 1 1 1 1 1 2
2 1 1 1 2 1 1 4 1 1
2 1 1 2 1 1 1 1 1 2
2 2 1 1 1 4 1 1 2 2
2 1 2 1 1 1 1 1 2 1
1 1 2 1 1 1 1 1 2 1
1 1 3 1 2 2 1 1 1 2
2 1 1 13 1 1 1 1 2 2
1 1 1 1 2 2 1 1 2 3
1 3 1 1 2 1 2 1 1 1
1 1 1 1 1 2 1 1 1 2
1 2 1 2 1 1 1 1 1 1
3 1 2 2 7 2 1 1 1 1
5 2 2 1
1 2 1 1
1 1 1 1
2 1 1 1
1 1 2 1
1 1 1 1 1
1 4 1 1 1 1 3 1
1 1 13 1 1 1 1 1 15 2
8 2 1 1 4 2 1 3 1 25
1 2 1 5 1 2 1 3 2 9
2 1 6 1 7 24 1 1 4 2
1 13 6 1 1 1 3 2 6 9
6 1 6 2 2 1 16 1 1 1
2 3 2 1 1 2 3 1 17 3
1 5 6 1 1 4 1 2 1 2
6 1 1 1 7 27 1 1 1 9
29 1 1 1 1 3 1 1 1 2
17 1 1 1 1 1 1 5 1 1
2 4 3 1 2 1 2 1 17 1
2 2 2 7 2 1 4 1 20 1
3 8 3 1 2 12 1 1 1 4
2 13 2 1
3 1 1 1 2 1 1 1
1 2 1 1 1 1 4 6 1 1 1 1 1 1 3 1 1.71 1 3 1 1.36 9 1 1 1 2 1 1 1 2 1 2 1 1 1 1 1 1 2 1 1 1.67 1 1.65 1.71 1 4 1 1 2 1 1 1 1 1 2 1.52 4 1 1 1 1 1 1 2 1.33 1.17 1 1 1 1.28 2 3 2 3 1 1 1 1 1 1 2 1 3 3 1 2 1.55
Initial position B-01 1 B-02 1 B-03 1 B-04 1 B-05 1 B-06 1 B-07 1 B-08 1 B-09 1 B-10 1
2 1 6 1 1 1 8 2 3 2
2 1 9 5 19 2 1 5 20 2
1 1 1 2 2
1 1 6 1
1 2 1 7
3 4 1 3
3 2 2 1
27 1 1 1 1 4 8 4 1 1 2 1 18 1 9 1
5 1 2 1 5 12 29 1 1 29 5 3 6 29 2 2 6.03 3 3 2 3.12 4 5 1 1 3 5 1 1 1 14 1 8 1 2 1 2 1 13 3 6 3.67 1 2.00 3.18 2 18 1 1 1 12 3 2 28 1 18 6.18 7 15 1 8 1 4 3 2 3.53 1.83 3 1 1 4.96 2 3 23 1 5 9 3 1 1 6 1 2 2 1 1 4 4.53
Thus, the importance of the final (= informationally heavy) position for the frequency rhythm is empirically supported by our data. In principle, it would be possible to test the low frequency at the final position using variance analysis or a test for the difference of frequencies in final and non-final positions, but to this end we would be forced to consider all words in all texts. Instead, we set up the following hypothesis: on average, the last words in the sentences of a text vary in the frequency interval (1, 2) or, in other words, their frequency does not significantly differ from an overall (expected) mean of 1.5. Not considering the lengths of the individual texts (the weights of the empirical averages), we compute the mean of the last column of Table 11.3 and obtain 1.495. The standard deviation of the empirical means is 0.19665 and the standard deviation of the mean is 0.19665/√10. Hence the test z = (1.495 − 1.5)√10/0.19665 = −0.08 shows that our hypothesis is correct: the final position of sentences contains words of very low frequency, approximately in the interval ⟨1, 2⟩. The existence of a position, either final or (usually) any other one, marked for the frequency rhythm, if proved, could be important for text/speech recognition procedures; in a continuous text/speech in which sentence boundaries are not formally/intonationally indicated, it could be the frequency rhythm that could possibly help. This is the case not only in old folklore of different nations but also, e.g., in older German administrative texts. A mechanical procedure could be devised computing the probability of sentence end after a certain word.
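The little test can be written out as follows; the ten averages are those of the last column of the upper (final position) part of Table 11.3, and the code, a sketch of ours, simply reproduces the z-value above.

```python
import math

# Average frequency of the sentence-final word in B-01 to B-10 (Table 11.3)
final_means = [1.71, 1.36, 1.67, 1.65, 1.71, 1.52, 1.33, 1.17, 1.28, 1.55]

n = len(final_means)
mean = sum(final_means) / n
sd = math.sqrt(sum((m - mean) ** 2 for m in final_means) / (n - 1))
z = (mean - 1.5) * math.sqrt(n) / sd

print(round(mean, 3), round(sd, 4), round(z, 2))   # about 1.495, 0.1967, -0.08
```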
11.4.2 Word frequency in the initial position

Now let us take the same ten texts and have a look at word frequency in the initial position. Let us ask whether the initial position, too, may be relevant for the frequency rhythm. The data, i.e. the word frequencies in the initial position of each sentence, are given in the lower part of Table 11.3. Here the result is negative. The average frequencies (last column) are significantly higher than those in the final position and they are spread over a very large interval. This fact can be explained by the interplay of various factors, e.g. the typical cohesive and referential function of words standing at the sentence beginning, the existence of an article before the noun, the genre (e.g. written or spoken language, drama, etc.), and so on.
11.5 Relative position
Since positions in sentences of different lengths cannot be directly compared, one must find a way of making them comparable. This can be done in different ways. One of the possibilities is to investigate only sentences of the same length, a method adequate only for texts in which all lengths are sufficiently represented. Another one is to relativize the position by dividing it by the length of the longest sentence. Here we divide each sentence in the text into ten intervals of equal length, (0.0, 0.1), (0.1, 0.2), . . . , (0.9, 1.0), regardless of its actual length, i.e. we divide each position by the number of positions in the sentence. This partitioning is just as subjective as any other one, but it is quite common. From the theoretical point of view, the best relativization is the one for which one can find some regularities. For each interval, let us find all words which fall into it. Using the above text B-10, we state only the positions of words occurring once, i.e. of hapax legomena, whose frequency is sufficient for further analysis. The result is given in Table 11.4. Thus, e.g., in the interval (0.0, 0.1) there are 14 words with f = 1, in the interval (0.1, 0.2) there are 22 words with f = 1, etc.

Table 11.4: Word frequency in relative position in text B-10

Interval:                 0–0.1  0.1–0.2  0.2–0.3  0.3–0.4  0.4–0.5  0.5–0.6  0.6–0.7  0.7–0.8  0.8–0.9  0.9–1.0
y_x (words with f = 1):   14     22       22       23       34       16       28       25       27       35
In Figure 11.1 the frequencies of hapax legomena in the individual relative positions are presented. It can easily be seen that the numbers of hapax legomena in the individual intervals are not homogeneous: the chi-square test for homogeneity yields X² = 30.44 with 9 DF and P = 0.0004. Hence the oscillation is not random; there is probably a rhythm. Looking at Figure 11.1 we see that there are two great “waves” dividing the series exactly in the middle. Hence, we may venture to express the following hypothesis: in sentences, there exist positional intervals in which some frequency rhythm is manifested. In order to test the presence of a rhythm we perform a Fourier analysis, showing explicitly a method which can be carried out even with a pocket calculator (cf. Altmann 1988: 200ff.). Today, usually computer programs are
Figure 11.1: Occurrence of hapax legomena in relative positions
In order to avoid integrals, we renumber the intervals as x = 1, 2, . . . , 10. A Fourier series can be written as

y_x = A_0 + ∑_{i=1}^{q} [A_i cos(2πx f_i) + B_i sin(2πx f_i)]    (11.8)
where A_0, A_i, B_i (i = 1, 2, . . . , q) are parameters, f_i = i/N is the i-th harmonic, and q = N/2 for even N while q = (N − 1)/2 for odd N, N being the number of x-points (here positions). The individual parameters can be computed as

A_0 = ȳ    (11.9)

A_i = (2/N) ∑_{x=1}^{N} y_x cos(2πx f_i)    (11.10)

B_i = (2/N) ∑_{x=1}^{N} y_x sin(2πx f_i).    (11.11)

For even N, the q-th coefficients are

A_q = (1/N) ∑_{x=1}^{N} y_x (−1)^x,   B_q = 0.    (11.12)
Computing, e.g., A_3 we obtain from our data

A_3 = (2/10)[14 cos(2π · 1 · 3/10) + 22 cos(2π · 2 · 3/10) + 22 cos(2π · 3 · 3/10) + . . . + 35 cos(2π · 10 · 3/10)] = 0.5618.
The mean of all observations is A_0 = 24.60. The other coefficients are shown in Table 11.5 (the computation was performed to 10 decimal places and rounded). The intensity is computed as

I(f_i) = (N/2)(A_i² + B_i²).    (11.13)
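The computations in (11.9)–(11.13) are easily mechanized. The following Python sketch – our own illustration, not part of the original analysis – takes the ten hapax counts of Table 11.4 as input, computes the coefficients and intensities, and reconstructs the series from the three strongest harmonics.

import math

y = [14, 22, 22, 23, 34, 16, 28, 25, 27, 35]     # hapax counts per interval (Table 11.4)
N = len(y)
q = N // 2
A0 = sum(y) / N
A, B, I = {}, {}, {}
for i in range(1, q + 1):
    fi = i / N
    A[i] = 2 / N * sum(y[x - 1] * math.cos(2 * math.pi * x * fi) for x in range(1, N + 1))
    B[i] = 2 / N * sum(y[x - 1] * math.sin(2 * math.pi * x * fi) for x in range(1, N + 1))
    if N % 2 == 0 and i == q:                    # special case (11.12) for even N
        A[i] = sum(y[x - 1] * (-1) ** x for x in range(1, N + 1)) / N
        B[i] = 0.0
    I[i] = N / 2 * (A[i] ** 2 + B[i] ** 2)       # intensity (11.13)

strong = sorted(I, key=I.get, reverse=True)[:3]  # the harmonics with periods 2.5, 5 and 10 here
yt = [A0 + sum(A[i] * math.cos(2 * math.pi * x * i / N)
             + B[i] * math.sin(2 * math.pi * x * i / N) for i in strong)
      for x in range(1, N + 1)]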
Table 11.5: Fourier analysis of frequency rhythm of hapax legomena in B-10

i   Harmonic f_i   Period   A_i       B_i        Intensity I(f_i)   %s²
1   0.1            10        0.3382   −2.4172     29.7872            7.17
2   0.2            5         3.0493   −3.4516    106.0586           25.52
3   0.3            3.3       0.5618   −0.0833      1.6128            0.39
4   0.4            2.5       6.8507   −2.9218    277.3414           66.73
5   0.5            2        −0.4000    0            0.8000            0.19
∑                                                 415.6000
The sum of the intensities yields the sum of squared deviations, and %s² shows the percentage of the given intensity in the total. We choose the strong intensities, namely those with periods 10, 5 and 2.5, and set up our empirical formula as

y = 24.6 + 0.3382 cos(2πx/10) − 2.4172 sin(2πx/10) + 3.0493 cos(2π·2x/10) − 3.4516 sin(2π·2x/10) + 6.8507 cos(2π·4x/10) − 2.9218 sin(2π·4x/10)
yielding the values in the last column of Table 11.6. The values from Table 11.6 are graphically presented in Figure 11.2. The hypothesis is reasonable: the coefficient of determination is R² = 0.99. Autosemantic words with low frequency are connected by synsemantics with high frequency.
Table 11.6: Fourier analysis of position-frequency data

x     1      2      3      4      5      6      7      8      9      10
y     14     22     22     23     34     16     28     25     27     35
y_t   13.85  22.81  21.10  23.31  34.16  16.15  27.19  25.90  26.69  34.83
Figure 11.2: Fourier analysis of frequency rhythm, from Uhlířová (2007)
Hence the frequency motion should be more or less regular. We suppose that in long texts more frequency classes will display a rhythm, but in our text it is only the class of hapax legomena. For the middle interval, much more data would be necessary to reveal at least some of the factors which may be at play. As far as the final interval is concerned, its special status is strongly supported by the general principle of the communicative structure of the sentence, well known to linguists under various names, such as the principle of end weight, the principle of increasing communicative dynamism, the theme-rheme or topic-comment structure, given and new information, etc.
This principle says that the element(s) with the maximal weight/the highest degree of communicative dynamism/the rheme proper, the bearer of new information, etc., display(s) a strong tendency to be placed at the sentence end. It is highly probable that an element which is to push the information content of the communication forward will be expressed by a word with f = 1 or, if with f > 1, still with quite a low frequency.
11.6 Frequency motifs
The sequence of frequencies can be considered as partitioned not only into sentences but also into other, smaller parts, called F-segments. The study of formal segments was initiated in music by Boroda (1973, 1982, 1986, 1991, 1992) and continued in linguistics by Köhler (2006, 2007), who used the concept of L-motifs, i.e. length motifs. Using Köhler’s approach, we define an F-segment as follows: an F-segment is a text segment which begins with the first word of the text (or with the first word after the end of the preceding F-segment) and consists of word frequencies each of which is equal to or greater than that of its left neighbour. The Bulgarian sentence in Section 11.1, given as 1 4 18 1 4 5 1 4, consists of the following F-segments: 1-4-18, 1-4-5, 1-4. Every text can be completely partitioned into F-motifs. They, too, can be regarded as a kind of rhythmic unit with different properties. Not every F-motif needs to begin with frequency 1, but the majority of them do. What is more, even the F-motif beginnings follow a special distribution. The F-motifs are very abstract units which allow the reconstruction neither of language nor – to an even lesser degree – of text. In “postpositional” languages, an F-motif can end in a postposition, in “prepositional” languages in a preposition; in languages having articles it begins with an article, etc. A very wide-ranging investigation would be necessary in order to find the relationship of the properties of F-motifs to other properties of text or language.
We consider F-motifs as something like irregular beats which nevertheless follow some regularity. For the sake of illustration consider all F-motifs of the complete Bulgarian text B-10 given in Table 11.7. As can be seen, they have different numbers of components (different lengths), they occur with different frequencies, i.e. they are repeated in the text, and if we join the first element of an F-motif with its last one by a straight line, the slope can be very steep. The F-motifs have a special TTR regime and they have a rank frequency distribution as well as a spectrum. Let us consider these properties step by step. Another aspect of F-motifs is their dependence within the sequence. Perhaps they form Markov chains of some order. Since they are repeated, forming a time series, their autocorrelation would be of interest. Drawing them in the form of internal straight lines, they represent a fractal having a certain fractal dimension. The Hurst coefficient could inform us about the persistence of this series, etc. As can easily be seen, both Köhler’s L-motifs and the F-motifs open up a new field of investigation whose boundaries cannot even be estimated.
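To make the definition concrete, here is a small Python sketch (our own illustration; the function name is not from the original) that partitions a frequency sequence into F-motifs, i.e. maximal non-decreasing runs:

def f_motifs(freqs):
    """Split a sequence of word frequencies into F-motifs: maximal runs in which
    every value is equal to or greater than its left neighbour."""
    motifs, current = [], []
    for f in freqs:
        if current and f < current[-1]:      # a drop closes the current motif
            motifs.append(current)
            current = []
        current.append(f)
    if current:
        motifs.append(current)
    return motifs

# the Bulgarian example sequence from Section 11.1
print(f_motifs([1, 4, 18, 1, 4, 5, 1, 4]))   # [[1, 4, 18], [1, 4, 5], [1, 4]]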
Table 11.7: F-segments of text B-10

F-motif           Occurrences  Length     F-motif      Occurrences  Length
1                  1            1         1,3           4            2
1,1,1,1,1,1,2      1            7         1,3,18        1            3
1,1,1,1,1,8,25     1            7         1,3,26        1            3
1,1,1,1,2          1            5         1,3,3         1            3
1,1,1,1,2,4        1            6         1,3,5,6       1            4
1,1,1,1,26         2            5         1,3,9         1            3
1,1,1,1,3          1            5         1,4           9            2
1,1,1,1,4          1            5         1,4,5         1            3
1,1,1,18           2            4         1,4,6,8       1            4
1,1,1,2            3            4         1,5           2            2
1,1,1,2,25         1            5         1,6           4            2
1,1,1,2,4          1            5         1,6,18        1            3
1,1,1,23           1            4         1,6,8         2            3
1,1,1,23,26        1            5         1,7           3            2
1,1,1,3            3            4         1,9           1            2
1,1,1,4,18         1            5         1,16          1            2
1,1,1,6            1            4         1,18         11            2
1,1,1,7            1            4         1,23          7            2
1,1,18             5            4         1,25         10            2
1,1,18,25          1            4         1,26         14            2
1,1,2              2            3         2            14            1
1,1,2,2,2          1            5         2,2           1            2
1,1,2,23           1            4         2,2,2         1            3
1,1,2,4            1            4         2,2,3         1            3
1,1,2,5            1            4         2,2,9,18      1            4
1,1,2,6            1            4         2,3           1            2
1,1,23             6            3         2,3,18        1            3
1,1,25             2            3         2,3,8         1            3
1,1,26             3            3         2,3,9         1            3
1,1,3              1            3         2,4           1            2
1,1,4              1            3         2,5           1            2
1,1,4,25           1            4         2,5,18        1            3
1,1,6              1            3         2,6           2            2
1,1,7              1            3         2,7           1            2
1,1,8              1            3         2,9           1            2
1,1,9              2            3         2,9,18        1            3
1,2                5            2         2,23          3            2
1,2,2,3            1            4         2,25          1            2
1,2,2,6,8          1            5         2,26          1            2
1,2,18             1            3         3             9            1
1,2,25             2            3         3,4           1            2
1,2,26             3            3         3,5           1            2
1,2,3,25           1            4         3,23          1            2
1,2,5              1            3         4            12            1
1,2,5,23           1            4         4,4           1            2
1,2,7              1            3         4,6           1            2
1,2,8,18           1            4         4,25          2            2
1,2,9,18           1            4         6,18          1            2
1,23,25            1            3         18            7            1
1,23,26            1            3
Let us first consider the rank frequency distribution of F-motifs. If they are prolific linguistic units, their rank frequency distribution must abide by some distribution taken from the general theory, most probably one of those used for words. In Table 11.8 the frequencies are ranked and the Zipf-Mandelbrot distribution is fitted to them.

Table 11.8: Fitting the Zipf-Mandelbrot distribution to the rank frequencies of F-segments
Rank r  f(r)  NP(r)     Rank r  f(r)  NP(r)     Rank r  f(r)  NP(r)
1       14    18.25     34      1     1.76      67      1     0.95
2       14    13.98     35      1     1.71      68      1     0.93
3       12    11.35     36      1     1.67      69      1     0.92
4       11     9.56     37      1     1.63      70      1     0.91
5       10     8.28     38      1     1.59      71      1     0.90
6        9     7.30     39      1     1.55      72      1     0.87
7        9     6.53     40      1     1.52      73      1     0.88
8        7     5.91     41      1     1.48      74      1     0.86
9        7     5.40     42      1     1.45      75      1     0.85
10       6     4.98     43      1     1.42      76      1     0.84
11       5     4.62     44      1     1.39      77      1     0.83
12       5     4.30     45      1     1.36      78      1     0.82
13       4     4.03     46      1     1.34      79      1     0.81
14       4     3.79     47      1     1.31      80      1     0.80
15       3     3.58     48      1     1.29      81      1     0.79
16       3     3.39     49      1     1.26      82      1     0.79
17       3     3.22     50      1     1.24      83      1     0.78
18       3     3.07     51      1     1.22      84      1     0.77
19       3     2.93     52      1     1.20      85      1     0.76
20       3     2.81     53      1     1.18      86      1     0.75
21       2     2.69     54      1     1.16      87      1     0.74
22       2     2.58     55      1     1.14      88      1     0.74
23       2     2.48     56      1     1.12      89      1     0.73
24       2     2.40     57      1     1.19      90      1     0.72
25       2     2.31     58      1     1.08      91      1     0.71
26       2     2.23     59      1     1.07      92      1     0.71
27       2     2.16     60      1     1.05      93      1     0.70
28       2     2.09     61      1     1.03      94      1     0.69
29       2     2.06     62      1     1.02      95      1     0.69
30       2     1.97     63      1     1.00      96      1     0.68
31       1     1.91     64      1     0.99      97      1     0.67
32       1     1.86     65      1     0.97      98      1     0.67
33       1     1.81     66      1     0.96      99      1     0.66

a = 0.9520, b = 2.0880, n = 99, X² = 11.14, DF = 77, P ≈ 1
The fit is almost perfect, as can also be seen from the graphical representation in Figure 11.3. The Zipf-Mandelbrot distribution is not, however, the only adequate model: even the right-truncated zeta distribution, which has one parameter less, and also the negative hypergeometric distribution yield very satisfying results (with P ≈ 1.0). Though the empirical distribution is very flat and various other distributions would be appropriate, the adequacy of the Zipf-Mandelbrot distribution is, so to say, a warrant of the fruitfulness of future research on F-motifs.
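A fit like the one in Table 11.8 can be obtained, for instance, by minimizing the chi-square criterion over the two parameters; the following Python sketch shows one such procedure (our own illustration – the estimation method actually used for the book may differ).

import numpy as np
from scipy.optimize import minimize

def zm_chi2(params, freqs):
    """Chi-square between observed ranked frequencies and the right-truncated
    Zipf-Mandelbrot expectation N*C*(b+r)^(-a), r = 1..n."""
    a, b = params
    if a <= 0 or b <= 0:                         # keep the search in a sane region
        return np.inf
    freqs = np.asarray(freqs, dtype=float)
    r = np.arange(1, len(freqs) + 1)
    w = (b + r) ** (-a)
    expected = freqs.sum() * w / w.sum()
    return ((freqs - expected) ** 2 / expected).sum()

# ranked F-motif frequencies of B-10 as read off Table 11.8
freqs = [14, 14, 12, 11, 10, 9, 9, 7, 7, 6, 5, 5, 4, 4] + [3]*6 + [2]*10 + [1]*69
res = minimize(zm_chi2, x0=[1.0, 2.0], args=(freqs,), method="Nelder-Mead")
print(res.x, zm_chi2(res.x, freqs))              # the book reports a = 0.9520, b = 2.0880, X² = 11.14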
Figure 11.3: Fitting the Zipf-Mandelbrot distribution to ranked F-motifs
The spectrum of F-motifs is also very regular. This can be seen from Table 11.9, which presents the results of fitting the right-truncated negative binomial distribution.

Table 11.9: The spectrum of F-motifs in B-10

Frequency    1      2     3     4     5     6     7     8     9     10    11    12    13    14
f(i)         70     9     6     2     2     1     2     0     2     1     1     1     0     2
NP(i)        69.07  9.15  4.96  3.37  2.51  1.98  1.61  1.35  1.14  0.99  0.86  0.75  0.66  0.59

k = 0.1392; p = 0.0483; X² = 3.71; DF = 7; P = 0.81
Its graphical presentation is in Figure 11.4. Needless to say, some other distributions from the family were adequate, too (Johnson-Kotz, Hyperpascal, Zeta, right-truncated Zeta, Waring, negative hypergeometric).
Figure 11.4: Spectrum of F-segments in B-10
So far, F-motifs behave like lexical or rhythmic units, that is, the theory of word frequency distributions holds true in this domain. Of course, further corroboration is necessary. Let us now consider the length or size of F-motifs, which can be read off Table 11.7 (see above). The size of an F-motif can be measured by the number of units constituting it. Again, we suppose that if F-motifs are word-like, their sizes follow some distribution known from word-length theory (cf. Grotjahn & Altmann 1993; Wimmer et al. 1994; Wimmer & Altmann 1996; Best 2001; Grzybek 2006). Collecting data from Table 11.7 we obtain the figures for B-10 given in Table 11.10.

Table 11.10: Frequency distribution of F-motif sizes in B-10

Size of F-motif x                 1     2      3      4      5     6     7
Number of different F-motifs f_x  5     29     31     21     10    1     2
NP_x                              4.97  28.64  32.38  20.30  8.80  2.92  1.00

a = 1.4067; b = 0.2439; X² = 0.47; DF = 3; P = 0.93
Without making premature generalizations, we use the Hyper-Poisson distribution, which is the basic distribution in this domain (cf. Best 1997, 2001).
The fitting results are presented graphically in Figure 11.5. This result is merely empirical – the computation was performed on the data of one text only – but it corroborates the hypothesis that the distribution of word frequency motifs, similarly to Köhler’s word length motifs, abides by a law; as such, it is another piece of evidence that word frequency motifs in text are of a rhythmical nature.
Figure 11.5: Distribution of length of the F-motifs in B-10
In Chapter 10 we already studied the relation of frequency to length. If we consider F-motifs as some text units, then there must be a power dependence between these two variables, frequency being the independent and length the dependent variable. Of course, for a given frequency, we must determine the mean length of the F-motifs. Using again Table 11.7, we obtain the results in Table 11.11. As can be seen, F-motifs behave like usual text units fulfilling also this condition: the length of F-motifs is a power function of their frequency. The dependence is not very smooth because some frequency classes are not sufficiently represented. But in longer texts we can expect a smooth course.
Table 11.11: Dependence of mean F-motif size on frequency

Frequency            1     2     3     4     5     6     7     9     10    11    12    14
Mean F-segment size  3.41  3     3     2     3     3     1.50  1.50  2     2     1     1.50
Expected value       3.68  2.98  2.63  2.41  2.25  2.13  2.03  1.88  1.82  1.77  1.73  1.65

y = 3.6787x^(−0.3045); R² = 0.61
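A power curve of this kind can be obtained, for example, by ordinary least squares in log-log coordinates; the following Python sketch is our own illustration (the book’s estimation procedure may differ, so the parameter values need not coincide exactly).

import math

freq = [1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 14]
size = [3.41, 3, 3, 2, 3, 3, 1.50, 1.50, 2, 2, 1, 1.50]

# straight line ln y = ln a + b ln x fitted by least squares
lx = [math.log(x) for x in freq]
ly = [math.log(y) for y in size]
n = len(lx)
b = (n * sum(u * v for u, v in zip(lx, ly)) - sum(lx) * sum(ly)) / \
    (n * sum(u * u for u in lx) - sum(lx) ** 2)
a = math.exp((sum(ly) - b * sum(lx)) / n)
print(a, b)   # a power dependence y = a*x**b; the book reports y = 3.6787*x**(-0.3045)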
The curve is presented in Figure 11.6.
Figure 11.6: The dependence of the size of F-motifs on their frequency
11.7 Distances between hapax legomena
Above we studied the runs, dividing the F-motifs into two classes: short and long. Here we shall consider only hapax legomena, i.e. the occurrences of words having frequency 1. All other words belong to another category (say 0). We assume that the distances between hapax legomena are controlled by a mechanism depending on language, genre, author, style, etc. Of course, only very intensive research could allow us to predict the kind of distance control. Nevertheless, we shall try to show the first step. Let us consider sentence 37 from B-10 as shown in Table 11.1 and in Section 11.3:

1,8,25,1,26,1,1,1,1,2,1,1,1,23,26,1,23,26,1,1,2,2,2,1,1,25,4,2,1,4,1,4,1

The distance between two 1’s is measured as the number of other categories between them, i.e. one can transcribe the above sequence as

100101111011100100110001100010101

which differs from the coding used for counting runs. Distance 0 is present in the sequence 11, distance 1 in the sequence 101, i.e. the distance between two hapax legomena (1) is measured in terms of the number of zeros between them. The above series can be considered a Markov chain – as proposed by Brainerd (1976) and evidenced for other formal units (cf. Strauß et al. 1984). The order of a Markov chain is determined by the number of predecessors influencing the probability of a successor. Since the distances between hapax legomena can be expressed in terms of transitions, we define their probabilities as follows (Y = distance):

P(Y = 0) = P(S_x = 1 | S_{x−1} = 1) = P(1|1)    (11.14)
i.e. the probability of distance zero is given by the probability that in the sequence S a 1 appears in position x under the condition that in position x − 1 there is also a 1. All other distances are given in the following way: a 1 is followed by k zeros which are followed by a 1, i.e.

1 00…0 1   (with k zeros in between)
yielding one transition from 1 to 0, k − 1 transitions between zeros, and one transition from zero to 1. This can be written as (distance k)

P(Y = k) = P(0|1) P(0|0)^{k−1} P(1|0).    (11.15)
Hence we obtain

P(Y = k) = P(1|1)                         for k = 0
         = P(0|1) P(0|0)^{k−1} P(1|0)     for k = 1, 2, . . .    (11.16)
Since P(1|1) is the complement of P(0|1) and P(0|0) is the complement of P(1|0), we can write the formula more simply as

P(Y = k) = 1 − α            for k = 0
         = α p q^{k−1}      for k = 1, 2, . . .   (q = 1 − p)    (11.17)

representing the modified geometric distribution (cf. Wimmer & Altmann 1999: 420). Computing the distances between hapax legomena in B-10 we obtain the results in Table 11.12. The theoretical values computed according to (11.17) are in the last row of the table. The parameters have been determined iteratively; different estimators can be found in Strauß et al. (1984). The fit is presented graphically in Figure 11.7.

Table 11.12: Distances between hapax legomena in B-10 as a Markov chain of first order

Distance y              0      1      2      3      4     5     6     7     8     9     10
Number of occurrences   87     78     38     21     11    4     1     2     0     0     1
NP_y                    86.26  78.25  39.18  19.62  9.82  4.92  2.46  1.23  0.62  0.31  0.31

q = 0.5008; α = 0.6450; X² = 1.84; DF = 6; P = 0.93
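The counting and the evaluation of (11.17) are straightforward to mechanize. The following Python sketch (our own illustration) extracts the distances from a frequency sequence and computes expected counts with the parameters of Table 11.12; N = 243 is the total number of distances in that table.

def hapax_distances(freqs):
    """Numbers of non-hapax words standing between successive hapax legomena."""
    pos = [i for i, f in enumerate(freqs) if f == 1]
    return [b - a - 1 for a, b in zip(pos, pos[1:])]

def modified_geometric(k, alpha, q):
    """P(Y = k) according to (11.17), with p = 1 - q."""
    return 1 - alpha if k == 0 else alpha * (1 - q) * q ** (k - 1)

# sentence 37 of B-10 as a small illustration
sent = [1,8,25,1,26,1,1,1,1,2,1,1,1,23,26,1,23,26,1,1,2,2,2,1,1,25,4,2,1,4,1,4,1]
print(hapax_distances(sent))
# expected counts for the whole text with the fitted parameters of Table 11.12
N = 243
print([round(N * modified_geometric(k, 0.6450, 0.5008), 2) for k in range(6)])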
As can be seen in Figure 11.7, the fit is almost perfect. This fact must, of course, be tested on different texts.
Figure 11.7: Markov chain of first order for the distances of hapax legomena
The fact that a Markov chain of first order is adequate to express the distances between hapax legomena should not tempt us to suppose that Markov chains of higher order are adequate for the distances of more complex (= with longer dependences) binary phenomena. As can easily be seen, if we take higher orders, the geometric distribution will be further modified step by step and can capture any serial dependence. But a formula capturing everything does not explain anything. Hence we rather recommend assigning formal elements to two classes: those whose distances can be captured by (11.17) and those that cannot. For those that cannot, an urn model or a stochastic process leading to the negative binomial distribution should be used (cf. Strauß et al. 1984).
12 The type-token relation

12.1 Introduction
One of the oldest problems of textology is that of the type-token relation, beginning on the level of phonemes and graphemes and ending somewhere on the level of hrebs (sentence aggregates). Take any inventory of linguistic units and take a text. The position of units in the text is numbered from 1 to N and forms the independent variable. The dependent variable y can be set to change with the appearance of a new unit at position x, i.e. a unit from the inventory that has not yet occurred up to position x. A very simple sequence of y values can be formed: each time a new unit occurs, the value of y is increased by 1. Take for example the title of this chapter, “the type-token relation”, and let the units be the letters. We obtain the following result, shown graphically in Figure 12.1.

                      t  h  e  t  y  p  e  t  o  k  e  n  r  e  l  a  t  i  o  n
Position (tokens) x   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Sequence (types) y    1  2  3  3  4  5  5  5  6  7  7  8  9  9 10 11 11 12 12 12
Figure 12.1: TTR of a letter sequence
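The counting procedure is simple enough to state as a few lines of Python (our own illustration):

def ttr_sequence(tokens):
    """Cumulative number of types after each token (standard TTR measurement)."""
    seen, y = set(), []
    for t in tokens:
        seen.add(t)
        y.append(len(seen))
    return y

print(ttr_sequence(list("thetypetokenrelation")))
# [1, 2, 3, 3, 4, 5, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 12, 12]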
The first repetition is at position x = 4, where the type sequence y does not increase because the letter type “t” has already occurred. The higher the units are in the hierarchy of units, the less they are repeated, i.e. the steeper the sequence is. When studying letters as units, we know that the asymptote of y must be the number of letters in the inventory, e.g. 26 in English. Even with syllables, the problem of finding this kind of asymptote is not unsolvable. But with morphemes and words we run up against a seemingly unsolvable problem. How many morphemes and words are there (cf. Kornai 2002)? If in statistics convergence to normality is achieved very quickly, what does twenty million mean in language? This is, namely, the number of words estimated to exist in German (including terminology). Thus, working with the concept of an infinite number of words is theoretically quite legitimate, and models of, e.g., word stock growth must use it, because we cannot assume that a time will come in which no new words will be coined. However, the situation in textology is slightly different. There are no infinite texts! Corpora consisting of one billion words, or novels consisting of several chapters, are not texts but text mixtures. All texts are finite and have a finite vocabulary size V. Thus the type-token function y must either have the asymptote V, or V will be attained empirically long before position N and there is no increase any more because all following words are repetitions. The curve can increase but cannot surpass V at position N. This fact boils down to a somewhat pessimistic conclusion: all type-token models assuming an infinite number of words are theoretically unsatisfactory even if they yield excellent fits. Many of them have been derived in a congenially sophisticated manner, starting from assumptions which later on turned out to be evidently false. There is an enormous number of publications; we refer here to bibliographies (cf. Köhler 1995) even if they cannot provide up-to-date information. Let us begin with qualitative questions. What does a type-token function indicate or represent, and why is its study interesting? First of all, it has nothing to do with vocabulary richness. The strong dependence of vocabulary on text length does not allow us to say that a text with 800 types is lexically poorer than a text with 1000 types. Statements of this kind are possible only if one devises a test in which the TTR function is given a priori and the numbers of types in individual texts are considered as distances from the theoretical curve. But there is no such theoretical function; the parameters are different for every text – even if we have the same function type. The original “richness” formula V/√N has been modified many times
in order to eliminate the influence of N, and some of the modifications yield excellent fits as TTR functions, but not as “richness” curves. Another interpretation considers TTR as the number of words a writer has at his disposal for writing the given text. As a matter of fact, all words he knows are at his disposal, but because of restrictions on text length, thematic restrictions, genre restrictions (e.g. rhythmic bounding in poems), etc., he cannot use more. If such boundary conditions cannot be taken into account, the interpretation is not reasonable. The same holds for the opinion that TTR is a measure of the word stock of a writer. Writers usually know all the words of their language (minus scientific terminology and technical vocabularies) – just like every adult – and they play with words. Every text of a writer leads to a different prediction of “his word stock”, hence this interpretation is also somewhat unsatisfactory. The very nebulous concept of a writer’s “active vocabulary” is not helpful either. It is sufficient to ask to what extent a certain word is active in order to see that different situations activate different words. Thus activity is rather a quantity whose ascertainment cannot be achieved by binary decisions. Now, what is TTR? Consider first the topic of a given text. The topic is a concept, nothing more. Concepts can be expressed with one word (“girl”), with two words (“nice girl”), with a phrase (“nice smiling and dancing girl”), or with a sentence (“A nice girl stands in the door and smiles”). From a linguistic point of view we speak of different units or levels or modes of expression, but epistemologically we have only a more or less specified concept. Elaborating on this idea, a text, e.g. a novel, is also only a concept, but a very well specified one (cf. Köhler & Altmann 1993). The main concept has a number of primary predicates, these have further predicates, etc.; they are all connected with auxiliaries and references, and the whole results in a text. The main concept can be simple or composite, e.g. “a king” or “a king and his three sons”. When we begin reading a text, we see that the main concept is explicated by its predicates at a varying pace. The predicates of different order can follow one another in quick succession, or some of them can be repeated and the deployment of the topic can be slowed down on psychological or text-processing grounds. In other words, the information can be conveyed slowly or quickly. “Information” as used here is not the well-known concept from information theory but simply the content, the topic: the main concept and its (social, cultural, political, real, emotional, fictional, . . . ) environment. The TTR function shows this deployment. The TTR function is not a measure of richness but a measure of the information flow, of topic deployment.
According to Orlov (cf. Orlov, Boroda & Nadarejšvili 1982), the writer “plans” a certain length of the text – Orlov called it the Zipf size – and the TTR function develops on the basis of this size. The information deploys slowly if the size is great, and quickly if it is small. In many cases the assumption is correct, but there is a difference between the planned size and the realized size, even though Orlov gives a derived formula. As a matter of fact, all known formulas capture the TTR data more or less exactly – not caring for asymptotes or upper bounds – and almost all concern the plain counting shown above. But there are still other possibilities of counting/measurement if we do not identify the TTR function with vocabulary richness. We shall show some of them in the following sections. We have already mentioned the lemmatization problem, which gives rise to never-ending discussions. We shall ignore it here, but we cannot ignore the problem of uppercase and lowercase letters. If a TTR program distinguishes between them, it automatically yields false results (the same word can be counted as two different words). The well-known Georgetown counter automatically changes all letters to uppercase, which is a correct procedure for English but not for German, cf. e.g. “der Gefangene floh” (the captive fled) vs. “der gefangene Floh” (the trapped flea). Another problem is that of homonyms. If we consider TTR as information flow, homonyms should be distinguished. But since we consider the flow of word forms, the disturbance caused by homonyms can be neglected. Researchers must consider the weight of homonymy in every language separately.
12.2 Standard measurement
Out of the enormous number of possible functions (concave, monotonically increasing) that have appeared in the literature, let us use those which have a finite asymptote (V) and follow from the general theory (cf. Wimmer & Altmann 2005). These two requirements considerably reduce the number of available models. Since we consider word forms here and all our texts are prepared in this manner, everything that follows will be restricted to word forms. The standard measurement itself can have two variants: (1) all positions are taken into account, or (2) only those positions at which there is an increase. In our example above we would obtain for case (2)
Position (tokens) x   1  2  3  5  6  9 10 12 13 15 16 18
Sequence (types) y    1  2  3  4  5  6  7  8  9 10 11 12

yielding a “smoother” curve, as can be seen in Figure 12.2.
Figure 12.2: Measuring only increases
For word forms, we start from the general theory and assume that the linear growth of new forms – fulfilling the “application requirement”, which represents the fact that existing linguistic material, in this case words or word forms, will be applied to express the concepts to be transmitted via the given text, i.e. the more different concepts, the more different words – must be braked by the slightly smaller requirement of decoding effort and transmission security. Writing these assumptions together we obtain

dy/y = (1/x − 1/(x + b)) dx    (12.1)

yielding ln y = ln x − ln(x + b) + C. Solving for y, the result is

y = ax/(x + b)    (12.2)
where a = e^C. This is the well-known Tornquist curve, introduced for this purpose by Tuldava (1974, 1998). The asymptote of this function is a = V. However, in most cases we find that if we fix the asymptote, the function does not yield a good fit, because it approaches V only at infinity. If, on the other hand, we let a > V, contradicting reality, the fit is better, mostly perfect. Consider the fitting to the Hungarian text H-04. In Figure 12.3 we have the data and the curve y = 413x/(241.53 + x) with R² = 0.85; in Figure 12.4 we have the data and the curve y = 1238.57x/(1592.07 + x) with R² = 0.9988. The visual difference is very great. In Figure 12.4 the theoretical and the empirical points almost coincide, but the asymptote is empirically false. The function could be modified in different ways, but that would merely produce one more function. Another disadvantage of this function is the fact that for x = 1 in (12.2), y must be 1, too. In that case 1 = a/(b + 1), i.e. b = a − 1, but in neither of the two functions do we find this. The function in Fig. 12.4 even has a value less than 1 for x = 1.
Figure 12.3: TTR for H-04 with fixed asymptote (a = 413, Tornquist)
From the great number of other functions we choose the simplest one here, namely Herdan’s (1966), based on the assumption that the relative increase rate of new types is proportional to the relative increase rate of the number of tokens, i.e.

dy/y = b dx/x    (12.3)
Figure 12.4: TTR for H-04 with optimized asymptote (Tornquist)
resulting in

y = ax^b    (12.4)
where a is the integration constant. Since for x = 1 also y = 1, we get a = 1 and the curve has the form

y = x^b.    (12.5)

Fitting (12.5) to H-04 we obtain the result in Figure 12.5.
Figure 12.5: Fitting (12.5) to H-04
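Fits of this kind can be reproduced, for example, with a standard nonlinear least-squares routine. The following Python sketch (our own illustration, assuming that y is a TTR sequence obtained as in Section 12.1) fits both the Tornquist curve (12.2) and Herdan’s curve (12.5):

import numpy as np
from scipy.optimize import curve_fit

def tornquist(x, a, b):
    return a * x / (x + b)

def herdan(x, b):
    return x ** b

def fit_ttr(y):
    """Fit (12.2) and (12.5) to a TTR sequence y[0..N-1] (types up to position x)."""
    x = np.arange(1, len(y) + 1, dtype=float)
    y = np.asarray(y, dtype=float)
    (a, b), _ = curve_fit(tornquist, x, y, p0=[y[-1], len(y) / 2])
    (c,), _ = curve_fit(herdan, x, y, p0=[0.9])
    return (a, b), c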
The discussion of other functions, methods and tests can be found in Wimmer & Altmann (1999) and Wimmer (2005). In Table 12.1 we present the two functions for several texts in several languages.
Table 12.1: Tornquist’s and Herdan’s TTR curves applied to different texts

Text    Tornquist a   Tornquist b   R²       Herdan b   R²
G-01    2394.58       4049.73       0.9951   0.8935     0.9970
G-02    1005.58       1511.71       0.9988   0.8824     0.9900
G-03     840.38       1041.48       0.9976   0.9117     0.9890
G-04     668.19        816.39       0.9986   0.8993     0.9800
G-05    1108.02       1350.59       0.9989   0.9921     0.9920
G-06    1037.62       1208.60       0.9992   0.9260     0.9900
G-07     802.93       1005.22       0.9994   0.9228     0.9980
G-08    1673.33       2305.38       0.9994   0.9095     0.9940
G-09    1245.51       1537.87       0.9988   0.9211     0.9920
G-10    1262.54       1546.40       0.9985   0.9296     0.9960
G-12     712.79        814.62       0.9995   0.9343     0.9960
G-13     855.99       1081.31       0.9987   0.9116     0.9910
G-14     489.68        534.03       0.9991   0.9351     0.9940
G-15    1481.34       1722.45       0.9993   0.9372     0.9940
G-16     712.07        767.42       0.9982   0.9190     0.9750
G-17     282.89        313.06       0.9975   0.9003     0.9700
H-04    1238.57       1592.07       0.9988   0.9114     0.9890
H-01    2937.66       5181.32       0.9985   0.8797     0.9900
H-02    2393.13       2632.74       0.9985   0.8949     0.9880
H-03    1227.98       1682.19       0.9990   0.9048     0.9930
H-05    1434.09       2012.87       0.9994   0.9081     0.9960
In-01    476.55        562.12       0.9964   0.9151     0.9910
In-02    610.89        740.55       0.9983   0.9081     0.9870
In-03    455.46        466.56       0.9994   0.9164     0.9700
In-04    647.08        702.61       0.9993   0.9280     0.9860
In-05    413.92        462.10       0.9973   0.8890     0.9580
M-01     698.22       1795.83       0.9965   0.7855     0.9700
M-02     528.55       1041.85       0.9972   0.8100     0.9680
M-03     430.47        896.32       0.9960   0.7839     0.9440
M-04     480.53        838.34       0.9957   0.8051     0.9160
M-05     796.53       2318.80       0.9908   0.7683     0.9520
T-01    2274.25       4360.06       0.9972   0.8732     0.9980
T-02    2266.64       4114.96       0.9977   0.8768     0.9950
T-03    1816.49       3718.90       0.9994   0.8542     0.9930
The standard measurement yields a smooth function, evoking the illusion that we have captured aspects of vocabulary richness. Visually, the parameters of Herdan’s curve move in a very narrow interval, but a test would easily show that the differences are significant. This is not so much a data problem as a problem of the t-test, for which our sample sizes are too large. Since Herdan’s function is the simplest one and the parameter b is relatively stable, we can accept it as a preliminary model. Of course, a number of questions can be asked concerning b, but they are easier to answer than problems concerning the parameters of Tornquist’s curve.
12.3 Köhler-Galle method
Conveying information is not only a matter of the sequence of types which specify the theme; it also depends on the absolute size of the vocabulary V (= the number of all types). It can happen that the types are concentrated in the first part of the text while the rest repeats the same words. Hence Köhler and Galle (1993) proposed an index taking into account both the vocabulary V and the text length N. Let x be the position in the text (x = 1, 2, . . . , N) and y_x the number of types up to position x (y_x = 1, 2, . . . , V); the Köhler-Galle TTR measure is

z = (y_x + V − xV/N) / N.    (12.6)
Since this is merely a slight transformation of the data, z can be computed from

z = (x^b + V − xV/N) / N = a(x^b − cx) + d.    (12.7)
Optimizing the parameters, one obtains values which differ from those obtained from V and N, but the function yields a good fit. The computation is shown for Goethe’s Erlkönig in Figure 12.6. The computed function (12.7) is

z = 0.0147(x^0.9090 − 0.6101x) + 0.5419    (12.8)

yielding R² = 0.93. As can be seen in Figure 12.6, the values of z describe a slightly oscillating asymmetric arch, which gives a better impression of the irregularities of the TTR than the original TTR. The longer the text, the smaller the oscillation.
Figure 12.6: Fitting (12.8) to the TTR of Erlkönig
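Computing the measure (12.6) from a tokenized text is a one-pass affair; the following Python sketch is our own illustration:

def koehler_galle(tokens):
    """Köhler-Galle TTR measure (12.6) for every position of a tokenized text."""
    N = len(tokens)
    V = len(set(tokens))
    seen, z = set(), []
    for x, t in enumerate(tokens, start=1):
        seen.add(t)
        z.append((len(seen) + V - x * V / N) / N)
    return z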
12.4 The ratio method
Different researchers prefer computation using the transformation

z = y_x / x    (12.9)
but this results in a very rhapsodic “curve” whose beginning is a straight line up to the second type (cf. Wimmer 2005: 27.3), after which oscillations begin. Theoretically, we should obtain

z = y_x / x = x^b / x = x^(b−1),

but this curve is adequate only in very rare cases. However, if we use the Tornquist function, we obtain

z = a / (b + x)    (12.10)
but in order to begin with z(1) = 1, we solve (12.10) and obtain b = a − 1, thus

z = a / (a − 1 + x).    (12.11)

Fitting this function to the data of Erlkönig we obtain

z = 242.7957 / (241.7957 + x)
Figure 12.7: Fitting (12.11) to the data of Erlkönig
yielding R² = 0.96. The fit is displayed graphically in Fig. 12.7. It is to be noted that there are many functions yielding a better fit, but at the price of more parameters and of the difficulty of finding a theoretical foundation. A better fit could also be achieved by pooling some x values and taking averages over intervals, but there is no unified pooling method up to now, and it is questionable whether such a method can be found: there is a great difference between making 20 intervals for a text of length N = 100 and for a text of length N = 10,000. The original picture conveyed by TTR measures as information flow would completely disappear.
12.5 Stratified measurement (the window method)
A similar problem is connected with the stratified measurement, which aims at measuring the stability of the TTR. Here one computes the increase of types for x = 1, 2, . . . , K (taking an arbitrary K), then for x = K + 1, . . . , 2K, x = 2K + 1, . . . , 3K, etc. This means that one takes equal passages of K words and counts the TTR for each passage separately. There is no theoretical foundation for the size of the passage, but the aim here is not to model the information flow, only its stability. In practice this means that one compares either the slopes of the partial TTR segments or all segments by means of Cochran’s Q-test, etc. Since we assume the validity of Herdan’s function, whose slope at x = 1 is b, we must test the homogeneity of all b’s. Taking the logarithm of Herdan’s function we get
log y = b · log x,   or   Y = b · X,
i.e. the logarithmic values yield the usual linear regression without the additive constant. For testing one can use the familiar F-test

F(n_1, n_2) = [ ∑_{j=1}^{K} X_j² · ∑_{i=1}^{k} (b_i − b̄)² / n_1 ] / { [ ∑_{i=1}^{k} ∑_{j=1}^{K} Y_{ij}² − ∑_{j=1}^{K} X_j² · ∑_{i=1}^{k} b_i² ] / n_2 }    (12.12)
where the b_i are the individual slopes of the k strata (windows), K is the common length of all strata, b̄ is the mean of all b_i, n_1 = k − 1 and n_2 = k(K − 1). Another possibility is to test the differences b_1 − b_2, b_2 − b_3, . . . stepwise, or to test first b_1 − b_2, then pool b_1 and b_2 and test them against b_3, etc. We shall perform the test using formula (12.12). To this end we use Toni Morrison’s Nobel lecture E-02. We partition the text into 9 consecutive parts of 300 words each and discard the remainder at the end of the text. First we compute the parameter b of Herdan’s function for each part and obtain the following result (P_i = part i):

P1      P2      P3      P4      P5      P6      P7      P8      P9
0.8927  0.8866  0.9275  0.9037  0.9084  0.9071  0.9097  0.8996  0.9023
The parameters seem to be very similar and at a first glance one would intuitively accept the null hypothesis of their equality. But a test can provide more accurate results. Since the complete table with all parts would be too long we reproduce only the results of the calculations.
k = 9, n_1 = 8, K = 300, n_2 = 9(299) = 2691,

∑_{i=1}^{9} b_i = 8.1376;   b̄ = 0.9042;   ∑_{i=1}^{9} (b_i − b̄)² = 0.001629;   ∑_{i=1}^{9} b_i² = 7.3589;

∑_{j=1}^{300} (ln X_j)² = 6951.9342;   ∑_{i=1}^{9} ∑_{j=1}^{300} (ln Y_{ij})² = 51801.4883,

hence the test yields

F = [6951.9342 · 0.001629 / 8] / {[51801.4883 − 6951.9342 · 7.3589] / 2691} = 5.93.
Since F_{0.005}(8, ∞) = 2.744, we can safely reject the null hypothesis of homogeneity of the parameters b_i, i.e. the slopes are not equal, the curves are not parallel. In practice this means that somewhere in the text there are structural breaks in the TTR. We may suppose that the text was not written in one go; there were pauses in writing, or much editing. The search for breaks in a text is a special discipline which will not be treated here. Nevertheless, we can track down a possible break by performing the above-mentioned test of the difference between the parameters b of subsequent passages. Unfortunately, there is no objective method for determining the length of the passage. A significant difference does not mean that the break lies exactly between the passages, but simply that they are not parallel and that in one of them something happened. One can even use passages of different length. Let the position in the passage be the independent variable x and y the TTR value (number of types up to x). After a logarithmic transformation of both variables, in order to obtain linear regression, we use the formula

t_{K_1+K_2−4} = (b_1 − b_2) / [ s · √( 1/∑_{i=1}^{K_1}(X_{1i} − X̄_1)² + 1/∑_{i=1}^{K_2}(X_{2i} − X̄_2)² ) ]    (12.13)

where K_1 and K_2 are the passage lengths and

s² = [ (K_1 − 2)s_1² + (K_2 − 2)s_2² ] / (K_1 + K_2 − 4)    (12.14)
and s_i² is simply the variance of the dependent variable Y_i (i = 1, 2). Since we take equal passage lengths, the formulas can be simplified. Let us compare the first two passages of Morrison’s lecture E-02, with b_1 = 0.8927 and b_2 = 0.8866. Our samples are of equal length K = 300, hence we have

∑_{i=1}^{300} (X_i − X̄)² = 278.7390,

from which we get

√(2/278.7390) = 0.0847,

and

s_1² = 200.6719,   s_2² = 212.4406,

from which we obtain s² = 206.5563 and s = 14.3721.
Hence t = (0.8927 − 0.8866)/(14.3721 · 0.0847) ≈ 0.005. Since t has DF = 596 degrees of freedom, the difference is not significant: the first two TTR lines are parallel. This means that the break lies in some other passage. The problem will not be pursued further here.
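The window computation and the homogeneity test (12.12) can be scripted directly; the following Python sketch is our own illustration (function and variable names are not from the original).

import math

def ttr_loglog(tokens):
    """Natural logarithm of the cumulative type count at each position of a passage."""
    seen, Y = set(), []
    for t in tokens:
        seen.add(t)
        Y.append(math.log(len(seen)))
    return Y

def window_f_test(tokens, K):
    """Homogeneity test (12.12) for the Herdan slopes of the k windows of length K."""
    k = len(tokens) // K
    X = [math.log(j) for j in range(1, K + 1)]
    SX2 = sum(x * x for x in X)
    bs, SYY = [], 0.0
    for i in range(k):
        Y = ttr_loglog(tokens[i * K:(i + 1) * K])
        SYY += sum(y * y for y in Y)
        bs.append(sum(x * y for x, y in zip(X, Y)) / SX2)   # slope through the origin
    b_bar = sum(bs) / k
    n1, n2 = k - 1, k * (K - 1)
    F = (SX2 * sum((b - b_bar) ** 2 for b in bs) / n1) \
        / ((SYY - SX2 * sum(b * b for b in bs)) / n2)
    return F, bs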
12.6 The TTR of F-motifs
In Chapter 11 we introduced the F-motifs (see Uhlířová 2007) and found that with respect to their frequency properties they behave like words. Here we shall test their TTR properties using the first variant of the standard measurement. If their TTR follows one of the two functions, we can consider them as established units. The result of the fitting for text B-10 is presented in Table 12.2.
Table 12.2: Fitting the Tornquist and Herdan curves to the TTR of F-segments in B-10

Token  Type  Tornquist  Herdan      Token  Type  Tornquist  Herdan
1      1      1.01       1.00       113    68    68.27      60.65
2      2      2.01       1.83       114    69    68.63      61.11
3      3      3.00       2.60       115    69    68.99      61.58
4      4      3.97       3.33       116    70    69.34      62.04
5      5      4.93       4.05       117    71    69.69      62.51
6      6      5.89       4.74       118    71    70.04      62.97
7      7      6.83       5.42       119    71    70.38      63.43
8      8      7.76       6.08       120    71    70.72      63.90
9      9      8.68       6.74       121    72    71.06      64.36
10     10     9.59       7.39       122    73    71.40      64.82
11     10     10.49      8.02       123    73    71.74      65.28
12     11     11.38      8.65       124    73    72.07      65.74
13     12     12.25      9.27       125    73    72.40      66.20
14     13     13.12      9.89       126    73    72.73      66.66
15     13     13.98      10.50      127    74    73.06      67.12
16     14     14.83      11.11      128    75    73.38      67.58
17     15     15.67      11.71      129    75    73.71      68.04
18     16     16.50      12.30      130    75    74.06      68.49
19     17     17.33      12.89      131    76    74.34      68.95
20     18     18.14      13.48      132    76    74.66      69.41
21     18     18.95      14.07      133    77    74.97      69.86
22     19     19.74      14.65      134    77    75.28      70.32
23     19     20.53      15.22      135    77    75.59      70.78
24     20     21.31      15.79      136    78    75.90      71.23
25     21     22.08      16.36      137    78    76.21      71.69
26     22     22.84      16.93      138    79    76.51      72.14
27     23     23.60      17.50      139    79    76.81      72.59
28     24     24.35      18.06      140    79    77.11      73.05
29     25     25.09      18.62      141    79    77.41      73.50
30     26     25.82      19.17      142    79    77.70      73.95
31     27     26.54      19.73      143    79    78.00      74.40
32     27     27.26      20.28      144    79    78.29      74.86
33     28     27.97      20.83      145    79    78.58      75.31
34     29     28.68      21.37      146    79    78.87      75.76
35     30     29.37      21.92      147    79    79.15      76.21
36     30     30.06      22.46      148    79    79.44      76.66
37     31     30.74      23.00      149    79    79.72      77.11
38     31     31.42      23.54      150    79    80.00      77.56
39     31     32.09      24.08      151    79    80.28      78.01
40     32     32.75      24.61      152    79    80.56      78.45
41     33     33.41      25.15      153    79    80.83      78.90
42     33     34.06      25.68      154    79    81.10      79.35
43     34     34.70      26.21      155    79    81.38      79.80
44     35     35.34      26.74      156    79    81.65      80.24
45     35     35.97      27.26      157    79    81.92      80.69
46     36     36.60      27.79      158    80    82.18      81.14
47     37     37.22      28.31      159    81    82.44      81.58
48     37     37.83      28.83      160    81    82.71      82.03
49     38     38.43      29.36      161    81    82.97      82.47
50     39     39.04      29.88      162    81    83.23      82.92
51     40     39.64      30.39      163    81    83.49      83.36
52     41     40.23      30.91      164    81    83.75      83.81
53     42     40.81      31.43      165    81    84.01      84.25
54     42     41.39      31.94      166    81    84.26      84.69
55     42     41.97      32.45      167    81    84.51      85.13
56     43     42.54      32.96      168    81    84.76      85.58
57     44     43.11      33.48      169    81    85.01      86.02
58     44     43.67      33.98      170    81    85.26      86.46
59     45     44.22      34.49      171    81    85.51      86.90
60     46     44.77      35.00      172    81    85.75      87.34
61     47     45.32      35.51      173    81    86.00      87.78
62     48     45.86      36.01      174    81    86.24      88.23
63     48     46.39      36.51      175    81    86.48      88.67
64     49     46.92      37.02      176    82    86.72      89.11
65     49     47.45      37.52      177    83    86.96      89.54
66     49     47.97      38.02      178    83    87.20      89.98
67     50     48.49      38.52      179    83    87.43      90.42
68     51     49.00      39.02      180    83    87.66      90.86
69     52     49.51      39.52      181    83    87.90      91.30
70     52     50.01      40.01      182    84    88.13      91.74
71     53     50.51      40.51      183    85    88.36      92.17
72     53     51.01      41.00      184    86    88.59      92.61
73     53     51.50      41.50      185    87    88.81      93.05
74     53     51.99      41.99      186    88    89.04      93.49
75     54     52.47      42.48      187    88    89.27      93.92
76     54     52.95      42.97      188    89    89.49      94.36
77     55     53.43      43.47      189    89    89.71      94.79
78     55     53.90      43.96      190    89    89.93      95.23
79     55     54.37      44.44      191    90    90.15      95.66
80     55     54.83      44.93      192    90    90.37      96.10
81     55     55.29      45.42      193    91    90.59      96.53
82     55     55.75      45.91      194    92    90.80      96.97
83     56     56.20      46.39      195    93    91.02      97.40
84     57     56.65      46.88      196    94    91.23      97.83
85     57     57.10      47.36      197    94    91.44      98.27
86     57     57.54      47.84      198    94    91.67      98.70
87     58     57.98      48.33      199    94    91.87      99.13
88     58     58.41      48.81      200    94    92.08      99.57
89     58     58.84      49.29      201    94    92.28      100.00
90     58     59.27      49.77      202    95    92.49      100.43
91     58     59.70      50.25      203    95    92.70      100.86
92     58     60.12      50.73      204    96    92.90      101.29
93     59     60.54      51.21      205    96    93.10      101.72
94     60     60.95      51.69      206    96    93.31      102.15
95     61     61.36      52.16      207    96    93.51      102.59
96     61     61.77      52.64      208    96    93.71      103.02
97     62     62.18      53.12      209    97    93.91      103.45
98     63     62.58      53.59      210    97    94.10      103.88
99     64     62.98      54.07      211    97    94.30      104.30
100    64     63.38      54.54      212    97    94.50      104.73
101    64     63.77      55.01      213    97    94.69      105.16
102    65     64.16      55.49      214    97    94.89      105.59
103    65     64.55      55.96      215    98    95.08      106.02
104    65     64.93      56.43      216    98    95.27      106.45
105    65     65.32      56.90      217    98    95.46      106.88
106    65     65.69      57.37      218    98    95.65      107.30
107    66     66.07      57.84      219    98    95.84      107.73
108    66     66.44      58.31      220    98    96.03      108.16
109    67     66.82      58.78      221    98    96.22      108.58
110    67     67.18      59.25      222    98    96.40      109.01
111    67     67.55      59.71      223    98    96.59      109.44
112    68     67.91      60.18      224    99    96.77      109.86
The graphical representation of the data from Table 12.2 is given in Figures 12.8a and 12.8b.
(a) Tornquist function
(b) Herdan’s function
Figure 12.8: Fitting the Tornquist function and Herdan’s function to the TTR of F-motifs
As can be seen, the Tornquist function yields a visually better result, but Herdan’s function could be improved by including a multiplicative constant. Here we do not strive for the best possible result but for testing a law and establishing a linguistic unit. The Tornquist function yields y = 168.2773x/(165.5211 + x) with R² = 0.995, while the Herdan function yields y = x^0.8684 with R² = 0.923, which can be improved by including a multiplicative constant in order to obtain the usual power curve, yielding R² = 0.988. Since the F-motifs behave like words with regard to TTR, we can consider them quite “legal” units.
13 Conclusions
Word frequency is a global property. It can be found on the surface of the text and is the forerunner of an immense research domain. “Word” can be replaced by any linguistic unit, even by such abstract ones as the F-motif (frequency motif), L-motif (length motif), T-motif (polytextuality motif), S-motif (polysemy motif), etc. Further, “frequency” can also be replaced by any property, one of which was discussed in Chapter 10. Hence, the combination “unit + property” yields – modestly counted – about 200 research objects, the majority of which have not even been touched in the literature. Further, all properties are connected with one another, some strongly, some weakly. If we restrict ourselves to 20 properties, we obtain 190 combinations of 2 properties, 1140 combinations of 3 properties, etc.; multiplied by the number of established units, this yields an astronomical number of research possibilities. Evidently the field of quantitative research grows exponentially as more properties and units are defined. This number can be increased still further if we realize that every property can be measured in different ways. Consider for example entropy, which is defined in more than 40 ways in different sciences. And there is no end if, e.g., the complexity of texts becomes an object of investigation. We restricted ourselves to word frequency alone and showed that treating it systematically reveals more than the simple form of its distribution handed down through the decades. It reveals something about the text, about the author, about the genre and about the given language. It turned out that by simple counting one can arrive at an independent way of measuring the analyticity of a language, and it has been shown that old concepts like vocabulary richness, text coverage, etc. can be interpreted very exactly. Some new concepts were coined simply by interpreting our measurements, like thematic concentration, crowding of autosemantics, pace filling, etc. It has been shown that the F-motif, a unit created by very indirect measurement, has word-like properties. Some of the subjects, e.g. the different “points” in frequency distributions, appear here for the first time. Currently, the h-point seems to be a concept that can stimulate both linguistic and mathematical investigations. Not all of its aspects have been discussed here, in order not to burden the book. We only touched on the problem of the self-organization of formal units in long sentences,
stimulating the search both for other units displaying this behaviour and for kinds of self-organization which need not consist only in forming runs. The study of symbolic motifs in text began with Markov chains; today a number of new methods are known. Each text can be represented as a graph based on different properties. As soon as we have a graph, we can formally study its properties. Many of them can be directly interpreted in linguistic terms, showing aspects of text organization. Though speakers are not aware of this process, text graphs organize themselves in a very characteristic way. The theory of rank-frequency and other distributions is nowadays a standard discipline of quantitative linguistics: there is a family of distributions of its own, we know the characteristic behaviour displayed using Ord’s criterion, we can define a number of indicators characterizing the degree of a property, and we can give them a linguistic interpretation. Needless to say, none of the problems has been investigated with all its consequences. This task would be impossible because science develops continuously and nobody can capture all texts in all languages. We are almost sure that the investigation of some “extreme” languages or some dead languages will bring quite new aspects into this discipline. In any case, we hope that this book brings further new stimuli either for the development or for the correction of existing views. In this concluding chapter, we want to give some recommendations for further research and mention some problems worth considering. Any of the chapters of this book can be applied to texts from different points of view:

– One can deepen any problem by studying texts in a historical perspective. How do individual properties change over centuries, how did they change in the last century, how do they change in reaction to political events? This study can be restricted to one language, or one can perform parallel investigations in several languages.

– The ontogenesis of text sorts may be followed up by studying old manuscripts, primitive forms, or different styles in their development. The qualitative development can be expressed quantitatively. We warn against premature generalizations. Text sorts can differ across languages; they are no constants – they, too, develop, have a remarkable dispersion, and their boundaries are fuzzy.

– Individual texts (authors, styles) can be better distinguished by means of some frequency indicators than by means of qualitative indicators.
Quantitative comparisons should be preceded by explicitly testable hypotheses and followed by a qualitative interpretation of the results.

– Develop each problem by applying it to further texts and further languages. We shall never obtain a definitive proof of a hypothesis, but every new text can bring further corroboration.

– Develop the mathematical apparatus, if possible and if necessary. Remember that the famous Markov chains originated in the study of texts.

The number of individual problems is, as a matter of fact, infinite; it grows with time. None of the problems will ever be forgotten, though some will perhaps be reformulated. Even old problems can give new impulses and lead to new evaluations. Since even linguists have some language barriers, we recommend reviving older problems formulated and published in “non-accessible” languages. The present book is part of a larger project dedicated to text analysis. It contains the necessary foundations, which will be developed further in the next volumes. To give a preliminary preview: the next part of the project will contain a new view of the Zipf law based on stratification, a modified computation of the h-point, the angle of the h-point, the golden section, a ternary plot of text classification, and the use of arc length for typological considerations together with different indicators which may be used alternatively; tests for their differences and their relationships will be proposed. Further, some indicators of language levels which signal certain language constants, the study of hapax legomena, the nominal style and many other issues will be elucidated. The last part of the project will consist of a thorough analysis of a restricted group of texts, containing stylistic, developmental and associative considerations.
14 Appendix: Texts
Bulgarian (Ludmila Uhlířová)

For the South Slavic language of Bulgarian, ten private letters served as text material; they were written by and addressed to persons (both men and women) with university education, mostly on business topics. The owner of the letters gave consent to use them for linguistic analysis; some necessary steps were taken to anonymize the texts.

B-01 Boris2     B-06 Janko4
B-02 Ceneva1    B-07 Jorn2
B-03 Ceneva2    B-08 Saša
B-04 Janko1     B-09 Živko1
B-05 Janko3     B-10 Živko2
Czech (Ludmila Uhlířová)

With regard to the West Slavic language of Czech, ten short stories by Bohumil Hrabal (1914–1997), one of the most important Czech writers of the second half of the 20th century, were analyzed. The texts are taken from Vol. 3 of his collected writings: Jarmilka, Pražská imaginace: Prague 1992.

Cz-01 Hrabal 310: Expozé panu ministru informací (Jarmilka, 44–47)
Cz-02 Hrabal 315: Lednová povídka (Jarmilka, 58–61)
Cz-03 Hrabal 316: Únorová povídka (Jarmilka, 62–69)
Cz-04 Hrabal 319: Blitzkrieg (Jarmilka, 86–87)
Cz-05 Hrabal 323: Protokol (Jarmilka, 129–131)
Cz-06 Hrabal 325: Trať číslo 32a (Jarmilka, 152–156)
Cz-07 Hrabal 326: Fádní stanice (Jarmilka, 157–162)
Cz-08 Hrabal 328: Očekávej mě (Jarmilka, 170–172)
Cz-09 Hrabal 329: Dětský dům (Jarmilka, 173–174)
Cz-10 Hrabal 330: Všední hovor (Jarmilka, 175–178)
English (I.-I. Popescu)

For English, thirteen Nobel Lectures from different years have been taken as text material from: http://nobelprize.org/nobelprizes/lists/all/

E-01 Jimmy Carter            Nobel lecture (Peace 2002)
E-02 Toni Morrison           Nobel lecture (Literature 1993)
E-03 George C. Marshall      Nobel lecture (Peace 1953)
E-04 James M. Buchanan Jr.   Nobel lecture (Economy 1986)
E-05 Saul Bellow             Nobel lecture (Literature 1976)
E-06 John Macleod            Nobel lecture (Medicine 1925)
E-07 Sinclair Lewis          Nobel lecture (Literature 1930)
E-08 Ernest Rutherford       Nobel lecture (Chemistry 1908)
E-09 Bertrand Russell        Nobel lecture (Literature 1950)
E-10 Linus Pauling           Nobel lecture (Peace 1963)
E-11 Frederick G. Banting    Nobel lecture (Medicine 1925)
E-12 Pearl Buck              Nobel lecture (Literature 1938)
E-13 Richard P. Feynman      Nobel lecture (Physics 1965)
German (I.-I. Popescu)

For German, 17 literary texts (of different genres), written by various authors, have been taken from http://gutenberg.spiegel.de.

G-01 Schiller, F.v.         Der Taucher
G-02 Anonym                 Fabel – Zaunbär
G-03 Krummacher, F.A.       Das Krokodil
G-04 Anonym                 Fabel – Mäuschen
G-05 Goethe, J.W.v.         Der Gott und die Bajadere
G-06 Sachs, H.              Das Kamel
G-07 Heine, H.              Belsazar
G-08 Droste-Hülshoff, A.    Der Geierpfiff
G-09 Goethe, J.W.v.         Elegie 19
G-10 Goethe, J.W.v.         Elegie 13
G-11 Goethe, J.W.v.         Elegie 15
G-12 Goethe, J.W.v.         Elegie 2
G-13 Fontane, Th.           Gorm Grymme
G-14 Goethe, J.W.v.         Elegie 5
G-15 Moericke, E.           Peregrina
G-16 Lichtwer, M.G.         Die Rehe
G-17 Goethe, J.W.v.         Der Erlkönig
Hawaiian (V. Krupa)

For Hawaiian, one of the approximately forty Polynesian languages, six texts were analyzed; four of them (Hw-03 to Hw-06) were taken from the Internet source http://www2.hawaii.edu/~kroddy/moolelo/kawelo/.

Hw-01 O Makalii, he hookele kaulana ia no na waa o Hawaii-nui
Hw-02 Kaao no Punia. Selections from Fornander’s Hawaiian Antiquities and Folk-Lore. Honolulu, University of Hawaii Press 1959, pp. 6–17
Hw-03 Moolelo, Kawelo, Mokuna I – Ke kuauhau o kawelo
Hw-04 Moolelo, Kawelo, Mokuna II – Ke hanau ana o kawelo
Hw-05 Moolelo, Kawelo, Mokuna III – Ka hoolele lupe ana o kauahoa me kawelo
Hw-06 Moolelo, Kawelo, Mokuna IV – Ka ike ana o ko kawelo, uhane ia uhumakaikai
Hungarian (G. Altmann) For Hungarian, one of the Ugro-Finnic (or Finno-Ugrian) languages, five online newspaper texts were taken randomly from the Internet; Hungarian is the principal member of the Ugric subgroup. H-01 H-02 H-03 H-04 H-05
Orbán Viktor beszéde az Astoriánál A nominalizmus forradalma Népszavazás Egyre több Kunczekolbász
Indonesian (G. Altmann)

Indonesian (Bahasa Indonesia), the official language of Indonesia, written in Latin script, is one of the most widely spoken languages of the world; for our study, five online newspaper texts from the daily press were analyzed. In-01 In-02 In-03 In-04 In-05
Assagaf-Ali Baba Jadi Asisten BRI Siap Cetak Miliarder Dalam Dua Bulan Pengurus PSM Terbelah Pemerintah Andalkan Hujan Pelni Jamin Tiket Tidak Habis
Italian (I.-I. Popescu)

For Italian, one of the Romance languages, closed parts of various literary texts by different authors were randomly taken from the Internet.

I-01 Silvio Pellico        Le mie prigioni
I-02 Alessandro Manzoni    I promessi sposi
I-03 Giacomo Leopardi      Canti
I-04 Grazia Deledda        Canne al vento
I-05 Edmondo de Amicis     Il cuore
Kannada (B.D. Jayaram and M.N. Vidya) For Kannada, one of the four major Dravidian languages of South India, 47 literary and scientific texts have been chosen for analysis. Kn-001
Kn-003
Prof.B.C.Poojara, Prof.Balavantha. M.U Prof.B.C.Poojara, Prof.Balavantha. M.U Pradhana Gurudhat
Kn-004
Pradhana Gurudhat
Kn-005
T.R. Nagappa
Kn-006
T.R. Nagappa
Kn-007
Ranjana Bhatta
Kn-008
Ranjana Bhatta
Kn-010
Prof. K.D. Basava
Kn-011
D.N.S. Murthy
Kn-002
Lekka Parishodhaneya moolathatvagalu mattu Aacharane (1987), pp. 1–29 Lekka Parishodhaneya moolathatvagalu mattu Aacharane (1987), pp. 96–126 Aadalitha Bashe Kelavu Vicharagalu (1984), pp. 71–92 Aadalitha Bashe Kelavu Vicharagalu (1984), pp. 93–103 Vayskara Shikshana mathu swayam seve (1988), pp. 1–15 Vayskara Shikshana mathu swayam seve (1988), pp. 16–42 Kubera Rajya Chithra-Vichithra (1983), pp. 1–32 Kubera Rajya Chithra-Vichithra (1983), pp. 33–55 Lekka Parishodhana Shasthra (1984), pp. 1–25 Shreshta arthashasthagnayaru (1990), pp. 3–53 (continued on next page)
Kn-012 Pn. Keshava Murthy: Ashtavarga Paddhathi, pp. 2–164
Kn-013 Galigali Shivshankar aralimatti: Asian Kredegalu (1983), pp. 1–12
Kn-015 B. Niranjan Ram: Raviakashakke Booshanam (1987), pp. 3–41
Kn-016 Dr. Om Prakash: Asthama (1982), pp. 1–42
Kn-017 Dr. Om Prakash: Asthama (1982), pp. 43–73
Kn-019 Prof. K.D. Basava: Lekka Parishodhana Shasthra (1984), pp. 1–25
Kn-020 Prof. K.D. Basava: Lekka Parishodhana Shasthra (1984), pp. 28–40
Kn-021 Girija shankar: Lekka Parishodhana Shasthra (1986), pp. 43–52
Kn-022 Girija shankar: Lekka Parishodhana Shasthra (1986), pp. 53–80
Kn-023 Girija shankar: Lekka Parishodhana Shasthra (1986), pp. 81–109
Kn-024 Shankar Mookashi Punekara: Avadheshwari (1987), pp. 1–30
Kn-025 Shankar Mookashi Punekara: Avadheshwari (1987), pp. 31–60
Kn-026 D. Ganesh: Makkala Aatikegalu (1988), pp. 1–42
Kn-030 D.D. Venkatarao: Banking Siddhantha (1989), pp. 1–24
Kn-031 D.D. Venkatarao: Banking Siddhantha (1989), pp. 35–61
Kn-044 K.S. Rangappa: Takkanige Pooje (1984), pp. 3–22
Kn-045 Vidyalankara, S.K. Ramachandra Rao: Karnatakadha kalegalu-Bhoomike (1987), pp. 73–102
Kn-046 Vidyalankara, S.K. Ramachandra Rao: Karnatakadha kalegalu-Bhoomike (1987), pp. 103–131
Kn-047 Vidyalankara, S.K. Ramachandra Rao: Karnatakadha kalegalu-Bhoomike (1987), pp. 132–153
Kn-048 Vidyalankara, S.K. Ramachandra Rao: Karnatakadha kalegalu-Bhoomike (1987), pp. 154–164
Kn-068 Girish Kaasarvalli: Cinema: kale-nele (1983), pp. 1–22
Kn-069 Girish Kaasarvalli: Cinema: kale-nele (1983), pp. 23–54
Kn-070 Girish Kaasarvalli: Cinema: kale-nele (1983), pp. 55–75
Kn-071 Girish Kaasarvalli: Cinema: kale-nele (1983), pp. 76–114
Kn-075 Karnataka Vanijya Karyasamsthegala: Adhiniyama (1988), pp. 1–24
Kn-079 Dr. Nalini Murthy: Ganaka Endharenu (1987), pp. 3–44
Kn-080 B.S. Chandrashekar: Samuha Samparka Madhyamagalu (1982), pp. 1–33
Kn-081 B.S. Chandrashekar: Samuha Samparka Madhyamagalu (1982), pp. 34–55
Kn-082 B.S. Chandrashekar: Samuha Samparka Madhyamagalu (1982), pp. 56–66
Kn-090 N.B.: Kyamaronina Bannagalodane mattu Arab Deshagalalli (1984), pp. 5–28
Kn-100 N.B.: Kyamaronina Bannagalodane mattu Arab Deshagalalli (1984), pp. 29–56
Kn-101 B. Harish: Naayi (1981), pp. 1–21
Kn-102 B. Harish: Naayi (1981), pp. 22–41
Kn-104 Eshwara chandra: Nakashe kale (1985), pp. 1–6
Kn-105 Eshwara chandra: Nakashe kale (1985), pp. 7–46
Kn-143 Dr. S. Rama Rao: Prathama Chikitse, pp. 5–34
Kn-186 B.S. Chandrashekar: Samuha Samparka Madhyamagalu (1982), pp. 67–91
Lakota (R. Pustet) From the Native American language of Lakota, four tape-recorded and transcribed stories have been chosen for analysis; all stories were tape-recorded in Denver, Colorado, USA.
Lk-01 The fly on the window. Neva Standing Bear, tape-recorded 11/16/1994
Lk-02 Iktomi meets the prairie chicken and Blood Clot Boy. Neva Standing Bear, tape-recorded 9/12/1994
Lk-03 Iktomi meets two women and Iya. Neva Standing Bear, tape-recorded 9/19/1994
Lk-04 Bean, grass, and fire. Florine Red Ear Horse, tape-recorded 9/19/1995
Latin (I.-I. Popescu) For Latin, a historical Italic language (with its two varieties, Classical Latin and Vulgar Latin), six literary texts by different authors were studied.
Lt-01 Vergil: Georgicon liber primus
Lt-02 Apuleius: Fables, Book 1
Lt-03 Ovidius: Ars amatoria, liber primus
Lt-04 Cicero: Post reditum in senatu oratio
Lt-05 Martialis: Epigrammata
Lt-06 Horatius: Sermones. Liber 1, Sermo 1
Maori (V. Krupa) From Maori, one of the two most important Eastern Polynesian languages, five folk narratives and myths have been taken for analysis.
M-01 Nga Mahi a Nga Tupuna, ed. George Grey. Wellington, L.T., 3rd edition 1953
M-02 Ko te paamu tuatahi whakatiputipi kau a te Maori. Te ao hou – The New World [electronic resource]; No. 5 (Spring 1953)
M-03 A tawhaki, te tohunga rapu tuna. Te ao hou – The New World [electronic resource]; No. 10 (April 1955)
M-04 Ka pu te ruha ka hao te rangatahi. Accessible in: Nga korero a reweti kohere ma (NZETC, New Zealand Electronic Texts)
M-05 Ka kimi a maui i ona matua. In: Te ao hou; No. 8, Winter 1954
Marathi (B.D. Jayaram, M.N. Vidya) Marathi, an Indo-Aryan language of Western India, which is estimated to be more than 1300 years old, is the fourth most spoken language of India; 50 texts from different domains of writing and science have been analyzed.
Mr-001 Prof. B.P. Joshi: Nisar Sheti (1991), pp. 77–97
Mr-002 A.V. Patil: Roopvatika Sangapon (1984), pp. 146–161
Mr-003 B.A. Phadnis: Galithachi Pike (1988), pp. 67–83
Mr-004 G.B. Dashputhre: Vanshi Aani Vanvignan (1986), pp. 29–133
Mr-005 Pandurang, Sarala: Bhaajipala Utpadan (1988), pp. 1–32
Mr-006 Chaudri: Limbuvargiy Phaljhaade (1984), pp. 6–35
Mr-007 Vedhprakash patil: Anjirachi Lagvad (1990), pp. 5–33
Mr-008 R.P.S: Zamitil paani va shodh vaapar (1987), pp. 1–109
Mr-009 Shivaji Tombre: Bhaat-Sheti (1990), pp. 1–206
Mr-010 N.D. Patil, G.R. Dhume: Pik Vadisaati Khathe, pp. 15–164
Mr-011 Dr. M.T.: Jeevanache Rahasay Hasthasamudrik (1991), pp. 111–176
Mr-015 Dr. M.T.: Jeevanache Rahasay Hasthasamudrik (1991), pp. 5–89
Mr-016 V.A.P.: Thode Adbut Thode goodh (1986), pp. 5–48
Mr-017 Shri V.A.P.: Phalajyotishateel shankasamadhaan (1993), pp. 4–26
Mr-018 V.L. Pandy: Thumcha chehara thumche yaktimatv (1990), pp. 9–89
Mr-020 V.A.P.: Thode Adbut Thode goodh (1986), pp. 49–74
Mr-021 V.D.P.: Phaljothishateel shankasamadhaan (1993), pp. 27–55
Mr-022 V.L. Pandy: Thumcha chehara thumcha yakthimathv (1990), pp. 90–113
Mr-023 V.A.P.: Thode Adbut Thode goodh (1986), pp. 91–119
Mr-024 V.D.P.: Phaljothishateel shankasamadhaan (1993), pp. 56–85
Mr-026 Kanchan Ganekar: Nath ha majha (1989), pp. 1–17
Mr-027 Prof. Sarangar: Rashtriy Uthpann (1985), pp. 1–104
Mr-028 G.A. Kulkarni: Mansi Arbhaat Aani Chillar, pp. 9–26
Mr-029 Madhu Nene: Bankingshi Manthri, pp. 1–62
Mr-030 Vishram Bedekar: Ek jhaad aani don pakshi (1992), pp. 1–18
Mr-031 Pu.L. Deshpanday: Maithr (1989), pp. 1–15
Mr-032 Sunitha Desh panday: Aahe Manohar Tari. . . , pp. 1–16
Mr-033 Lakshman Gayakwad: Uchalya (1990), pp. 1–16
Mr-034 Sou Veena Gavankar: Ek Hota Kabir (1988), pp. 1–118
Mr-035 Anil Avachat: Swathavishayi (1990), pp. 1–6
Mr-036 Kanchan Ganekar: Nath ha maja (1989), pp. 18–34
Mr-038 Madhu Nene: Bankingshi Manthri, pp. 82–127
Mr-040 P.L. Deshpandya: Maithr, pp. 86–107
Mr-043 Sou Veena gavankar: Ek hota Kabir (1988), pp. 119–207
Mr-046 Madhu Nene: Bankingshi Manthri, pp. 128–168
Mr-052 Madhu Nene: Bankingshi Manthri, pp. 169–207
Mr-149 Aravind . . . ladhar And Madhav Saraf: Sheharbaazarshi Manthri (1986), pp. 1–13
Mr-150 Muthali Desai: Shearbaazar ek alibabachi guha (1990), pp. 1–49
Mr-151 Daasthane And Madhav Saraf: Jagahtheel pramukh Arthvyavastha (1986), pp. 1–62
Mr-154 Shearbaazar ek alibabachi guha (1990), pp. 50–113
Mr-288 Madhav Gadkari: Chaupher (1988), pp. 1–14
Mr-289 Madhav Gadkari: Choupher Vol 3 (1989), pp. 4–42
Mr-290 Arun Tikker: Samaaj-Spandane (1990), pp. 8–54
Mr-291 Madhav Gadkari: Chaupher (1988), pp. 15–27
Mr-292 Madhav Gadkari: Chaupher (1988), pp. 43–84
Mr-293 Arun Tikker: Samaaj-Spandane (1990), pp. 62–91
Mr-294 Madhav Gadkari: Chaupher (1988), pp. 28–40
Mr-295 Madhav Gadkari: Chaupher (1988), pp. 94–170
Mr-296 Madhav Gadkari: Chaupher (1988), pp. 41–53
Mr-297 Madhav Gadkari: Chaupher (1988), pp. 201–215
Marquesan (V. Krupa) Marquesan is one of the Eastern Polynesian languages, spoken in the Marquesas Islands of French Polynesia. Here, three Marquesan folklore texts from different sources were taken for analysis.
Mq-01 Story Kopuhoroto’e II from the collection Henri Lavondès: Récits marquisiens dits par Kehuenui avec la collaboration de S. Teikihuupoko. Publication provisoire. Papeete, Centre ORSTOM 1964.
Mq-02 Ka’akai o Te Henua ’Enana. A Story of the Country of People, recorded by Sam H. Elbert.
Mq-03 Te hakamanu. La danse de l’oiseau. Légende marquisienne. Texte marquisien: Lucien Teikikeuhina Kimitete. Papeete, Haere Po no Tahiti 1990.
Rarotongan (V. Krupa) Rarotongan, also called Cook Islands Maori, is one of the Polynesian languages; five folkloristic prose texts were taken from Legends from the Atolls (ed. Kauraka Kauraka, Suva 1983).
Rt-01 Kauraka Kauraka: Akamaramaanga (1983)
Rt-02 Tepania Puroku: Ko Paraka e te Kehe (1977)
Rt-03 Herekaiura Atama: Ko Tamaro e ana uhi (1977)
Rt-04 Temu Piniata: Te toa ko Teikapongi (1982)
Rt-05 Kaimaria Nikoro: Te toa ko Herehuaroa e Araitetonga (1982)
Romanian (I.-I. Popescu) For Romanian, a Romance language, six literary texts by Romanian poet Mihail Eminescu (1850–1889) were taken from http://www.romanianvoice.com/poezii/poeti/eminescu.php.
R-01 Eminescu, M.: Luceafarul – Lucifer
R-02 Eminescu, M.: Scrisoarea III – Satire III
R-03 Eminescu, M.: Scrisoarea IV – Satire IV
R-04 Eminescu, M.: Scrisoarea I – Satire I
R-05 Eminescu, M.: Scrisoarea V – Satire V
R-06 Eminescu, M.: Scrisoarea II – Satire II
Russian (P. Grzybek) For Russian, the most widely spoken of the Slavic languages, five literary prose texts from different authors were taken from the Graz Text Data Base http://quanta-textdata.uni-graz.at/.
Ru-01 Fedor M. Dostoevskij: Prestuplenie i nakazanie (p. I, ch. 1)
Ru-02 Nikolaj V. Gogol': Portret
Ru-03 Viktor Pelevin: Buben verchnego mira
Ru-04 Lev N. Tolstoy: Metel'
Ru-05 Ivan S. Turgenev: Bežin lug
Samoan (V. Krupa) For Samoan, one of the Polynesian languages, five folkloristic prose texts were taken from: Tala o le Vavau. The Myths, Legends and Customs of Old Samoa (Polynesian Press Samoa House, Auckland 1987).
Sm-01 O le mea na maua ai le ava, pp. 15–16
Sm-02 O le tala ia Sina ma lana tuna, pp. 17–19
Sm-03 O le tala ia Tamafaiga, pp. 49–52
Sm-04 O le faalemigao, pp. 91–92
Sm-05 O upu faifai ma le gaoi, p. 95
Slovene (P. Grzybek) For Slovene, a South Slavic language, five literary prose texts from different authors were taken from the Graz Text Data Base http://quanta-textdata.uni-graz.at/.
Sl-01 Ivan Cankar: V temi
Sl-02 Slavko Grum: Vrata
Sl-03 Josip Jurčič: Sosedov sin (ch. I)
Sl-04 Ferdo Kočevar: Grof in menih
Sl-05 Fran Levstik: Zveženj
Tagalog/Pilipino (G. Altmann) For Tagalog, a Central Philippine language, belonging to the Malayo-Polynesian group, three fiction texts were taken from http://www.seasite.niu.edu/Tagalog/tagalogshortstoriesfs.htm.
T-01 A.V. Hernandez: Magpinsan
T-02 A.V. Hernandez: Limang Alas, Tatlong Santo
T-03 A.B.L. Rosales: Kristal Na Tubig
References
Altmann, G.; Lehfeldt, W. (1980). Einführung in die quantitative Phonologie. Bochum: Brockmeyer.
Altmann, G. (1988). Wiederholungen in Texten. Bochum: Brockmeyer.
Altmann, G.; Köhler, R. (1996). "'Language Forces' and synergetic modelling of language phenomena." In: Glottometrika, 15; 62–76.
Altmann, V.; Altmann, G. (2005). Erlkönig und Mathematik. http://ubt.opus.hbz-nrw.de/volltexte/2005/325
Andersen, S.; Altmann, G. (2005). "Information content of words in text." In: Grzybek, P. (ed.), Contributions to the Science of Language. Word Length Studies and Related Issues. Boston: Kluwer, 93–117.
Baayen, R.H. (1989). A corpus-based approach to morphological productivity. Statistical analysis and psycholinguistic interpretation. Amsterdam: Centrum voor Wiskunde en Informatica.
Baayen, R.H. (2001). Word frequency distributions. Dordrecht: Kluwer.
Beisel, J.-N.; Usseglio-Polatera, P.; Bachmann, V.; Moretau, J.-C. (2003). "A comparative analysis of evenness index sensitivity." In: International Review of Hydrobiology, 88; 3–15.
Best, K.-H.; Kohlhase, J. (eds.) (1983). Exakte Sprachwandelforschung. Göttingen: Herodot.
Best, K.-H. (ed.) (1997). The distribution of word and sentence length. Trier: WVT. [= Glottometrika; 16]
Best, K.-H. (ed.) (2001). Häufigkeitsverteilungen in Texten. Göttingen: Peust & Gutschmidt.
Boroda, M.G. (1973). "K voprosu o metroritmičeski elementarnoj edinice v muzyke." In: Soobščenija Akademii nauk Gruzinskoj SSR, 71/3; 745–748.
Boroda, M.G. (1982). "Die melodische Elementareinheit." In: Orlov, J.K.; Boroda, M.G.; Nadarejšvili, I.Š. (eds.), Sprache, Text, Kunst. Quantitative Analysen. Bochum: Brockmeyer, 205–221.
Boroda, M.G. (1986). Problemy segmentacii v muzyke. Strukturnye edinici muzykal'noj reči. Moskva: GBL.
Boroda, M.G. (1991). "Rhythmic models in music: towards the quantitative study." In: Musikometrika, 3; 123–162.
Boroda, M.G. (1992). "Towards a phrase type melodic unit in music." In: Musikometrika, 4; 15–81.
Bortz, J.; Lienert, G.A.; Boehnke, K. (1990). Verteilungsfreie Methoden in der Biostatistik. Berlin: Springer.
Brainerd, B. (1976). "On the Markov nature of the text." In: Linguistics, 76; 5–30.
Bunge, M. (1961). "Kinds and criteria of scientific laws." In: Philosophy of Science, 28; 260–281.
Bunge, M. (1977). The furniture of the world. Dordrecht: Reidel.
Bybee, J.; Hopper, P. (eds.) (2001). Frequency and the emergence of linguistic structure. Amsterdam.
Chitashvili, R.J.; Baayen, R.H. (1993). "Word frequency distribution of text and corpora as large number of rare event distributions." In: Hřebíček, L.; Altmann, G. (eds.), Quantitative Text Analysis. Trier: WVT, 54–135.
Erdélyi, A.; Magnus, W.; Oberhettinger, F.; Tricomi, F.G. (1955). Higher Transcendental Functions I–III. New York: McGraw-Hill.
Esteban, M.D.; Morales, D. (1995). "A summary of entropy statistics." In: Kybernetika, 31/4; 337–346.
Estoup, J.B. (1916). Les gammes sténographiques. Paris: Institut Sténographique.
Greenberg, J.H. (1960). "A quantitative approach to the morphological typology of languages." In: International Journal of American Linguistics, 26; 178–194.
Grotjahn, R.; Altmann, G. (1993). "Modelling the distribution of word length: some methodological problems." In: Köhler, R.; Rieger, B. (eds.), Contributions to Quantitative Linguistics. Dordrecht: Kluwer, 141–153.
Grzybek, P. (ed.) (2006). Contributions to the science of language. Word length studies and related issues. Boston: Kluwer.
Guiter, H. (1974). "Les rélations (fréquence-longueur-sens) des mots (langues Romanes et Anglais)." In: Atti del Congresso Internazionale di Linguistica e Filologia Romanza, 14/4; 373–381.
Haight, F.A. (1969). "Two probability distributions connected with Zipf's rank-size conjecture." In: Zastosowania Matematyki, 10; 225–228.
Herdan, G. (1956). Language as choice and chance. Groningen: Nordhoff.
Herdan, G. (1969). "Mathematical models of language." In: Studium Generale, 22; 191–196.
Hirsch, J.E. (2005). An index to quantify an individual's scientific research output. http://arxiv.org/PS_cache/physics/pdf/0508/0508025.pdf; http://en.wikipedia.org/wiki/Hirsch_number
Johnson, N.L.; Kotz, S. (1970). Continuous univariate distributions I–II. Boston: Houghton Mifflin.
Kendall, M.G.; Stuart, A. (1963). The advanced theory of statistics. London: Griffin.
Köhler, R. (1986). Zur linguistischen Synergetik. Struktur und Dynamik der Lexik. Bochum: Brockmeyer.
Köhler, R.; Altmann, G. (1993). "Begriffsdynamik und Lexikonstruktur." In: Beckmann, F.; Heyer, G. (eds.), Theorie und Praxis des Lexikons. Berlin: de Gruyter, 173–190.
Köhler, R.; Galle, M. (1993). "Dynamic aspects of text characteristics." In: Hřebíček, L.; Altmann, G. (eds.), Quantitative Text Analysis. Trier: WVT, 46–53.
Köhler, R. (1995). Bibliography of Quantitative Linguistics. Amsterdam/Philadelphia: John Benjamins.
Köhler, R. (2005). "Synergetic linguistics." In: Köhler, R.; Altmann, G.; Piotrowski, R.G. (eds.), Quantitative Linguistics. An International Handbook. Berlin: de Gruyter, 760–774.
Köhler, R. (2006). "The frequency distribution of the lengths of length sequences." In: Genzor, J.; Bucková, M. (eds.), Favete linguis. Studies in honour of Victor Krupa. Bratislava: Slovak Academic Press, 145–152.
Köhler, R. (2007). Word length in text. A study in the syntagmatic dimension. [In print]
Köhler, R.; Naumann, S. (2008). "Quantitative text analysis using L-, F- and T-segments." In: Preisach, C.; Burckhardt, H.; Schmidt-Thieme, L.; Decker, R. (eds.), Data Analysis, Machine Learning and Applications. Berlin, Heidelberg: Springer, 637–646.
Kornai, A. (2002). "How many words are there?" In: Glottometrics, 4; 61–86.
Krupa, V. (1965). "On quantification of typology." In: Linguistics, 12; 31–36.
Li, W. (1992). "Random texts exhibit Zipf's-law-like word frequency distribution." In: IEEE Transactions on Information Theory, 38; 1842–1845.
Li, W. (1999). References on Zipf's law. http://linkage.rockefeller.edu/wli/zipf/
Mačutek, J. (2006). "Pairs of corresponding discrete and continuous distributions – mathematics behind, algorithms and generalizations." In: Grzybek, P.; Köhler, R. (eds.), Exact Methods in Study of Language and Text. Berlin: de Gruyter, 407–414.
Mačutek, J.; Popescu, I.-I.; Altmann, G. (2007). "Confidence intervals and tests for the h-point and related text characteristics." In: Glottometrics, 15; 42–52.
Mačutek, J.; Altmann, G. (2007a). "Discrete and continuous modelling in quantitative linguistics." In: Journal of Quantitative Linguistics, 14/1; 81–94.
Mačutek, J.; Altmann, G. (2007b). Parallel discrete and continuous distributions defined on bounded supports. [In print]
Mandelbrot, B. (1953a). Théorie informationelle (sémiologique) de la structure statistique des langues, Linguistique Saussurienne et loi de Zipf – Technical Report. Cambridge, Mass.: Research Laboratory of Electronics.
Mandelbrot, B. (1953b). "An informational theory of the statistical structure of language." In: Jackson, W. (ed.), Communication Theory. London: Butterworth, 486–502.
Mandelbrot, B. (1961). "Final note on a class of skew distribution functions: analysis and critique of a model due to H.A. Simon." In: Information and Control, 4; 198–216.
McIntosh, R.P. (1967). "An index of diversity and the relation of certain concepts to diversity." In: Ecology, 48; 392–404.
Meyer-Eppler, W. (1969). Grundlagen und Anwendungen der Informationstheorie. Berlin: Springer.
Miller, G.A. (1957). "Some effects of intermittent silence." In: The American Journal of Psychology, 70; 311–314.
Ord, J.K. (1967). "On a system of discrete distribution." In: Biometrika, 54; 649–659.
Ord, J.K. (1972). Families of frequency distributions. London: Griffin.
Orlov, J.K.; Boroda, M.G.; Nadarejšvili, I.Š. (1982). Sprache, Text, Kunst. Quantitative Analysen. Bochum: Brockmeyer.
Popescu, I.-I.; Altmann, G. (2006). "Some aspects of word frequencies." In: Glottometrics, 13; 23–46.
Popescu, I.-I.; Altmann, G. (2006b). "Some geometric properties of word frequency distributions." In: Göttinger Beiträge zur Sprachwissenschaft, 13; 87–98.
Popescu, I.-I.; Altmann, G. (2008). "Autosemantic compactness of texts." In: Altmann, G.; Zadorozhna, I.; Matskulyak, J. (eds.), Problems of General, Germanic and Slavic Linguistics. Papers for 70th Anniversary of Professor V. Levickij. Chernivtsy: Books-XXI, 472–480.
Popescu, I.-I.; Altmann, G. (2009). "A modified text indicator." In: Kelih, E.; Levickij, V.; Altmann, G. (eds.), Problems of quantitative text analysis. Chernivtsy: Ruta. [Submitted]
Popescu, I.-I.; Best, K.-H.; Altmann, G. (2007). "On the dynamics of word classes in text." In: Glottometrics, 14; 58–71.
Sandefur, J.T. (1990). Discrete dynamical systems. Theory and applications. Oxford: Clarendon Press.
Sigurd, B. (1968). "Rank-frequency distribution for phonemes." In: Phonetica, 18; 1–15.
Skorochod'ko, E.F. (1981). Semantische Relationen in der Lexik und in Texten. Bochum: Brockmeyer.
Strauß, U.; Sappok, Ch.; Diller, H.J.; Altmann, G. (1984). "Zur Theorie der Klumpung von Textentitäten." In: Glottometrika, 7; 73–100.
Tuldava, J. (1974). "O statističeskoj strukture teksta." In: Sovetskaja pedagogika i škola, 9; 5–33.
Tuldava, J. (1998). Probleme und Methoden der quantitativ-systemischen Lexikologie. Trier: WVT.
Uhlířová, L. (2007). "Word frequency and position in sentence." In: Glottometrics, 14; 1–20.
West, D.B. (2001). Introduction to graph theory. Upper Saddle River, NJ: Prentice Hall.
Wimmer, G.; Köhler, R.; Grotjahn, R.; Altmann, G. (1994). "Towards a theory of word length distribution." In: Journal of Quantitative Linguistics, 1; 98–106.
Wimmer, G.; Altmann, G. (1996). "The theory of word length: Some results and generalizations." In: Glottometrika, 15; 112–133.
Wimmer, G.; Altmann, G. (1999). "Review Article: On vocabulary richness." In: Journal of Quantitative Linguistics, 6/1; 1–9.
Wimmer, G.; Altmann, G. (1999a). Thesaurus of univariate discrete probability distributions. Essen: Stamm.
Wimmer, G.; Altmann, G.; Hřebíček, L.; Ondrejovič, S.; Wimmerová, S. (2003). Úvod do analýzy textov. Bratislava: Veda.
Wimmer, G. (2005). "The type-token relation." In: Köhler, R.; Altmann, G.; Piotrowski, R.G. (eds.), Quantitative Linguistics. An International Handbook. Berlin: de Gruyter, 361–368.
Wimmer, G.; Altmann, G. (2005). "Unified derivation of some linguistic laws." In: Köhler, R.; Altmann, G.; Piotrowski, R.G. (eds.), Quantitative Linguistics. An International Handbook. Berlin: de Gruyter, 791–807.
Wimmer, G.; Altmann, G. (2006). "Towards a unified derivation of some linguistic laws." In: Grzybek, P. (ed.), Word length studies and related issues. Dordrecht: Springer, 329–337.
Ziegler, A.; Altmann, G. (2002). Denotative Textanalyse. Wien: Edition Praesens.
Zipf, G.K. (1935). The psycho-biology of language. Boston: Houghton Mifflin.
Zipf, G.K. (1949). Human behavior and the principle of least effort. Cambridge: Addison-Wesley.
Zörnig, P.; Altmann, G. (1983). "The repeat-rate of phoneme frequencies and the Zipf-Mandelbrot law." In: Glottometrika, 5; 205–211.
Zörnig, P.; Altmann, G. (1984). "The entropy of phoneme frequencies and the Zipf-Mandelbrot law." In: Glottometrika, 6; 41–47.
Subject index
A A- indicator . . . . . . . . . . . . . . . . . . . . . 86 α -graph . . . . . . . . . . . . . . . . . . . 125, 126 autosemantic . . . . . . . 1, 18, 30, 36, 37, 87, 91, 95–97, 101–103, 107, 109, 111–126, 206, 216, 249 autosemantic compactness . . . . 1, 104, 105, 107–109 autosemantic pace filling . . . . . . . . 104 auxiliary word . . . . . . . . . . . . . . . . . . 113 B B- indicator . . . . . . . . . . . . . . . . . . . . . 86 b-indicator . . . . . . . . . . . . . . . . . . . . . . 44 β -graph . . . . . . . . . . . . . . 122, 124–126 binominal distribution see distribution, binominal Bulgarian 4, 24, 43, 47, 159, 164, 204, 205, 211, 218 C compactness . . . see also autosemantic compactness, 101–110 Conway-Maxwell-Poisson distribution . . . . . . . see distribution, Conway-Maxwell-Poisson co-occurrence . . . . . . . . . . . . . 111–124 crowding . . . . . . . . . 101–103, 210, 249 cumulative distribution . . . . . . . . . . see distribution, cumulative Czech . . . 4, 24, 43, 47, 159, 164, 167, 253 D density . . . . . . . . . . . . . . . . . . . . . . 1, 111 distance . 17, 50, 63, 87, 95, 109, 111, 166, 210, 227–228, 232 distribution
binomial 125, 131, 142 Conway-Maxwell-Poisson 131, 142 cumulative 26, 27, 29, 49 Estoup 131, 134 extended Zipf- Mandelbrot 143 frequency 1, 113, 127, 131, 164, 167, 224, 249 geometric 125, 128, 130, 131, 134, 228, 229 Good 132, 148 Haight- Zeta 143 Hyperpascal 131, 134, 149, 223 Hyperpoisson 131, 134 Johnson-Kotz 131, 134, 144, 223 Lerch 132 Lotka 131 Naranan-Balasubrahmanyan 132 negative binomial 134, 145, 223, 229 negative hypergeometric 131, 134, 149, 153, 156, 160 Poisson 125, 129–131 rank frequency vi, 9, 14, 17, 18, 28, 29, 35, 37, 48, 55, 73–75, 87, 101, 128, 132, 143, 153, 154, 156, 158, 160, 166, 167, 169, 174, 185, 219, 221 Riemann Zeta 127 right truncated Zeta 28, 29, 125, 127, 135, 153, 186, 190, 191, 222, 223 right truncated Zipf-Mandelbrot 153 Rouault 131 Simon 131 Waring 131, 223
Subject index Yule 131, 148 Zeta 127, 132, 133, 135, 143, 152, 153, 185, 186, 189, 190, 223 Zipf-Mandelbrot 132–134, 137, 149, 152, 153, 221–222
E English . 4, 6, 7, 18, 24, 43, 47, 75, 89, 155–164, 183, 232, 234 entropy . . . . . . . . . . . . . . . 173–185, 249 essentialism . . . . . . . . . . . . . . . . . . . . . . 2 Estoup distribution . . see distribution, Estoup extended Zipf-Mandelbrot distribution . . . . . . . see distribution, extended Zipf-Mandelbrot F Fourier analysis . . . . . . . . . . . . 214–216 frequency distribution see distribution, frequency frequency spectrum see also spectrum, 9, 10, 13, 35, 37, 57, 73, 74, 81, 87, 125, 132, 143, 192 G γ -graph . . . . . . . . . . . . . . . . . . . . . . . . 123 generalization . 8, 125, 127, 129, 134, 224, 250 geometric distribution see distribution, geometric German . 4, 5, 7, 24, 43, 47, 111, 113, 159, 164, 196, 197, 213, 232, 234 Gini’s coefficient . . . . . . . . . . . . . 54–57 Good distribution . . . . see distribution, Good Greenberg-Krupa index . . . . . . . . . . 24 H h-coverage . . . . . . . . . . . . . . . . . . . . . . 30 corrected 30 h-point . 17–30, 35, 37, 44, 52, 73, 95, 101, 109, 249
Haight-Zeta distribution . . . . . . . . . see distribution, Haight-Zeta hapax legomena . . . 30, 37, 50, 74, 87, 214–229 Hawaiian . . . . . . . . . . . . . . 4, 24, 43, 47 homogeneity . 2, 3, 165, 214, 241, 243 Hungarian . . .4, 5, 24, 34, 43, 47, 100, 159, 164, 199, 201, 236 Hyperpascal distribution . . . . . . . . . see distribution, Hyperpascal Hyperpoisson distribution . . . . . . . . see distribution, Hyperpoisson hypothesis . . . .91, 133, 186, 192, 203, 207–209, 213, 214, 216, 225, 242, 243, 251 I Indonesian . 4, 5, 9, 24, 25, 42, 43, 47, 100, 159, 164 Italian . . . . . . . . 4, 24, 43, 47, 159, 164 J Japanese . . . . . . . . . . . . . . . . . . 5, 9, 197 Johnson-Kotz distribution . . . . . . . . see distribution, Johnson-Kotz K k-point . . . . . . . . . . . . . . . . . . . 35–38, 81 Köhler’s control cycle . . . . . . . . . . . . . 1 Köhler-Galle method . . . . . . . . . . . .239 Kannada 4, 6, 24, 43, 47, 81, 159, 164 L Lakota . . . . . . . .4, 24, 43, 47, 159, 164 language agglutinating 43 analytic 6, 24, 48, 160 inflectional 6, 37, 43 synthetic 160, 164 Latin . . 4, 6, 24, 33, 34, 43, 47, 48, 87, 159, 164 law 7, 87, 90, 127, 129, 133, 186, 203, 225, 248
Subject index lemma . . . . 5, 6, 9, 10, 25, 26, 37, 112, 116, 117, 121, 124, 203 lemmatisation . . . . . . . . . . . 6, 111, 234 Lerch distribution . . . see distribution, Lerch Lorenz curve . . . . . . . . . . . . . 55, 57, 63 Lotka distribution . . . see distribution, Lotka M m-coverage . . . . . . . . . . . . . . . . . . 52–54 m-point . . . . . . . . . . . . . . . 48–52, 63, 65 Malay . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Maori . . . . . . . . 4, 24, 43, 47, 159, 164 Marathi . 4, 5, 24, 43, 47, 52, 159, 164 Markov chain . . . . . 219–228, 250, 251 Marquesan . . . . 4, 24, 43, 47, 159, 164 Maxwell’s demon . . . . . . . . . . . . . . . . 19 min D . . . . . . . . . . . . . . . . 50, 52, 53, 64 N n-coverage . . . . . . . . . . . . . . . . . . . . . . 64 n-point . . . . . . . . . . . . . . . . . . . . . . 54–65 Naranan-Balasubrahmanyan distribution . . . . . . . see distribution, Naranan-Balasubrahmanyan negative binomial distribution see distribution, negative binomial negative hypergeometric distribution . . . . . . . see distribution, negative hypergeometric O Ord’s criterion . . . 154, 155, 159, 160, 164, 250 P part of speech . . . . . . . . . . 88, 185, 203 Piotrowski law . . . . . . . . . . . . . . . . . . 88 plain counting . . . . . . . . . . . . . . . . . . 234 Poisson distribution . . see distribution, Poisson population . . . . . . . . . . . . . . . . . . . . . 1, 2
position absolute 210 final 211–213 initial 212, 213 relative 214–215 R R1 . . . . . . . . . . . . . . . . . . . . . . . . . . 33–34 R2 . . . . . . . . . . . . . . . . . . . . . . . . . . 38–43 R3 . . . . . . . . . . . . . . . . . . . . . . . 50–54, 65 R4 . . . . . . . . . . . . . . . . . . . . . . . . . . 58–63 rank frequency distribution . . . . . . . see distribution, rank frequency Rarotongan . . . . . . . . 4, 24, 43, 47, 159 ratio measurement . . . . . . . . . . . . . . . . 1 ratio method . . . . . . . . . . . . . . . . . . . 240 repeat rate . . . . . . . . . . . . . . . . . 165–185 requirement 3, 71, 186, 197, 198, 234, 235 reverse cumulative presentation . . . . 9 Riemann Zeta . . . see also distribution, Zeta right truncated Zeta distribution . . see distribution, right truncated Zeta right truncated Zipf-Mandelbrot distribution see distribution, right truncated Zipf-Mandelbrot Romanian . . . . 4, 24, 43, 47, 159, 164 Rouault distribution . . see distribution, Rouault runs . . . . . . . . . . . . . . . . . . 203–227, 250 Russian . . . . . . . 4, 24, 43, 47, 159, 164 S Samoan . . . . . . 4, 24, 43, 47, 159, 164 self-regulation . . . 1, 17, 100, 153, 197 Simon distribution . . . see distribution, Simon Slavic languages . . . . . . . . . . . . . . . . . . 9 Slovenian . . . . . . . . . . . . . . 4, 24, 43, 47 spectrum . . . . . . . . . 165–192, 219–223
standard measurement 234, 239, 244 stratified measurement . . . . . . see also window method, 1, 241 suppletion . . . . . . . . . . . . . . . . . . . . . . . . 6 synsemantic . . . 18–19, 23, 37, 38, 216 T Tagalog 4, 24, 25, 33, 43, 47, 159, 164 text coverage . . . . . . . . . . 1, 14, 73, 249 text length N . . .11, 19, 34, 36–44, 54, 70, 79, 80, 85, 107, 179–180, 239 text mixture . . . . . . . . . . . . . . . . . . 2, 232 thematic concentration 1, 19, 95–101, 103, 107, 123, 249 TTR . see also type-token relation, 71, 219, 231–248 type-token relation . see also TTR, vi, 1, 15, 231 V vocabulary richness . . . . . . . . 1, 29–31,
36–42, 52–58, 63, 73, 76–81, 166, 173, 232, 234, 239, 249 W Waring distribution . . see distribution, Waring window method . . . . see also stratified measurement, 241 word classes . . 87, 185–192, 195, 196 wording indicator . . . . . . . . . . . . 84, 86 Y Yule distribution see distribution, Yule Z Zeta distribution see distribution, Zeta Zipf law . . .see also distribution, Zeta, 127 Zipf- Orlov size . . . . . . . . . . . . . . . . . 70 Zipf-Estoup law . see also distribution, Zeta, 127 Zipf-Mandelbrot distribution . . . . . see distribution, Zipf-Mandelbrot
Author index
A Altmann, G. . . . . . . . . . 8, 9, 14, 17, 19, 25, 37, 48, 49, 54, 55, 71, 73, 75, 85, 87, 95, 101, 102, 111, 113, 117, 125, 126, 129, 130, 132, 143, 179, 186, 214, 224, 227–229, 233, 234, 237 Altmann, V. . . . . . . . . . . . . . . . . . . . . 111 Andersen, S. . . . . . . . . . . . . . . . . . . . . 71 B Baayen, R.H. . . . . . . . . . . . . . . 132, 143 Bachmann, V. . . . . . . . . . . . . . . . . . . 165 Beisel, J.-N. . . . . . . . . . . . . . . . . . . . . 165 Best, K.-H. . . . . . . 17, 87, 88, 186, 224 Boehnke, K. . . . . . . . . . . . . . . . . . . . 208 Boroda, M.G. . . . 1, 70, 133, 203, 218, 234 Bortz, J. . . . . . . . . . . . . . . . . . . . . . . . 208 Brainerd, B. . . . . . . . . . . . . . . . . . . . . 227 Bunge, M. . . . . . . . . . . . . . . . . . . . 7, 127 Bybee, J. . . . . . . . . . . . . . . . . . . . . . . 2, 8 C Chitashvili, R.J. . . . . . . . . . . . . . . . . 143 D Diller, H.J. . . . . . . . . . . . . . . . . 227–229 Droste-Hülshoff, A.v. . . . . . 88, 89, 92
G Galle, M. . . . 239 Goethe, J.W.v. . . . 26, 28–30, 35, 38, 48, 49, 55, 64, 112, 114, 239 Greenberg, J.H. . . . 24 Grotjahn, R. . . . 224 Grzybek, P. . . . 224 Guiter, H. . . . 199 H Hřebíček, L. . . . 8 Haight, F.A. . . . 143 Herdan, G. . . . 1, 128, 153, 236, 238, 239, 241, 242, 245–248 Hirsch, J.E. . . . 17, 19 Hopper, P. . . . 2, 8 J Johnson, N.L. . . . 132 K Köhler, R. . . . 129, 197–199, 203, 218, 219, 224, 225, 232, 233, 239 Kant, I. . . . 2 Kendall, M.G. . . . 168 Kohlhase, J. . . . 88 Kornai, A. . . . 232 Kotz, S. . . . 132 Krupa, V. . . . 24
E Erdélyi, A. . . . . . . . . . . . . . . . . . . . . . 152 Esteban, M.D. . . . . . . . . . . . . . 173, 185 Estoup, J.B. . . . . . . . . . . . . . . . . . . . . 127
L Lehfeldt, W. . . . . . . . . . . . . . . . . . . . 179 Lewis, S. . . . . . . . . . . . . . 104, 105, 108 Li, W. . . . . . . . . . . . . . . . . . . . . . . . . . 127 Lienert, G.A. . . . . . . . . . . . . . . . . . . . 208
F Feynman, R.P. . . . . . . . . . . . . . . . . . . 108
M Mačutek, J. . . . 25, 130
Magnus, W. . . . 152 Mandelbrot, B. . . . 127, 143 Marx, K. . . . 2 McIntosh, R.P. . . . 167 Meyer-Eppler, W. . . . 203 Miller, G.A. . . . 127 Morales, D. . . . 173, 185 Moretau, J.-C. . . . 165 N Nadarejšvili, I.S. . . . 1, 70, 133, 234 O Oberhettinger, F. . . . 152 Ondrejovič, S. . . . 8 Ord, J.K. . . . 154 Orlov, Ju.K. . . . 1, 70, 132, 133, 234 P Plato . . . 2 Popescu, I.-I. . . . 9, 14, 17, 19, 25, 37, 48, 49, 54, 55, 73, 75, 85, 87, 95, 101, 102, 186 R Rutherford, E. . . . 89–91, 95, 96, 101–105, 108, 186
S Sandefur, J.T. . . . 18 Sappok, Ch. . . . 227–229 Sigurd, B. . . . 128 Skorochodko, E.F. . . . 126 Strauß, U. . . . 227–229 Stuart, A. . . . 168 T Tricomi, F.G. . . . 152 Tuldava, J. . . . 236 U Uhlířová, L. . . . 203, 204, 211, 217, 244 Usseglio-Polatera, P. . . . 165 W West, D.B. . . . 126 Wimmer, G. . . . 8, 37, 71, 129, 132, 143, 224, 228, 234, 237, 240 Wimmerová, S. . . . 8 Z Zörnig, P. . . . 179 Ziegler, A. . . . 95, 111, 113, 117, 125, 126 Zipf, G.K. . . . v, 1, 127, 143, 153, 197, 199
Addresses
Altmann, Gabriel
Stüttinhauser Ringstr. 44
D–58515 Lüdenscheid, Germany
email: ram-verlag@t-online.de

Grzybek, Peter
Graz University, Department for Slavic Studies
A–8010 Graz, Austria
email: peter.grzybek@uni-graz.at

Jayaram, Bijapur Dayaloo
Central Institute of Indian Languages
I–570006 Mysore, Hunsur Road, Manasagangotri, India
email: [email protected]

Köhler, Reinhard
Universität Trier, Linguistische Datenverarbeitung
D–54296 Trier, Universitätsring 15, Germany
email: [email protected]

Krupa, Viktor
Institute of Oriental and African Studies SAS
SK–81364 Bratislava, Klemensová 19, Slovakia
email: [email protected]

Mačutek, Ján
Comenius University, Department of Applied Mathematics and Statistics
SK–84248 Bratislava, Mlynská dolina, Slovakia
email: [email protected]

Popescu, Ioan-Iovitzu
POB MG–18
RO–77125 Magurele / Ilfov, Romania
email: [email protected]
Pustet, Regina
Universität München, Institut für Allg. und Typologische Sprachwissenschaft
D–80539 München, Ludwigstraße 25
email: [email protected]

Uhlířová, Ludmila
Nýřanská 12
CZ–15300 Praha 16, Czech Republic
email: [email protected]

Vidya, Matummal N.
Central Institute of Indian Languages
I–570006 Mysore, Hunsur Road, Manasagangotri, India
email: [email protected]