Communications in Computer and Information Science
100
Cerstin Mahlow Michael Piotrowski (Eds.)
Systems and Frameworks for Computational Morphology Second International Workshop, SFCM 2011 Zurich, Switzerland, August 26, 2011 Proceedings
13
Volume Editors Cerstin Mahlow University of Basel Nadelberg 4, 4051 Basel, Switzerland E-mail:
[email protected] Michael Piotrowski University of Zurich Binzmühlestr. 14, 8051 Zurich, Switzerland E-mail:
[email protected]
ISSN 1865-0929 e-ISSN 1865-0937 ISBN 978-3-642-23137-7 e-ISBN 978-3-642-23138-4 DOI 10.1007/978-3-642-23138-4 Springer Heidelberg Dordrecht London New York Library of Congress Control Number: 2011933917 CR Subject Classification (1998): I.2.7
© Springer-Verlag Berlin Heidelberg 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Preface
Morphological resources are the basis for all higher-level natural language processing applications. Morphology components should thus be capable of analyzing single word forms as well as whole corpora. For many practical applications, not only morphological analysis, but also generation is required, i.e., the production of surfaces corresponding to specific categories. Apart from uses in computational linguistics, there are numerous practical applications that either require morphological analysis and generation, or that can greatly benefit from it, for example in text processing, user interfaces, or information retrieval. These applications have specific requirements for morphological components, including requirements from software engineering, such as programming interfaces or robustness. With the workshop on Systems and Frameworks for Computational Morphology (SFCM) we have established a place for presenting and discussing recent advances in the field of computational morphology. In 2011 the workshop took place for the second time. SFCM focuses on actual working systems and frameworks that are based on linguistic principles and that provide linguistically motivated analyses and/or generation on the basis of linguistic categories. SFCM 2009 focused on systems for a specific language, namely, German. The main theme of SFCM 2011 was phenomena at the interface between morphology and syntax in various languages: Many practical applications have to deal with texts, not just isolated word forms. This requires systems to handle phenomena that cannot be easily classified as either “morphologic” or “syntactic.” The workshop thus had three main goals: – To stimulate discussion among researchers and developers and to offer an up-todate overview of available morphological systems for specific purposes. – To stimulate discussion among developers of general frameworks that can be used to implement morphological components for several languages. – To discuss aspects of evaluation of morphology systems and possible future competitions or tasks. Based on the number of submissions and the number of participants at the workshop we can definitely state that the topic of the workshop was met with great interest from the community, both from academia and industry. We received 13 submissions, of which 8 were accepted after a thorough review by the members of the Program Committee and additional reviewers. The peer-review process was double-blind, and each paper received four reviews. In addition to the regular papers, we had the pleasure of Lauri Karttunen giving an exciting invited talk on new features of the Finite-State Toolkit (FST).
VI
Preface
The discussions after the talks and during the demo sessions, as well as the final plenum, showed the interest in and the need and the requirements for further efforts in the field of computational morphology. We will maintain the website for this workshop at http://sfcm2011.org. This book starts with the invited paper by Lauri Karttunen (“Beyond Morphology: Pattern Matching with FST"), reporting on new developments for the Finite-State Toolkit, an enhanced version of XFST. The FST pattern matching algorithm allows applications like tokenizing, named-entity recognition, or even parsing. Then follows a paper by M¯arcis Pinnis and K¯arlis Goba (“Maximum Entropy Model for Disambiguation of Rich Morphological Tags"), describing a statistical morphological tagger for Latvian, Lithuanian, and Estonian. The authors explore the use of probabilistic models with maximum entropy weight estimation to cover the rich morphology in these languages. The paper by Benoît Sagot and Géraldine Walther (“Non-canonical Inflection: Data, Formalisation and Complexity Measures") deals with non-canonical inflection, a popular topic in linguistics, but lacking implementation. Representing inflectional irregularities as morphological rules or as additional information in the lexicon allows the implementation within the Alexina framework. The approach holds for several morphologically rich languages like French, Latin, Italian, Sorani Kurdish, Persian, Croatian, and Slovak. The following paper of Gertraud Faaß (“A User-Oriented Approach to Evaluation and Documentation of a Morphological Analyser") emphasizes the need for usercentered evaluation of morphological components. The paper by Krister Lindén, Erik Axelson, Sam Hardwick, Tommi Pirinen, and Miikka Silfverberg (“HFST—Framework for Compiling and Applying Morphologies") reports on the new version of the HFST framework, allowing users to experiment with several finite-state tools for various languages to use in open-source projects. Then follows a paper by Esmé Manandise and Claudia Gdaniec (“Morphology to the Rescue Redux: Resolving Borrowings and Code-Mixing in Machine Translation") covering morphological issues in machine translation of e-mail messages from Spanish to English when bilingual authors use borrowing, code-mixing, or code-switching. The last three papers report on morphological systems for specific languages: Arabic, Indonesian, and Swiss German. Mohammed Attia, Pavel Pecina, Antonio Toral, Lamia Tounsi, and Josef Van Genabith (“A Lexical Database for Modern Standard Arabic Interoperable with a Finite State Morphological Transducer") report on the creation of resources for modern standard Arabic. The paper by Septina Dian Larasati, Daniel Zeman, and Vladislav Kuboˇn (“Indonesian Morphology Tool (MorphInd): Towards an Indonesian Corpus") describes the development of a robust finite state open source morphology tool for Indonesian, motivated by shortcomings of existing resources. The paper by Yves Scherrer (“Morphology Generation for Swiss German Dialects") provides insights into dialectological issues for generation. Although there is a lot of research on Swiss German dialects in the field of linguistics, there is currently only very little related research in the NLP community. The contributions show that high-quality research is being conducted in the area of computational morphology: Mature systems are further developed and new systems and
Preface
VII
applications are emerging. Even though other languages are becoming more important, research in computational linguistics still focuses primarily on English, which is well known for its reduced morphology. Morphological analysis and generation are thus often regarded as being required only for the processing of some exotic languages. The papers in this book come from eight countries, discuss a wide variety of languages from many different language families, and illustrate that, in fact, a rich morphology is better described as the norm rather than the exception—proving that for most languages, as we have stated above, morphological resources are indeed the basis for all higher-level natural language processing applications. The trend toward open-source developments still goes on and evaluation is considered an important issue. Making high-quality morphological resources freely available will help to advance the state of the art and allow the development of high-quality real-world applications. Useful applications with carefully conducted evaluation will demonstrate to a broad audience that computational morphology is an actual science with tangible benefits for society. We would like to thank the authors for their contributions to the workshop and to this book. We also thank the reviewers for their effort and for their constructive feedback, encouraging and helping the authors to improve their papers. The submission and reviewing process and the compilation of the proceedings were supported by the EasyChair system. We thank Alfred Hofmann, editor of the series Communications in Computer and Information Science (CCIS), and the Springer staff for publishing the proceedings of SFCM 2011. We are grateful for the financial support given by the German Society for Computational Linguistics and Language Technology (GSCL) and the general support of the University of Zurich. June 2011
Cerstin Mahlow Michael Piotrowski
Organization
The Second Workshop on Systems and Frameworks for Computational Morphology (SFCM 2011) was organized by Cerstin Mahlow and Michael Piotrowski. The workshop was held at the University of Zurich.
Program Chairs Cerstin Mahlow Michael Piotrowski
University of Basel, Switzerland University of Zurich, Switzerland
Program Committee Bruno Cartoni Simon Clematide Axel Fleisch Piotr Fuglewicz Thomas Hanneforth Roland Hausser Lauri Karttunen Kimmo Koskenniemi Winfried Lenders Krister Lindén Anke Lüdeling Cerstin Mahlow Günter Neumann Michael Piotrowski Adam Przepiórkowski Christoph Rösener Helmut Schmid Angelika Storrer Pius ten Hacken Eric Wehrli Andrea Zielinski
University of Geneva, Switzerland University of Zurich, Switzerland University of Helsinki, Finland TiP Sp. z o. o., Katowice, Poland University of Potsdam, Germany Friedrich-Alexander University of Erlangen-Nuremberg, Germany Stanford University, USA University of Helsinki, Finland University of Bonn, Germany University of Helsinki, Finland Humboldt University Berlin, Germany University of Basel, Switzerland DFKI Saarbrücken, Germany University of Zurich, Switzerland Polish Academy of Sciences, Warsaw, Poland Institute for Applied Information Science, Saarbrücken, Germany University of Stuttgart, Germany University of Dortmund, Germany Swansea University, UK University of Geneva, Switzerland FIZ Karlsruhe, Germany
X
Organization
Additional Reviewers Johannes Handl Besim Kabashi
Friedrich-Alexander University of Erlangen-Nuremberg, Germany Friedrich-Alexander University of Erlangen-Nuremberg, Germany
Local Organization Cerstin Mahlow Michael Piotrowski
University of Basel, Switzerland University of Zurich, Switzerland
Sponsoring Institutions German Society for Computational Linguistics and Language Technology (GSCL) University of Zurich
Table of Contents
Beyond Morphology: Pattern Matching with FST . . . . . . . . . . . . . . . . . . . . . . . . Lauri Karttunen
1
Maximum Entropy Model for Disambiguation of Rich Morphological Tags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M¯ arcis Pinnis and K¯ arlis Goba
14
Non-canonical Inflection: Data, Formalisation and Complexity Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Benoît Sagot and Géraldine Walther
23
A User-Oriented Approach to Evaluation and Documentation of a Morphological Analyser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gertrud Faaß
46
HFST—Framework for Compiling and Applying Morphologies . . . . . . . . . . . . . Krister Lindén, Erik Axelson, Sam Hardwick, Tommi A. Pirinen, and Miikka Silfverberg Morphology to the Rescue Redux: Resolving Borrowings and Code-Mixing in Machine Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Esmé Manandise and Claudia Gdaniec A Lexical Database for Modern Standard Arabic Interoperable with a Finite State Morphological Transducer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mohammed Attia, Pavel Pecina, Antonio Toral, Lamia Tounsi, and Josef van Genabith
67
86
98
Indonesian Morphology Tool (MorphInd): Towards an Indonesian Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Septina Dian Larasati, Vladislav Kuboˇn, and Daniel Zeman
119
Morphology Generation for Swiss German Dialects . . . . . . . . . . . . . . . . . . . . . . Yves Scherrer
130
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
141
Beyond Morphology: Pattern Matching with FST Lauri Karttunen Stanford University, Palo Alto, USA
Abstract. FST stands for Finite-State Toolkit. It is an enhanced version of the XFST tool described in the 2003 Beesley and Karttunen book Finite State Morphology. Like XFST , FST serves two purposes. It is a development tool for compiling finite-state networks and a runtime tool that applies networks to input strings or files. XFST is limited to morphological analysis and generation. FST can also be used for other applications. This paper describes the new features of the FST regular expression formalism and illustrates their use for named-entity recognition, relation extraction, tokenization and parsing. The FST pattern matching algorithm ( ) operates on a single pattern network but the network can be the union of any number of distinct pattern definitions. Many patterns can be matched simultaneously in one pass over a text. This is a distinct FST advantage over pattern matching facilities in languages such as Perl and Python. Keywords: finite-state automata, tokenization, pattern matching.
1 Creating Pattern Networks Most of the FST commands are described in the chapter on the XFST application in the Finite State Morphology book by Kenneth R. Beesley and Lauri Karttunen [1].1 The new features of FST include a special command, , for applying a pattern network to a text and many enhancements to the regular expression formalism for defining networks. 1.1 Simple Patterns The command in FST expects two arguments: a name and a regular expression. It compiles the regular expression and binds the resulting network to the name. The name can then be used in subsequent regular expressions to refer to the network. For example,
defines a network containing four fruit names. The following definition creates a network for recognizing dollar amounts such as $5.10. 1
C. Mahlow and M. Piotrowski (Eds.): SFCM 2011, CCIS 100, pp. 1–13, 2011. © Springer-Verlag Berlin Heidelberg 2011
2
L. Karttunen
!
" #
$%&
#%& '( )
#%& *+
The section in square brackets defines whole numbers from 0 up to any length. The section in parentheses defines the optional decimal part of a number. In the FST regular expression formalism, round parentheses indicate optionality. To create a simple pattern network for matching fruits and prices, we first define and adding a final “end tag” transition:
, -
.), +
/ -
.)/+
We can now make a union of the two networks for pattern matching. 0 , /
The command compiles a regular expression and makes the resulting network available for application. The construct creates a pair symbol, , that has an epsilon (represented by zero) on the input side of the network and a closing XML tag on the output side. The network resulting from the union recognizes fruit names and dollar amounts. The purpose of the end tags is to indicate in the output which of the two patterns was matched. If we now invoke the FST pattern matching command, , on the input An apple costs $1.05 today., the output tags apple as an instance of the pattern and $1.05 as a . !$#1 2 3, 4 3, 4 3/4!$#13/4 2
Here the function inserts an initial XML tag on the fly in front of a string identified by a closing XML tag in the pattern network. The parts of the input string that do not match any patterns are echoed into the output unchanged. Wrapping paired XML tags around matches is the default output mode of , but there are other output options. For example, if we are just interested in the locations of the matches, say, for the purpose of highlighting them in the text, we can instruct to print just the location information ignoring everything that does not match. %
!$#1 2 51 3, 4 $11!$#13/4
Here the output of indicates the beginning byte position of the match, the number of bytes matched, the string itself and its initial tag. If the price was indicated in euros instead of dollars, û, the length of the match would be 7 instead of 5 because the UTF-8 representation of the euro symbol C consumes three bytes instead of just one for $. The regular expression compiler in FST has a few new types of symbols not documented in [1]. For example, any collection of symbols may be defined as a LIST: $ &
$%&
The list contains the digits from 1 to 9. FST comes with many system-defined lists such as , , !, , etc. An equivalent definition for list would be
Pattern Matching with FST
$ & %
3
#
The expression "# creates the symbol $"%$ that matches any of the nine digits in the list . The expression creates the symbol $&% $ that matches any symbol that is not a digit. The definition of '! given above can be stated more compactly using list membership symbols instead of enumerating the digits explicitly:
!
" #
6 )$ &+ 6 ) +'(
)
6 ) +*+
List membership symbols make it possible to represent a language in a smaller network. For example, the non-list expression ( represents the union )*+*,*-**.*/* 0*1. It compiles into a network with nine arcs, one for each digit, whereas the network for "# contains just one arc.
0
@L.1to9@
1
Fig. 1. The language 6 )$ &+
Another useful innovation in FST regular expressions is the notion of an INSERT symbol. If is defined as a network, the expression # creates an insert symbol $ % $. When the routine encounters an arc labeled $ % $ it traverses it only if it can match a string in the network. Taking advantage of insert symbols, we could replace our earlier definition of and by
, , ) + -
.), +
/ , ) + -
.)/+
and create a network for matching the two patterns with 0 , ), + , )/+
creating the network in figure 2.
0
@I.Item@ @I.Price@
1
Fig. 2. Pattern for fruit items and prices
As the example shows, a network referred to by an insert symbol may itself contain insert symbols. For example, in order to traverse the $ % $ arc in figure 2, has to push from the network into the '! network.2 2
We come back to this topic in section 3.2.
4
L. Karttunen
1.2 Expanded Patterns Lists of strings can be modified to include upper case and capitalized versions by using some of the predefined FST functions. Functions such as 2 3#& take a regular expression as an argument, compile it with some modification and return the resulting network. Here are some examples of built-in functions for case conversion. 78) 2 + 78)+ 78)A 6+ = 78)+ 8) 2 + = 8) 2 + FST
994 994 994 994 994 994
:-; <=>? @ @ @A : < : < A : 2 A < A 2
allows the user to define new functions such as the example below.
B08)C+ "= 8)C+ 78)C+( 0 B08) +
The expression 4 3#5! expands the list of fruit names with capitalized and upper case version of each word. Instead of apple, we now have apple, Apple and AP PLE. A definition of a particularly interesting function, 6 ! &, is given in Appendix 2. 6 ! expands apple to apple and apples. As shown below, these two functions can be nested.
-0
)C+ B08)= / )C++
For example, 5! gives us the lower-case, upper-case and capitalized versions of apple, apples, peach, peaches, etc. 1.3 Relations Fruit items and dollar prices are simple examples of “named entities.” A RELATION joins two or more entities. An obvious relation for items and prices is COST . There are many ways to express the idea that X costs Y. For example, we might define a minilanguage of cost phrases as follows:
8 / D 2
Given the earlier definitions, we can now create a network for extracting or marking cost relations.
;E F 0 , ), + ;E , )8 /+ ;E , )/+ -
.)8 +
With this pattern the routine marks items and prices that are related by a cost phrase: !$#1 2 38 43, 4 3, 4 3/4!$#13/438 4 2 !$1# 38 43, 43, 4 3/4!$1#3/438 4
Pattern Matching with FST
5
If we are primarily interested in pairing items and prices but not in the particular way the cost relation is expressed, we can create alternate definitions, 789 and 3# #9, transducers that have to match the input for to succeed but produce no output because they have only epsilons on the output side.
;EG
8 /G
#;E #8 /
With these definitions, we can produce a more minimalistic output. 0 , ), + ;EG , )8 /G+ ;EG , )/+ -
.)8 + %
!$#1 2 5$H3, 4 3, 43/4!$#13/438 4
1.4 Context Conditions A pattern definition may include constraints on the context. A constraint is a condition that has to be met for a string to count as a valid match for a pattern without itself being a part of the pattern. For example, we could decide that apple and $1.05 should be marked as being in the 3# relation in examples such as Whole Foods sells an apple for $1.05. There is a class of phrases for commercial transactions that can be used as an indication that a Y is a cost for X when X and Y are separated by for. In FST this idea can be encoded in the following way;
8 / 2 D
I 0 , ), + ;EG # ;EG , )/+ -
.)8 + 68)8 / ;E )I ;E++ %
; !$#1 J$$13, 4 3, 43/4!$#13/438 4
Here the end tag of the 3# pattern, 3#, is followed by a condition on the left context: "33 # 78 : 78. The effect is that a successful match for the 3# pattern, apple for $1.05, counts as a valid 3# expression only if it is preceded by a phrase such as sells an, that is, a commerce phrase and a determiner. Context constraints can be used to disambiguate ambiguous expressions such as 1/4. 1/4 can be interpreted as a fraction (one fourth) or as a date (Jan. 4). If we construct a pattern that includes both possibilities will always recognize 1/4 both as a fraction and as a date: 0 $K "-
.) + -
.)I +( $K 3 43I 4$K3I 43 4
The network is shown in figure 3. However, in many contexts 1/4 is disambiguated by the surrounding words as in 1/4 of voters, 1/4 or before. A following preposition such as of is a positive right context, RC, for the fraction interpretation and a negative right context, NRC, for the date reading. The following FST regular expression correctly distinguises the two cases:
6
L. Karttunen
1
0
1
/
2
4
3
:0 :0
4
Fig. 3. Ambiguity of 1/4
0 $K
"-
.) + >8) + -
.)I + :>8) +( $K D 3 4$K3 4 D $K 3I 4$K3I 4
Figure 4 shows a network with of as a positive condition for 5 and as a negative context for :. :0 0
1
1
/
2
4
3
4
:0
NRC RC
6
""
7
o
8
f
9
5
Fig. 4. Right context conditions
Expressions like 1/4 can also be disambiguated by a preceding word. 1/4 must be a fraction in under 1/4 but a date in due on 1/4. In the following expression LC indicates that under is a positive left context for the fraction interpretation and NLC marks it as a negative left context for the date reading. 0 $K "-
.) + 68)
+ -
.)I + :68)
+(
$K
3 4$K3 4 $K
3I 4$K3I 4
Finally, left and right-context conditions may be combined with an AND or with an The following regular expression stipulates that 1/4 is tagged as a fraction if it is either preceded by under or followed by of. For the date reading, these are both negative contexts. OR.
0 $K "-
.) + =>) 68)
+A >8) ++ -
.)I + :I):68)
+A :>8) ++(
The resulting network is shown in figure 5. This example demonstrates that FST can encode any context condition on regular languages that is expressible in propositional logic. In a manner similar to De Morgan’s Law, negation is pushed down to the NLC and NRC constraints: ¬(p ∨ q) is equivalent to (¬p ∧ ¬q). The context conditions are compiled as part of the pattern network in the usual way except that the subnetworks for left-context conditions, LC and NLC, are reversed. This
Pattern Matching with FST
:0 0
1
1
/
2
4
3
4
NRC
AND
6
NLC
9
""
8
""
o
11
7
f
13
u
:0
OR
5
RC
7
r
10
e
12
14
d
16
n
15
17
LC
Fig. 5. AND and OR conditions for 1/4
is because they are checked by going right-to-left in the input. Figure 6 shows the piece of the network FST compiles from "3;! <. The first symbol to be checked after the "3 marker is a space followed by the letters of under in reverse order. 0
LC
1
""
2
r
3
e
4
d
5
n
6
u
7
Fig. 6. LC({under })
In this toy example, the left context consists of a single string but it can of course be a set of phrases as in the case of 3 # above. Any regular language can be used as a context condition.
2 The Algorithm The algorithm is based on four simple ideas: 1. 2. 3. 4.
The left-to-right longest-match principle. Only the end of a pattern needs to be labeled. Multiple patterns can be matched in parallel. The same string can be a match for more than one pattern.
If an input symbol matches the first symbol of a pattern the algorithm fetches the next symbol from the input and continues as long as it can follow a path in the pattern network. If a valid match is found, the algorithm produces an output and restarts from the end of the matching string. If the match is unsuccessful, moves one symbol to the right from the previous starting point in the input and tries again. The left-to-right longest-match principle entails that a match is valid only if a longer match cannot be found. If finds a longer match for a pattern, the output for any previously found shorter match is discarded. In the following example recognizes apple by itself but not when it is a prefix of a longer match apple pie. 0 " ( -
.) + ; L ; 3 4 3 4 3 4 3 4L
If the same string is a match for two patterns, as in figure 3, returns both results. If an end tag arc leads to a non-final state as in figures 4 and 5, the match is successful only if the condition or conditions emanating from that state are satisfied. RC and NRC
8
L. Karttunen
conditions are checked by going forward in the input, LC and NLC conditions require going left from the start of the potential match. If the check leads to a final state, it is a success for the LC and RC conditions and a failure for the NLC and NRC conditions. The algorithm makes certain assumptions about valid starting and ending points for a match. For example, if all of the words in the pattern network start and end with an alphabetic symbol, does not restart matching unless the last symbol seen was a non-alphabetic symbol. Given the definition of 5! given in section 1.1, does not recognize the string apple when it is part of a longer word as in crabapples.
3 Other Applications Although was originally designed for simple pattern matching, it has other applications such as REDACTION. Instead of marking or locating instances of a pattern, FST can be instructed to delete them or to redact them in some other way. Other useful applications include tokenization and parsing. 3.1 Tokenization The algorithm takes a special action when it matches a pattern that has a closing end tag but it does not require that there be such a tag. If a pattern matches and there is no special end tag, outputs what it has matched with whatever modifications the pattern network makes. For example, in the following case: 0 M E E M
we get the output She said bonjour to me in French. because recognizes the string good morning and maps it to bonjour. The rest of the input is echoed into the output unchanged because nothing else matches a pattern. As the following example shows, in this tagless mode can be used for tokenization.
;E F "6 )+*(
; 6 ) +*
I 6 ) + F # >8) + F # 6 ) + 68) +
B ; : < : < 2 0 ;E ; I , )B ; + : < A 2 A : < A 2
A D
: <
D : <
Pattern Matching with FST
9
This mini-tokenizer preserves whitespace inside a multiword word expressions such as New York. Elsewhere whitespace is normalized to a single newline. Delimiter characters such as punctuation marks are split off from alphanumeric strings by newlines unless they are part of a declared multiword string. Compared to the techniques for tokenization presented in Chapter 9 of the B&K book, the algorithm has a great advantage because it can accommodate large lists of multiword expressions. This is due to the fact that the left-to-right longest-match principle is enforced at runtime by . It is not encoded physically in the state and arc space of a tokenizing transducer as [1] requires. As the authors themselves admit (p. 431), introducing many multiword expressions makes tokenizing transducers “blow up” in size. 3.2 Parsing As we already saw in section 1.1, networks referred to by an insert symbol may themselves contain inserts with no limit on the depth of the nesting.
I : 2 / :/ , )I+ , ):+ ) , )//++ // , )/+ , ):/+
Here the definition of = refers to the definitions of :, and
by way of inserts, and the definition of
refers back to the definition of = , and vice versa. The
pattern includes strings such as with the girl and on this side that contain a simple noun phrase, and it also includes strings such as with the girl from the city, on this side of the ocean and with the girl from the city on the ocean where the = contains one or more embedded
s. This is an example of an RTN [3], a RECURSIVE TRANSITION NETWORK. If we choose to add an end tag only to the top level pattern component, as in the first example below, the internal structure of the = is not shown in the output. 0 , ):/+ -
.):/+ E 2 E 3:/4 2 3:/4
If we add an end tag to each of the individual definitions, will ouput a complete parse tree:
I " ( -
.)I+ : " 2 ( -
.):+ / " ( -
.)/+ :/ , )I+ , ):+ ) , )//++ -
.):/+ // , )/+ , ):/+ -
.)//+
0 , )//+ , ):/+ 2 3:/43I43I4 3:4 23:43:/4
3//43/4 3/4 3:/43I4 3I4 3:4 3:43:/43//4
10
L. Karttunen
Although the = pattern in the above example is recursive, the same = language can also be defined using iteration. The last element in the definition of = , 78 #
, could be replaced by ";E , )/+ ;E , )I + ;E , ):+('
Any example of “tail recursion” such as this one can be replaced by iteration showing that language is in fact regular (finite-state). But recursive transition networks can also represent non-regular context-free languages as the following example shows.
;E
I
8:
N.
=M> , ):/+ ;E , )N +
> ;E =M>
: , )8:+ ;E , )> +
:/ , )I + ;E , ): + 0 , ):/+ -
.):/+ 3:/4 3:/4
In this example, the = language includes strings such as the mouse that the cat chased as well as the cheese that the mouse that the cat chased ate. This last phrase shows that the definition of 6>?@ introduces an embedded = in the center of another = . This kind of recursion cannot be converted to iteration. Because the routine in FST correctly interprets insert symbols in such cases, it can recognize context-free languages that are not regular.
4 Conclusion is mature industrial software. The beginnings of it date back to the early 1980s.3 The pattern matching capability of FST has evolved over the past ten years. It was first publicly disclosed in a PARC Technical Report [2] which the present article is based on. Although it is possible to use FST for context-free parsing, it is most likely less efficient for that application than a chart parser would be. FST is best applied to matching patterns that are regular languages. Compared to other similar tools, the main asset of FST is that it can compile any number of patterns into a single network and apply all the patterns in parallel in a single run. The networks can be created with the help of a regular expression calculus that is more expressive than the regular expression formalism found in tools such as Perl or Python. It is very likely that, by the time of the SFCM 2011 meeting, FST will be an open source project, freely available for non-commercial purposes. FST
3
The first version of FST was created at the Xerox Research Centre Europe ( XRCE) in Meylan, France, in the mid 1990s. FST is writen in C . It runs on 32 and 64 bit versions of Linux, MacOSX and Windows operating systems. For the past several years FST has been maintained and improved by the Palo Alto Research Center (PARC), a Xerox Company, in collaboration with Powerset, now part of Microsoft.
Pattern Matching with FST
11
Acknowledgments. Thanks to Kenneth R. Beesley, Kyle Dent, Ronald M. Kaplan, John T. Maxwell III and Annie Zaenen for their helpful comments and suggestions on earlier drafts of this paper.
References 1. Beesley, K.R., Karttunen, L.: Finite State Morphology. CSLI Publications, Palo Alto (2003) 2. Karttunen, L.: Pattern Matching with FST – A Tutorial. Technical Report TR-2010-01. Palo Alto Research Center, Palo Alto, CA (2010) 3. Woods, W.A.: Transition Network Grammars of Natural Language Analysis. Comm. ACM 13(10), 591–606 (1970)
Appendix 1: A Simple Example of an FST Pattern Matching Script This small example illustrates how FST compiles patterns from files and regular expressions and applies them to an input file. The set of FST commands shown later presupposes the six source files below. Three of them contain names, two contain a regular expression, and the last one is an input file. O 0 8 8
8
Q ? 2 / :
Q2 Q O / : ) ) 6 ) +R5 O E E: 6 ) +R5 %
O I 0 E P
P P E
B +
)
++
6 ) +RJ
6 ) +R5 %
O B D 0 . Q I . 8 . : 2 : . E 6 %
6 ) +RK
6 ) +RK
O 0 B2 2 $J5%K1%SHT& < )S1#+ T$J% K1SH Q ? 2
Q2 Q . 8 . 8 2 P . Q I
The first three FST commands below define networks for Actors, Movies, and Dictators. The $ operator instructs the regular expression compiler to compile a text file into a network. The effect of the 6 3 & " function is to make initial capitalization optional on the lower side of X, that is, on the input side of .
= 8) 0 0 A 6+ -
.) +
B D = 8) 0 B D 0 A 6+ -
.)B D+
I = 8) 0 I 0 A 6+ -
.)I +
The next command defines a pattern network for our five types of entities. The $ operator instructs the regular expression compiler to process a file as a regular expression.
12
L. Karttunen
/ , ) + , )I + , )B D+ / : -
.)/ :+ E E: -
.)E E:+
Because dashes might be followed by a line break in the input, it is useful to compose the pattern network with a transducer that inserts an optional newline character after a dash on the input side of the network. This allows a line break after a dash, that is, an optional A label, in the pattern. 0 /
"( )%4+ F
%
U
We set FST into locate-patterns mode and apply the network we just compiled into the input file. %
%
3 0
Here is the ouput: J&$$$J5%K1%SHT&3E E:4 S$$1)S1#+ T$J%K1SH3/ :4 HT$$Q ? 23 4 &K$#Q2 Q 3 4 $$H$S. 8 .3B D4 $5S$18 8 3 4 $1&SP 3I 4 $S&$T. Q I 3B D4 /
O %%%%%%%%%%%%%%%%%%%%%%%%%%%% 3 4 5 3I 4 $ 3B D4 J 3/ :4 $ 3E E:4 $ %%%%%%%%%%%%%%%%%%%%%%%%%%%% . T
Note that chaplin was normalized to Chaplin because capitalization on the input side was made optional. The newline after 812- in the input was mapped to an epsilon with the effect of stitching (650) 812-An and 4567 into (650) 812-4567 in the output. The byte count, 15, includes the suppressed newline.
Appendix 2: Examples of FST Function Definitions Here is the definition of 6 ! & mentioned in section 1.2. This script first defines the function ! & that maps an English word to its plural form or forms.
Pattern Matching with FST
13
Irregular singular:plural pairs are listed directly. Regular plurals are produced by a cascade of five replace rules. The irregular and regular forms are combined by the priorityunion operator, % %, that chooses an irregular form over a regular one if an irregular form exists. If a particular word has no irregular plural, a regular plural is produced. The 6 ! & function defined at the end of the script maps a singular English word to the union of itself and its plural or plurals, whether they be regular or irregular. For example, 6 ! & maps peach to peach and peaches, wolf to wolf and wolves, sheep to itself. 6 ! ; !< pairs millennium with millennium, millennia and millenniums because the two alternative plurals are both acceptable in current American English. define Irregular [{bacterium}:{bacteria} | {child}:{children} | {corpus}:{corpora} | {crisis}:{crises} | {criterium}:{criteria} | {curriculum}:{curricula} | {datum}:{data} | {deer} | {fish} | {foot}:{feet} | {genus}:{genera} | {goose}:{geese} | {hypothesis}:{hypotheses} | {louse}:{lice} | {man}:{men} | {medium}:{media} | {memorandum}:{memoranda} | {millennium}:{millennia} | {millennium}:{millenniums} | {mouse}:{mice} | {oasis}:{oases} | {offspring} | {ox}:{oxen} | {phenomenon}:{phenomena} | {sheep} | {stratum}:{strata} | {thesis}:{theses} | {tooth}:{teeth} | {woman}:{women}]; define AddE f | s | {sh} | {ch} | x | z | {acco}|{alvo}|{ando}|{ango}|{anjo}|{argo}|{asco}|{asso}| {atto}|{cano}|{cebo}|{digo}|{echo}|{ecko}|{egro}|{endo}| {ento}|{esco}|{esto}|{etto}|{falo}|{fico}|{hako}|{halo}| {hero}|{hobo}|{ildo}|{illo}|{ingo}|{into}|{irdo}|{lago}| {lico}|{mago}|{mato}|{mino}|{nado}|{nkgo}|{otto}|{pedo}| {rado}|{rago}|{tato}|{tico}|{tigo}|{ucco}|{uito}|{utgo}| {veto}|{viso}|{zebo}; define Vowel a | e | i | o | u ; define FtoV {cal}|{dwar}|{el}|{hal}|{hoo}|{kni}|{lea}|{li}|{loa}|{scar}| {sel}|{shea}|{shel}|{thie}|{whar}|{wi}|{wol}; define Regular [..] -> %. || _ .#. .o. %. -> {es} || AddE _ .#. .o. # buses, heroes y %. -> {ies} || \Vowel _ .#. .o. # ladies %. -> s || _ .#. .o. # days, pianos f -> v || .#. FtoV _ {es} .#. ; # wolves define Plural(X) [[X .o. Irregular] .P. [X .o. Regular]].l; define OptPlural(X) X | Plural(X);
Maximum Entropy Model for Disambiguation of Rich Morphological Tags M¯arcis Pinnis and K¯arlis Goba Tilde, Vienibas 75a, LV-1004 Riga, Latvia
Abstract. In this work we describe a statistical morphological tagger for Latvian, Lithuanian and Estonian languages based on morphological tag disambiguation. These languages have rich tagsets and very high rates of morphological ambiguity. We model distribution of possible tags with an exponential probabilistic model, which allows to select and use features from surrounding context. Results show significant improvement in error rates over the baseline, the same as the results for Czech. In comparison with the simplified parameter estimation method applied for Czech, we show that maximum entropy weight estimation achieves considerably better results. Keywords: Tagger, maximum entropy, inflective languages, Estonian, Latvian, Lithuanian.
1 Introduction The scope of this work covers three languages—Estonian, Latvian and Lithuanian, all of which have rich nominal and verbal morphology. While inflections in Estonian are formed agglutinatively, Latvian and Lithuanian share similar fusional morphology. All three languages exhibit high ambiguity of possible morphological analyses of a word, which in the case of Latvian and Lithuanian can be explained by their fusional nature, with several inflections sharing the same morphemes. In Estonian some agglutinative morphemes are shared between several inflections, producing homonymous surface forms. 1.1 Morphological Tagging Morphological tagging can be viewed as a classification problem for a given word sequence (typically sentence), where each word is assigned a single tag describing its morphological properties. In this work, all three languages are processed within the same framework. Morphological analysis of a word (or in general, token) is encoded in a single tag consisting of fixed number of subtags corresponding to certain morphological categories (e.g., part of speech, gender, number, etc.). Like in similar work for Czech [2], we take a two-step approach to tagging, where a token is first analyzed for possible morphological tags and disambiguated separately. C. Mahlow and M. Piotrowski (Eds.): SFCM 2011, CCIS 100, pp. 14–22, 2011. c Springer-Verlag Berlin Heidelberg 2011
Disambiguation of Rich Morphological Tags
15
POS adjective GENDER male NUMBER plural CASE nominative DEGREE positive DEFINITENESS indefinite Fig. 1. Example of a morphological tag for the Latvian word pašsaprotami (lit. self-evident)
The morphological analyzer is based on a lemma lexicon and inflectional rules, and produces one or several analyses for a given word. The tagger then disambiguates the analysis by estimating probabilities of individual analyses and selecting the most probable. In this work, we used an unified morphological analyzer consisting of a rule-based analysis module for Latvian and Lithuanian (developed by Tilde), and a separate analysis module for Estonian [6] (developed by Filosoft). 1.2 Morphological Tagset The notion of tagset includes the set of valid combinations of subtags. Some subtags are mutually independent (e.g. a noun can decline in number and case independently), while others are valid only in certain contexts (e.g. tense is only valid for verbs). The morphological tagset used for all three languages is similar to MULTEXT-East format [7] and consists of 28 categories. Each category is represented as a singlecharacter subtag (see figure 1 for an example tag), with ‘0’ corresponding to no value. While each language uses its own subset of all categories and their values, the category positions within the morphological tag and their meanings remain fixed.
2 Training Data The training data (see figure 2 for a sample) for the morphological tagger consists of multiple lines; where each line represents a token and a sequence of possible tags. Sentences are separated with an empty line. The sequence of tags is given by the morphological analyzer of the particular language. The first tag is always the correct (manually annotated) tag. If the morphological analyzer does not recognize a token, it returns an empty tag. We assume that the morphological analyzer has recognized all tokens, thus the morphological tagger does not process unknown words and the tagging task is reduced to a morphological disambiguation task for known tokens. We use morphologically disambiguated corpora for each of the three languages (Estonian, Latvian and Lithuanian) to train and test the morphological tagger. Internal corpora were used for Latvian and Lithuanian, which consist of fiction, newspaper articles, scientific papers, business reports and letters, government documents, legal documents, student essays and theses, IT documents (such as manuals and web site information) and forum comments. Latvian and Lithuanian corpora were
16
"
M. Pinnis and K. Goba
!
Fig. 2. Latvian training data excerpt (lit. all was at end)
pre-tagged using a morphological analyzer and then given to annotators for manual disambiguation. Due to budget limitations, each token has been disambiguated only by one annotator, which lowers the corpus quality and creates unnecessary noise in the corpora. For the Estonian tagger a freely available morphologically disambiguated corpus [9] was used, which consists of fiction, legal, newspaper and scientific texts. In this corpus, each word has been annotated by two annotators and disagreements have been resolved by a third annotator, thereby increasing the corpus quality. The Estonian corpus tagset is different to our unified tagset, therefore it had to be converted to the Multext-East tagset using a one-to-one transformation and a transformation from Multext-East to our unified tagset with some minor transformations to adjust the corpus to our unified morphological analyzer. In order to create the training data for the morphological tagger, the ambiguous tag sequence had to be created, therefore, the corpus was preprocessed also with our morphological analyzer. After disambiguation, the corpora were split into training and test data so that none of the test sentences would be present in the training data. The final corpora statistics is shown in table 1. Table 1. Training and test corpora
Total tokens Sentences Ambiguous words, % Word OOV rate Distinct tags Tag perplexity Test data, % Test tokens
Estonian
Latvian
Lithuanian
419,137 31,266 32.4% 1.5%
117,362 6,564 48.5% 3.0%
71,460 4,201 36.0% 2.3%
268 48.86
1401 184.46
1052 125.60
6% 26,366
10% 12,826
10% 8,103
2.1 Ambiguity Classes Following the work for Czech [2], we use the notion of ambiguity class to describe possible morphological ambiguities within a subtag. For example, ambiguity class POSan describes part of speech ambiguity between noun and adjective. There are in total 216, 250 and 259 ambiguity classes throughout 22, 20 and 14 ambiguous morphological categories in the Latvian, Lithuanian and Estonian language training corpus respectively.
Disambiguation of Rich Morphological Tags
17
3 Model The tagging model is based on the exponential probabilistic model used for Czech [2]. We assume that individual subtags {yPOS , yTENSE , yGENDER , . . .} are independent, and model the probability of a candidate tag as a product of individual subtag probabilities: p(y) =
∏
p(yc ) .
(1)
c∈CAT
The subtag probabilities are modeled separately within each ambiguity class AC. The probability of an event y in context x is modeled as an exponentially weighted sum of feature functions [1]: pΛ (y|x) =
exp ∑i λi fi (y, x) , Z(x)
(2)
where f (y, x) are binary valued feature functions predicting event y in context x, and Z(x) is the normalization factor. Here, events correspond to subtag values in a corresponding morphological category, and features describe the surrounding context of a word in a sentence.
4 Training 4.1 Feature Selection The training of the morphological tagger heavily relies on the feature set used in the training and tagging process as can be seen in the results section. We use binary feature functions, which consist of a context address, function type (for instance, simple types, such as, part of speech, gender, number, also the token itself, or complex types, such as gender, number and case equality with the token whose category is being predicted) and the value of the function type (for example, ‘a’ for part of speech or ‘kas’ for a token in Latvian). We use the value ‘ ’ to define equality of the function type of the token in the address defined by the function and the function type of the token whose category is being predicted. The first line of the feature excerpt in figure 3, therefore, is read in the following way: if the next token is either a conjunction or a comma, the gender, number and case of the second token to the right have to agree with the gender, number and case of the predicted token. Our morphological tagger uses different feature sets for each of the ambiguity classes in the training corpus. Therefore, a feature selection algorithm was used in order to
#$ #$ # ,
% &'$ )*+ )*+ )*+
((
Fig. 3. First four feature excerpt for the Latvian part-of-speech ambiguity class ‘qsv’
18
M. Pinnis and K. Goba
select the best features that describe each of the ambiguity classes. But before the selection algorithm was applied, the initial feature set was generated using all possible categories, events, context position indicators (up to three tokens to the left and right) and some trigger words (conjunctions, prepositions, particles and adverbs) extracted from the training corpus. Although the trigger words increased the precision, the increase was very insignificant (in the order of 10−2 of a percent). This might be due to the fact that the part-of-speech feature functions already express the characteristics of the trigger words and, thus, the increase is very low. The feature generation resulted in 10017, 3801 and 3045 initial features for Estonian, Latvian and Lithuanian respectively. When the initial feature set was created, a simple feature selection algorithm based on the maximal mutual information was used to select the set of feature functions with the highest score for each ambiguity class. The maximal mutual information of a feature function in an ambiguity class is p(x, y)
I(X ;Y ) =
∑ ∑ p(x, y) log p(x)p(y) ,
(3)
y∈Y x∈X
where X = {0, 1} corresponds to the binary value of feature function, Y is the set of possible events in the ambiguity class being processed (for instance, {‘a’, ‘n’} for the ambiguity class ‘an’), p(x) is the probability of the feature function to receive the value x in the context of the ambiguity class, p(y) is the probability of the event y in the ambiguity class and p(x, y) is the probability of the feature function receiving the value x and the event simultaneously being y in the ambiguity class. All probabilities are computed as normalized frequency distributions. Out of all initial feature functions a total of 1684, 775 and 742 feature functions were selected as important by the feature selection algorithm throughout all ambiguity classes for Estonian, Latvian and Lithuanian respectively for the best exponential models (applying a maximum of 150 feature functions in an ambiguity class for Estonian, 100 for Latvian and 50 for Lithuanian). 4.2 Model Parameters We use a maximum entropy library developed at the Tsujii Laboratory of The University of Tokyo [8] to train the models of each of the ambiguity classes. The maximum entropy library features the LMVM (Limited Memory Variable Metric) parameter estimation [5], where parameter re-estimation, in comparison with iterative scaling algorithms, such as IIS (Improved Iterative Scaling) (for instance, in our tests IIS performed up to 30 times slower on the Latvian corpus using 150 features), converges significantly faster [4]. The estimated weights together with the feature sets of all ambiguity classes are combined in a single tagging model, which is used in the tagging process. When disambiguating a token, we use the exponential model (1) to predict all events y in the context x for each ambiguity class of a token. Then we combine the probabilities of separate event predictions using a slightly modified version of the formula (1) for each possible tag [2]: p(y|x) =
∏
c∈CAT
(1 − α )pACc (yc |x) + α pACc (yc ) ,
(4)
Disambiguation of Rich Morphological Tags
19
where we use linear interpolation of the model probability and the probability of the event y in the ambiguity class AC (which, in fact, is the frequency distribution of the event y in the ambiguity class AC) as a smoothing method. The α weights were manually estimated based on the highest training corpus precision. The usage of linear smoothing with the frequency distribution of an event in an ambiguity class has proven to increase the overall precision by 0.2–0.3%.
5 Results We compare the error rates of the exponential model trained on Estonian, Latvian and Lithuanian data (table 2) with HunPos, a HMM trigram tagger [3]. We trained the exponential model with Maximum Entropy parameter estimation and the simplified parameter estimation described in [2]. The baseline error rate is computed using only the category label statistics (with α = 1). HunPos tagger was run in guided mode, with possible morphological tags provided for each token. We trained and evaluated the exponential maximum entropy models on various numbers of selected features (using the maximal mutual information feature selection method) and the best test results were achieved using 150, 100 and 50 features for Estonian, Latvian and Lithuanian respectively. Table 2. Error rates Experiment
Estonian
Latvian
Lithuanian
Baseline HunPos Exponential; simplified estimation Exponential; ME estimation
9.72 8.51 6.98 4.04
14.00 6.67 12.76 8.49
7.47 14.55 6.82 5.65
Exponential; training data
3.07
5.32
3.76
Feature functions
150
100
50
Based on the best exponential maximum entropy models, we also evaluated the individual subtag error rates over all test tokens (table 3). The results suggest that for all languages the error rate distribution is fairly similar (with an exception of Estonian, in which gender is not used), more precisely, the categories with the most misclassifications are: part of speech, gender, number and case; case being the most difficult to predict. 5.1 Error Analysis We have performed error analysis on the Estonian, Latvian and Lithuanian exponential models with 150, 100 and 50 feature functions respectively. For better interpretation of tagging errors, we grouped the errors by differences between the correct and the
20
M. Pinnis and K. Goba Table 3. Error rates within categories
Category POS GENDER NUMBER CASE PERSON TENSE MODE VOICE REFLEX NEGATIVE DEGREE DEFINITENESS DIMINUTIVE PREPNUMBER PREPCASE PREPTYPE NUMTYPE PRONCLASS PARTTYPE VERBTYPE ADVTYPE CONJTYPE
Estonian
Latvian
Lithuanian
2.11 — 1.31 2.15 0.29 0.58 0.31 0.60 — 0.30 0.50 — — — 0.30 0.30 0.24 — 0.02 — — 0.27
1.91 2.67 3.34 4.53 0.37 0.82 0.48 0.51 0.05 0.00 0.80 0.68 0.40 0.60 0.63 0.16 0.02 0.22 0.06 0.43 0.03 0.58
2.33 2.06 2.23 2.64 0.37 0.88 0.67 0.63 0.09 0.05 1.01 0.96 0.05 — 0.12 0.07 0.06 0.57 0.08 0.15 — 0.88
predicted tags. The cumulative error rates of the most common error types for each language (table 4) show that the top six errors cover approximately 50% of all errors in each language training corpus. The error type, for instance, ‘n → g (case)’ given in table 4 explains that instead of the case n (nominative) the case g (genitive) was selected as being more probable. Other error types in the table include number (s - singular, p plural), case (p - partitive, a - accusative) and part of speech (a - adjective, c - conjunction, n - noun, q - particle, r - adverb). When analyzing the top six errors of the Latvian morphological tagger, it can be seen that the errors are fairly regular, for instance, for the error type ‘m → f (gender)’ (as well as for the opposite) a common misclassification is done for the pronoun ‘to’, which is obvious as the gender can either be distinguished by the sentence context (for instance, in noun phrases), by an anaphora resolution or cannot be distinguished at all in the case when the context is too small. As the feature functions do not consider anaphora resolution for pronouns of this type and the context may not reveal the correct gender, the statistical morphological tagger makes misclassifications. Another common misclassification occurs in noun phrases where adjuncts are used, for instance, consider the error type ‘sa → pg (number & case)’. The adjunct number and case in most cases, when observing the context to the right, can be identified, but the tagger makes a
Disambiguation of Rich Morphological Tags
21
Table 4. Top six error types Language
Estonian
Latvian
Lithuanian
Correct → Wrong (Category) n → g (case) g → n (case) p → s (number) s → p (number) p → g (case) r → c (part of speech) pg → sa (number & case) m → f (gender) f → m (gender) sa → pg (number & case) pn → sg (number & case) a → n (case) pn → sg (number & case) f → m (gender) m → f (gender) q → c (part of speech) sg → pn (number & case) a → n (part of speech)
Error Coverage 13.22 24.85 32.25 39.36 45.01 49.52 14.53 26.31 33.68 39.83 45.96 50.86 14.89 28.16 34.12 39.08 44.01 48.53
misclassification. This suggests that for specific ambiguity classes, either wrong feature functions have been prioritized or more complex feature functions would have to be generated, that address the issue of misclassification.
6 Conclusions The results of the application of maximum entropy modeling to Estonian, Latvian and Lithuanian confirms the suitability of this method for morphologically rich languages and corresponds well to the results for Czech [2]. The exponential tagger performs significantly better than the baseline and in two cases significantly better than HMM tagger. In the case of Latvian, we have observed an interesting deviation in favor of HMM tagger. Also the high tagset perplexity for Latvian indicates that careful investigation of training data quality is necessary. The feature selection algorithm used in our training and evaluation experiments does not consider interfeature relations, which lowers the final tagging precision because features, which in combination perform well, may not be selected and features, which in combination perform poorly, on the contrary, may be selected. Therefore, a better feature selection algorithm would be the use of iterative feature selection as explained by [2]. As we use the maximum entropy training method, the iterative feature selection would require large computing resources. An interesting experiment would be to run the iterative feature selection based on the simplified weight estimation algorithm and compare the results to the model acquired by maximum entropy training on the features selected by the iterative feature selection.
The tagger model could be extended to handle unknown words, making it possible to avoid the shortcomings of the lexicon-based analyzer. In this case, the ambiguity class is unknown, and the model needs to be adjusted. One possibility would be to combine subtag classifiers trained on the whole data (as opposed to being conditioned on the ambiguity class). In this case, some model of valid subtag combinations should be used to avoid predicting invalid tags.

The combination of subtag models currently treats all subtags equally. This combination could be parameterized by weighting the individual subtag probabilities in a log-linear fashion, effectively treating subtag probabilities as feature values. This approach would allow the parameters to be tuned and would enable minimum error rate training. Also, more features (like subtag classifiers over all training data) could be added.

Acknowledgements. The research within the project Accurat leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007–2013), grant agreement no. 248347.
References

1. Berger, A., Della Pietra, S., Della Pietra, V.: A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics 22(1), 39–71 (1996)
2. Hajič, J., Vidová-Hladká, B.: Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. In: Proceedings of the COLING-ACL Conference, Montreal, Canada, pp. 483–490 (1998)
3. Halácsy, P., Kornai, A., Oravecz, C.: HunPos – an Open Source Trigram Tagger. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pp. 209–212. Association for Computational Linguistics, Prague (2007)
4. Malouf, R.: A Comparison of Algorithms for Maximum Entropy Parameter Estimation. In: Proceedings of CoNLL 2002, pp. 49–55 (2002)
5. Benson, S., Moré, J.: A Limited Memory Variable Metric Method in Subspaces and Bound Constrained Optimization Problems. Technical Report ANL/MCS-P909-0901, Argonne National Laboratory (2001)
6. Kaalep, H.-J.: An Estonian Morphological Analyser and the Impact of a Corpus on its Development. Computers and the Humanities 31, 115–133 (1997)
7. MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages
8. A Simple C++ Library for Maximum Entropy Classification
9. Morphologically Disambiguated Estonian Corpus
Non-canonical Inflection: Data, Formalisation and Complexity Measures

Benoît Sagot¹ and Géraldine Walther²

¹ ALPAGE, INRIA Paris–Rocquencourt & Université Paris 7
² Univ. Paris Diderot, Sorbonne Paris Cité, LLF, UMR 7110, 75013 Paris, France
Abstract. Non-canonical inflection (suppletion, deponency, heteroclisis, etc.) is extensively studied in theoretical approaches to morphology. However, these studies often lack practical implementations associated with large-scale lexica. Yet these are precisely the requirements for objective comparative studies on the complexity of morphological descriptions. We show how a model of inflectional morphology that can represent many non-canonical phenomena [67], together with a formalisation and an implementation thereof, can be used to evaluate the complexity of competing morphological descriptions. After illustrating the properties of the model with data on French, Latin, Italian, Persian and Sorani Kurdish verbs and on noun classes from Croatian and Slovak, we present experiments on the complexity of four competing descriptions of French verbal inflection. Complexity is evaluated using the information-theoretic concept of description length. We show that the new concepts introduced in the model by [67] make it possible to reduce the complexity of morphological descriptions w.r.t. both traditional and more recent models. Keywords: Inflectional Morphology, Description Complexity, MDL, Paradigm Shape, Canonicity, Inflection Zone, Stem Zone, Inflection Pattern, Stem Pattern.
1 Introduction

Automatically generating all forms of a language's inflectional paradigms is often considered a rather unchallenging task, since it has long been solved for most languages of interest to the area of natural language processing (NLP). On the other hand, there is much ongoing work in theoretical morphology within lexicalist approaches, and especially within Word and Paradigm related frameworks [46,72,1,60,26], on describing, modeling, and explaining inflection, and in particular non-canonical inflection phenomena. For example, the Surrey Morphology Group has been working on projects on Syncretism (1999–2002), Suppletion (2000–2003), Deponency (2004–2006) and Defectiveness (2006–2009). In 2003, G. G. Corbett published his first article on Canonical Typology [23], thereby laying the foundations for a theoretical approach aiming at capturing the discrepancy between regularity and irregularity in inflectional paradigms.
However, studies in theoretical morphology are sometimes limited by the lack of complete formalisations and large-scale implementations of the concepts they manipulate, both in terms of morphological and lexical coverage. Still, such resources are required for achieving qualitative assessments of the validity of a given approach, or for comparing the relevance of several morphological models describing a given language or a specific part of a language's morphology, including the information encoded in the lexicon. This direction of research points towards recent work aiming at measuring linguistic, and more specifically morphological, complexity [7]. Indeed, such work provides valuable insights into typological phenomena and properties of linguistic structures, and makes it possible to compare various linguistic descriptions with objective metrics (see Section 5).

In this paper, we follow a Word and Paradigm based view of morphology. We introduce metrics for measuring the complexity of a morphological description. The underlying idea is that inflectional complexity lies in the amount and distribution of inflectional irregularities. Irregularities can be represented as specific rules within the morphological grammar or as additional information within the lexicon. Hence our complexity metrics apply to both the morphological grammar and the information stored in the lexicon, thus allowing for a comparison of different competing descriptions in terms of descriptive complexity.

After a brief summary of related work in both computational and theoretical inflectional morphology (Section 2), we first recall the definitions of a large variety of non-canonical inflectional phenomena likely to cause increased descriptive complexity. These definitions are illustrated with data from French, Latin, Italian, Sorani Kurdish, Persian, Croatian, and Slovak (Section 3). In Section 4, we then present our formal model of inflectional morphology (the reader will find a complete formal presentation in [67]). We show how it covers all those non-canonical phenomena and allows for a formal representation of inflectional irregularity. Finally, in Section 5 we implement our model, putting a particular emphasis on French verbal inflection, which exhibits several of these phenomena and has received much renewed attention in the last few years. We show how the implementation of this formal model within the Alexina lexical framework [56] makes it possible to define an information-theoretic notion of complexity for morphological descriptions that includes both the model and the corresponding lexicon, based on description length. We assess the complexity of four different accounts of French verbal inflection. As a side result of this experiment, we also show that our formal model is not only able to encode previously proposed morphological descriptions [15,56] but also provides a way to write a description of French verbal inflection that has a lower complexity.
2 Related Work

2.1 Related Work in Computational Morphology

Within contemporary computational morphology, inflection is treated in rule-based, supervised and unsupervised methods, sometimes combined [62]. Among the first are: (i) stemming methods like the desuffixation algorithm by Porter [53]; (ii) bi-directional analysis and generation methods like Koskenniemi's Two-Level
Morphology [42], which uses transducers linking a deep (lexical) level to surface forms by applying systematic phonotactic transformations, as well as other finite-state approaches [9]; (iii) morphosemantic approaches, mostly for specialised corpora [54,45,50,21].

The second type of approach, namely supervised learning methods, relies on annotated learning corpora that define the expected output [16,59]. Finally, there are the unsupervised approaches, which can be used even for languages for which no preliminary description is available. These approaches rely on at least four different methods:
– Acquisition of morphological information through direct comparison of graphemes distributed over a given corpus. This has been done through edit distance measures [8], maximum affix recognition [37,31,71], word insertion tests [39,28,10], and analogies [43,35,49].
– Entropy models based on Harris' hypothesis [34]. They are mainly used for automatically detecting morph(eme) boundaries through entropy measures [39,10,58].
– Various types of probabilistic methods: Bayesian inference models [27], Bayesian hierarchy models [57] or probabilistic generative models [58].
– Segmentation methods relying on data compression [32,27] using the Minimum Description Length (MDL) principle [55]. The underlying idea is that morphology always tends to use the most compact encoding by relying on inflectional regularity (see Section 5 for more details). Such regularity is stated to show up in the distribution of stems vs. exponents [46], which should allow for a segmentation of corpora into the smallest possible morph(eme) units.

All these unsupervised methods can also be combined with rule-based models, as in Tepper and Xia [62], who define contextual rewriting rules which they apply to the results of an unsupervised analysis in order to account for allomorphy in English and Turkish. Although there appear to be many different methods that can be used in computational morphology, none of them explicitly tackles the question of complexity, regularity or canonicity per se.

2.2 Related Work in Theoretical Morphology

Morphological complexity, however, plays an important role in modern theoretical approaches to morphology. Within formal approaches to morphology, there are those who accept the existence of morphology and those who refuse it. The latter approaches are quite widespread and represented by Chomsky and Halle [22], Lieber [44] and Distributed Morphology, Halle and Marantz [33], while the former are illustrated by what is called the Word and Paradigm approach, e.g. Matthews, Aronoff and Stump [46,2,60]. Only in Word and Paradigm approaches, which are lexeme-based, does the question whether there is regularity within paradigms really matter. In this work, we consequently adopt a lexeme-based approach to morphology that vastly relies on Matthews' view of stems and exponents [46].

The question of regularity in morphology is not always specifically addressed. Even in lexeme-based approaches, some works do not give the notion of regularity any
theoretical status [60]. Yet, based on psycholinguistic evidence such as that presented by Pinker [51], there are modern approaches, illustrated for instance by Deriving inflectional irregularity by Bonami and Boyé [13], that treat irregularity as "a real grammatical phenomenon, that is manifest not only in psycholinguistic behaviour but also in language change and in synchronic grammar". Work on (ir-)regularity in lexeme-based approaches has also been done in the subfield of Canonical Typology, as presented by Corbett [23,25]. Non-canonical or irregular phenomena such as suppletion [19,13], deponency [6], heteroclisis [61], defectiveness [5] and, more recently, overabundance [64] have been studied within this approach, giving rise to quite a series of publications (existing work is mostly done on phonological data; our work focuses on written data for now, and we plan future work on comparing phonological and graphemic morphological complexity). However, these works have seldom explicitly targeted the development of descriptions that optimise their compactness. Indeed, in order to be able to evaluate a description's compactness, a large-scale implementation is required. This implementation has to rely on large-scale lexical resources covering (almost) all of the described language's relevant lexical items. It must also be able to implement the measured descriptions. A short state-of-the-art presentation of existing compactness measures is given in the introduction to Section 5.
3 Data on Non-canonical Inflection

Couching our work in a Word and Paradigm approach to morphology, we define a morphological description as the combination of a set of inflectable lexical entries and corresponding realisation rules realising specific morphosyntactic features. The result of all applied realisation rules are the paradigms of a language's lexemes. In order to assess the complexity of specific morphological descriptions, we start by identifying those phenomena that tend to increase paradigm complexity. These phenomena are the irregular, non-default cases. In terms of Canonical Typology, they are the non-canonical inflectional phenomena.

3.1 Canonical Inflection

The concept of canonical typology can be traced back to Corbett [23], in an attempt to better understand what exactly differs from a hypothetical ideal canonical stage in the different occurrences of non-canonical phenomena. In this approach, canonical inflection must not be mistaken for prototypical inflection. Canonical inflection is rare. It corresponds to an ideal stage, seldom met, but one that constitutes a purely theoretical space from which deviant phenomena can be formally distinguished [24]. Canonical inflection is a notion that concerns both the relation between the cells of a given lexeme's paradigm and the corresponding cells belonging to two different lexemes' paradigms. Canonical inflection is thus defined through the comparison of both the cells of one given lexeme and the lexemes themselves.
Table 1. Criteria for Canonical Inflection according to Corbett [24]

   Criterion                                        Comparison across cells of a lexeme   Comparison across lexemes
1  Composition/structure                            same                                  same
2  Lexical material (≈ shape of stem)               same                                  different
3  Inflectional material (≈ shape of inflection)    different                             same
4  Outcome (≈ shape of inflected word)              different                             different
Table 2. Additional criteria for Canonical Inflection

1  Feature expression   There is no "mismatch between form and function" [4].
2  Stem                 Each lexeme has exactly one stem that combines with a series of exponents.
3  Completeness         There exists exactly one form corresponding to the expression of a specific morphosyntactic feature structure.
4  Inflection class     All forms of a lexeme are built from one single inflection class.
We preliminarily consider an inflectional paradigm canonical if it satisfies the criteria given in table 1 [24]. To these criteria we add the ones in table 2, which further define canonical paradigm shape (among the additional criteria, criterion 2 derives directly from criterion 2 in [24], and criterion 4 can be seen as derived from criterion 3 in [24]). Deviation from these criteria leads to non-canonical paradigmatic properties. In this work, we present a representation of five types of non-canonical inflection phenomena, namely suppletion, deponency, heteroclisis, defectiveness and overabundance, within the inferential realisational model for inflectional morphology developed by Walther in [67] (this model does not include a separate formalisation of syncretism: syncretism is modeled as a combination of heteroclisis and deponency; for a complete discussion thereof, see [67]). In Section 5 we show the impact these phenomena can have on the complexity of morphological descriptions.

3.2 Stem Alternations/Suppletion

Suppletion comes in two types: stem suppletion and form suppletion [19]. Stem suppletion occurs whenever, inside a paradigm, the forms' exponents remain regular, but their stems vary. This is for example the case for the French verb aller 'to go' which, according to most descriptions, shows as many as four different stems, all-, v-, i- and aill-. Form suppletion corresponds to cases where a whole form is inserted in a paradigm cell that should canonically be filled by a certain stem and the exponent corresponding to this specific cell. Form suppletion is described in [11] for the French verb être 'to be' in the present indicative. For this verb, the 1st person plural form sommes, for example, does not show the regular 1st person plural exponent -ons that canonically appears in corresponding forms of other verbs (see table 3).
Table 3. Form suppletion in the present indicative paradigm of French être 'to be'

     Singular   Plural
P1   suis       sommes
P2   es         êtes
P3   est        sont

Table 4. Persian present and past stems

Lexeme     Translation   STEM 1   STEM 2
ârâstan    'to adorn'    ârâ      ârâst
âmuxtan    'to learn'    âmuz     âmuxt
raqsidan   'to dance'    raqs     raqsid
Suppletion can be more or less transparent in the sense that it can be regularly associated with variation in the feature structure of a given word. Thus, Iranian languages such as Persian show a stem alternation mainly related to tense: Persian uses a STEM 1 for the present tenses and a STEM 2 for the past tenses. STEM 2 is also used for the infinitives and the participle, while STEM 1 serves as a stem for imperative forms. According to traditional descriptions [29], Latin verbs also display three distinct stems that are linked to specific morphosyntactic features and subparts of the inflectional paradigms, namely present, past, and supine: amo 'I love', amāvī 'I loved', and amātum 'loved'. Yet the distribution of these stems does not follow strictly transparent feature-form associations. The third stem, for example, is associated with the passive past participle, but also with the active future participle and the finite passive perfective forms. There is no explicit morphosyntactic feature that appears to trigger the use of the third stem. Yet the distribution of the third stem is regular over all regular Latin verbs. These stems are thus morphomic in the sense of Aronoff [2].

Moreover, suppletion can be a more or less massive phenomenon. While the Latin data only concerns three different stems, French verbs show stem suppletion that extends to twelve different stems [12]. Bonami and Boyé [12] show that there are up to twelve different feature combinations that can trigger stem suppletion. They call these twelve combinations stem spaces. The stems belonging to the stem spaces are linked through stem dependency.

Yet, among those languages for which stem selection seems to be an expression of morphosyntactic features, such as the Iranian languages, further irregularity can still occur. Thus, Sorani Kurdish displays specific stem selection irregularities: like Persian, Sorani Kurdish has distinct stems for present and past tense forms (respectively STEM 1 and STEM 2). Usually passive stems are built from STEM 1; yet for some verbs, the passive uses STEM 2, while for a third type of verbs a specific passive stem is required [47,63,66] (see table 5). Such additional irregularities need to be captured before a corresponding morphological description's complexity can be measured.

3.3 Deponency

Croatian nouns sometimes use singular forms to express plural [3]. This "mismatch between form and function" is what, following Baerman [4], we name deponency. Nouns are inflected according to a number of different declension classes. Some classes that are relevant for our discussion are shown in table 6.
Table 5. Sorani Kurdish Irregular Passive Stems

Passive Stem Formation   Lexeme    STEM 1   Present Passive Stem   Translation
STEM 1                   KUŠTIN    kuš      kuš–rê                 'to kill'
STEM 2                   ÛTIN      l'ě      ût–rê                  'to say'
STEM 2                   BISTIN    bîe      bist–rê                'to hear'
STEM 1 minus end vowel   KIRDIN    ke       k–rê                   'to do', 'to make'
STEM 1 minus end vowel   DAN       de       d–rê                   'to give'
other                    XWARDIN   xo       xû–rê                  'to eat'
other                    GIRTIN    gir      gîr–rê                 'to take'
Table 6. Croatian noun declension

       (Fem.) a-stem žena 'woman'    (Fem.) i-stem stvar 'thing'
       Sg.       Pl.                 Sg.       Pl.
NOM    žen-a     žen-e               stvar     stvar-i
ACC    žen-u     žen-e               stvar     stvar-i
GEN    žen-e     žen-a               stvar-i   stvar-i
DAT    žen-i     žen-ama             stvar-i   stvar-ima
INS    žen-om    žen-ama             stvar-i   stvar-ima

Table 7. Croatian deponent noun declension

       neut. -et~a-stem dete 'child'    neut. -et~i-stem tele 'calf'
       Sg.       Pl.                    Sg.       Pl.
NOM    dete      deca                   tele      telad
ACC    dete      decu                   tele      telad
GEN    deteta    dece                   teleta    telad
DAT    detetu    deci                   teletu    teladi(ma)
INS    detetom   decom                  teletom   teladi(ma)
The nouns dete 'child' and tele 'calf' inflect according to the singular pattern of the A-STEM and I-STEM inflection classes, respectively. Using a singular inflection to express plural results in this mismatch between form and function.

The Surrey Morphology Group has collected a whole range of data on deponency phenomena in a large database. Even though we will see in Section 4 that our model would not retain all these examples as instances of deponency, this database constitutes an excellent general overview of deponency phenomena. The most often discussed example of deponency probably is that of the Latin deponent verbs, where active meaning is considered to be conveyed through passive morphology [41,36,4,25]. However, we shall give an alternative analysis of this particular data in Section 4, showing that these verbs actually are not instances of deponency but rather constitute a textbook example of heteroclisis [68].

3.4 Heteroclisis

Heteroclisis refers to the phenomenon where a lexeme's paradigm is built out of (at least) two otherwise separate inflection classes. Examples of heteroclisis are (some) Slovak animal nouns. In Slovak, most masculine animal nouns are inflected as masculine animate nouns in the singular, whereas they
Table 8. Heteroclisis in Slovak masculine animal names inflection

       Masculine animate       Masculine inanimate    Masculine heteroclite
       chlap 'boy'             dub 'oak'              orol 'eagle'
       Singular    Plural      Singular   Plural      Singular   Plural
NOM    chlap       chlap-i     dub        dub-y       orol       orl-y
GEN    chlap-a     chlap-ov    dub-a      dub-ov      orl-a      orl-ov
DAT    chlap-ovi   chlap-om    dub-u      dub-om      orl-ovi    orl-om
ACC    chlap-a     chlap-ov    dub        dub-y       orl-a      orl-y
LOC    chlap-ovi   chlap-och   dub-e      dub-och     orl-ovi    orl-och
INS    chlap-om    chlap-mi    dub-om     dub-mi      orl-om     orl-ami
may (and for some lexemes, must) inflect as masculine inanimate nouns in the plural (except in specific cases, such as personification, which triggers the animate inflection even for plural forms) [70]. Compare for example the inflection of chlap 'boy', dub 'oak' and orol 'eagle' in table 8 (both chlap and dub have a regular inflection: chlap belongs to the standard inflection class for masculine animate stems ending in a consonant, whereas dub belongs to the standard inflection class for masculine inanimate stems ending in what is called a hard or neutral consonant in the Slavic linguistic tradition).

3.5 Defectiveness

Defectiveness [5] refers to lexemes which display empty (missing) cells in their paradigm. Sometimes languages contain lexemes for which expected forms are simply non-existing; native speakers are not capable of building the corresponding forms. Whenever such forms are needed, they must be conveyed through forms belonging to a synonymous lexeme. This is for example what we can observe with activa tantum: transitive verbs that do not possess passive forms and must therefore borrow forms from synonyms. Examples thereof are the Latin verbs facere 'make' and perdere 'destroy', which have no passive morphology in the present tense. The missing passive forms are supplied by (different) active verbs, namely fieri 'become' and perire 'perish' [41]. These supply verbs are not just passives for the former ones, but also normal intransitives. Hence, they cannot be counted as part of the defective verbs' paradigms. They possess their own independent paradigms and constitute independent lexical entries. Another example is the class of nouns called pluralia tantum, which only exist in the plural, cf. English trousers, French vivres 'food supplies' or Slovak Vianoce 'Christmas'.

3.6 Overabundance

The obvious counterpart to defectiveness is the concept of overabundance. Overabundance occurs when cells of a paradigm contain more than one form. The notion was introduced by Thornton and is discussed in [64] for Italian. Canonical overabundance characterises the case where the cell mates of one given cell compete, without any morphological feature permitting the choice of one over the other. Table 9 shows examples thereof for Italian verbs.
Table 9. Overabundance in Italian [64]

                           cell mate 1   cell mate 2
'languish' 3PL.PRS.SUBJ    languano      languiscano
'possess'  3PL.PRS.SUBJ    possiedano    posseggano
'possess'  3SG.PRS.SUBJ    possieda      possegga
'possess'  1SG.PRS.SUBJ    possiedo      posseggo

Table 10. Overabundance in French asseoir 'to sit'

IND.PRES   Singular           Plural
P1         assois / assieds   assoyons / asseyons
P2         assois / assieds   assoyez / asseyez
P3         assoit / assied    assoient / asseyent

Table 11. Overabundance in French balayer 'to sweep'

IND.PRES   Singular            Plural
P1         balaye / balaie     balayons
P2         balayes / balaies   balayez
P3         balaye / balaie     balayent / balaient
In French, an example is given by the verb asseoir 'to sit', which has two different forms in most cells, as shown in table 10 (see for example [14] for a longer discussion thereof). All French verbs in -ayer also exhibit systematic overabundance (see table 11). Indeed, for some cells, these verbs may use two competing stems (in -ay- and in -ai-) and therefore have two different, morphologically equivalent inflected forms (although semantic, pragmatic, sociolinguistic and other constraints may interfere).
4 A Formal Model for Inflectional Morphology

4.1 Defining the Relevant Notions

Since the non-canonical phenomena described in Section 3 are precisely the irregularities that add complexity to the description of a lexeme's paradigm, we need a model capable of completely formalising the relevant irregularities. Only then can we use the formalised descriptions to measure their complexity with appropriate complexity metrics. We use the formal inferential realisational model for inflectional morphology described in [67]. In this model, a lexeme is considered w.r.t. its formal participation in the inflectional process. Thus, we do not consider any specific semantics or possible derivational properties. In other words, we are here interested in the behaviour of what Fradin and Kerleroux refer to as inflectemes [30], as opposed to lexemes, and for which a (very) simplified definition could be "a lexeme minus its semantic and argument-structural information". This model represents an inflecteme I as the association of five defining elements: (1) the set of morphosyntactic feature structures I can express, (2) the lexeme's morphosyntactic category, (3) a stem formation rule, (4) an inflection rule, (5) a transfer rule.
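To fix ideas, this five-element representation can be rendered as a small data structure. The following is only an illustrative sketch under our own naming assumptions, not the formalisation of [67]:

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Set, Tuple

# A feature structure as an immutable set of (feature, value) pairs,
# e.g. frozenset({("NUMBER", "plural"), ("PERSON", "3")})
FS = FrozenSet[Tuple[str, str]]

def identity(fs: FS) -> FS:
    """Canonical transfer rule: the identity function."""
    return fs

@dataclass
class Inflecteme:
    expressible: Set[FS]                        # (1) feature structures I can express
    category: str                               # (2) morphosyntactic category
    stem_rule: Callable[[], Dict[str, str]]     # (3) stem formation rule (stem zone -> stem)
    inflection_rule: Callable[[FS], Set[str]]   # (4) inflection rule (features -> forms)
    transfer_rule: Callable[[FS], FS] = identity  # (5) identity unless deponent

    def realise(self, fs: FS) -> Set[str]:
        # the transfer rule is applied before realisation (deponency hook)
        return self.inflection_rule(self.transfer_rule(fs))
```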
Defectiveness and Overabundance. In our model, categories are assigned to sets of inflectemes that canonically share sets of morphosyntactic features. Belonging to a specific category creates morphological expectation in the sense of Brown et al. [20] as to which features should be realised by independent forms. If these expectations are not met by an inflecteme's forms, this inflecteme is considered defective. Thus, defectiveness is defined for an inflecteme I as the property of not fulfilling its category-driven expectations: there is at least one morphosyntactic feature structure that should be expressed by the members of I's category for which no form is produced for I. Whenever more forms are generated than what is expected of a given inflecteme (given its membership in a certain category), this inflecteme is considered overabundant. Thus, defectiveness and overabundance occur whenever the inputs and outputs of an inflection rule f are not in a one-to-one correspondence.

Let us consider the French nominal inflecteme I of vivres 'food supplies' as an example. Concerning the feature NUMBER, French nouns are expected to express the set of feature-value pairs {NUMBER singular, NUMBER plural}. However, vivres produces a realisation for the feature structure {NUMBER plural} only. It is hence defective. Conversely, the Italian data in table 9 shows instances of overabundance. For example, the inflecteme of languire is such that the realisation associated with the feature structure {NUMBER plural, PERSON 3, TENSE present, MODE subj} produces two forms, languano and languiscano. (A sketch of this cell-by-cell classification is given below.)

Stem Selection and Suppletion: Stem Zones. The stem formation rule and the inflection rule are used for expressing the morphomic dimension of the inflectional paradigms belonging to a given lexeme. Hence, stem alternation in Latin can be represented through the existence of three different stem zones, i.e., sets of cells in which the stem realisation rules associated with expressible morphosyntactic features always produce one type of stem, as shown in tables 12 and 13 for the features listed in table 14. Suppletion can hence be associated with specific stem zones. Moreover, the model allows for expressing that a given inflecteme I is associated with specific stem zones through the notion of the inflecteme's stem pattern. In table 14, the active subparadigm's stem pattern comprises STEM 1, STEM 2 and STEM 3, while the passive subparadigm's stem pattern comprises only STEM 1 and STEM 3.

Form Realisation: Inflection Zones. In [67], an inflection class is defined as the default association of morphosyntactic features with form realisation rules that apply to stem zones of a given inflecteme. Just as we have defined stem zones, we then define an inflection zone as denoting the behaviour of a particular inflection class for a given set of cells. More precisely, each inflection class can be partitioned into inflection zones. In combination with stem zones, inflection zones allow for modeling situations in which, for example, a given set of exponents is applied to two different stems of the same inflecteme for expressing different morphosyntactic features: the same inflection zone will thus be involved twice in the same paradigm. As sketched above and shown in more detail below, inflection zones and stem zones allow for a novel analysis [68] of the so-called Latin deponent verbs [41,61,36].
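Under the same illustrative assumptions as the Inflecteme sketch above, the definitions of defectiveness and overabundance amount to a cell-by-cell cardinality check:

```python
def cell_status(inflecteme, expected):
    """For each feature structure the category expects, count the realisations:
    0 forms -> defective cell, 1 form -> canonical cell, >1 -> overabundant cell."""
    status = {}
    for fs in expected:
        n = len(inflecteme.realise(fs))
        status[fs] = ("defective" if n == 0
                      else "canonical" if n == 1
                      else "overabundant")
    return status
```

For vivres, the {NUMBER singular} cell would come out as defective; for languire, the 3rd person plural present subjunctive cell as overabundant.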
Table 12. A: Stem zones in the Latin active (sub-)paradigm [diagram: zones covering STEM 1, STEM 2 and STEM 3]

Table 13. B: Stem zones in the Latin passive (sub-)paradigm [diagram: zones covering STEM 1 and STEM 3]
Table 14. Morphomic feature association with Latin verb stems

Stem     Active subparadigm    Passive subparadigm
STEM 1   imperf. finite        imperf. finite
STEM 2   perf. finite          —
STEM 3   active future part.   passive past part., perf. finite (periphr.)
Deponency. Another non-canonical phenomenon that may occur is deponency. As said above, we follow Baerman [4] in defining deponency as a "mismatch between form and function". This mismatch occurs whenever the features to be expressed by an inflecteme do not match the features usually expressed by a specific realisation rule. This fact is captured by the notion of transfer rule, which takes as input a set of features to be expressed and outputs the set of features corresponding to the appropriate realisation rule. Canonically, the transfer rule is the identity function. An inflecteme is considered deponent whenever its transfer rule differs from the identity function. The Croatian data from table 6 can thus be modeled using the transfer rule. Recall that Croatian sometimes uses singular forms to express plural [3]. In our model, an inflecteme I functioning this way has a transfer rule T_I such that T_I({NUMBER plural}) = {NUMBER singular}.

Heteroclisis. Moreover, for Croatian deponent nouns, the inflection rule f outputs the zones in table 15 for the irregular nouns in table 7 (the representation shows that, in addition to being deponent, these Croatian nouns are also heteroclite: non-canonical behaviours can sometimes combine). A, B and C correspond to the three different inflection classes illustrated in table 7. The nouns dete 'child' and tele 'calf' each use exponents from two different inflection classes to build their paradigms. They are thus heteroclite.

A similar analysis can be made of the Latin "deponent verbs". Latin "deponent verbs" show morphologically passive ("m-passive") forms, but express active syntax ("s-active"). Therefore, they are usually considered instances of deponency in the sense of [4].
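A transfer rule in this sense is just a function on feature structures; for the Croatian nouns above it could be sketched as follows (the feature encoding is our own):

```python
def croatian_deponent_transfer(fs):
    """Transfer rule T_I for Croatian deponent nouns: plural cells are
    realised through the singular realisation rules; identity elsewhere."""
    fs = set(fs)
    if ("NUMBER", "plural") in fs:
        fs.discard(("NUMBER", "plural"))
        fs.add(("NUMBER", "singular"))
    return frozenset(fs)

# croatian_deponent_transfer(frozenset({("NUMBER", "plural"), ("CASE", "nom")}))
# -> frozenset({("NUMBER", "singular"), ("CASE", "nom")})
```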
Table 15. Croatian noun inflection zones for deponent lexemes

Inflection class   A: neuter -et-stem   B: (feminine) a-stem   C: (feminine) i-stem
dete 'child'       SG: zone_A,sg        PL: zone_B,sg          —
tele 'calf'        SG: zone_A,sg        —                      PL: zone_C,sg

Table 16. Zones in Class A [diagram: inflection zones A1, A2, A3 and A4]

Table 17. Zones in Class B [diagram: inflection zones B1, B2 and B3]

Table 18. Zone distribution in Latin verb inflection

Lexeme Type    m-active         m-passive
Actives        A1, A2, A3, A4   —
Passives       —                B1, B2, B3
Deponents      A3, A4           B1, B2, B3
Semi-dep. T1   A1, A3, A4       B2, B3
Semi-dep. T2   A2, A3, A4       B1, B3
On the basis that applying passive morphology to Latin verbs does not necessarily entail applying passive value, as shown in [41] (indeed, Kiparsky [41] shows that passive morphology can trigger many kinds of partly unpredictable semantic changes; this property is one of derivational morphology, and not of inflectional morphology, which is usually considered as being semantically predictable [17]), we consider that there are distinct inflection classes applying mainly to active vs. passive morphological forms ("m-passive"): changing a verb's inflection class is seen as a derivational process. Since there are distinct endings for m-active and m-passive, we claim that there must be distinct inflection rules, i.e., for every inflecteme, distinct pairings between specific morphosyntactic feature structures and inflection zones belonging to specific inflection classes; see tables 16 and 17. Based on our definition of inflection zones, deponent verbs can be analysed as heteroclite [68], most of their endings being retrieved through inflection zones belonging to a Class B, while the additional forms are retrieved from zones in a Class A (namely A3 for the active participles and A4 for the gerunds).

Given an inflecteme I, a pair formed by an inflection zone and a corresponding stem zone is called a subpattern. The complete inflection pattern of I, which consists of a set of subpatterns, allows for building all of I's inflected forms. For example, the inflection of a passive Latin verb is fully defined by the following set of subpatterns: B1+STEM 1, B2+STEM 3, B3+STEM 3. They constitute the inflection pattern of all such verbs.
Slovak animal nouns also show an instance of heteroclisis. As shown in table 8, the zones for building the singular forms of the noun orol 'eagle' are partitions of the animate inflection class, like those used for inflecting chlap 'boy', whereas the zones for the plural are retrieved from the inanimate inflection class, as for dub 'oak'.

Canonical Inflection. It follows from the above-described irregularities that canonical inflection corresponds to the case where
– an inflecteme I's inflection pattern and stem pattern consist of only one (inflection resp. stem) zone each,
– the inflection rules produce exactly one possible realisation for each morphosyntactic feature structure expressible by I's category,
– there is no mismatch between form and function, i.e. each exponent realised by a given realisation rule for a given morphosyntactic feature structure exactly corresponds to the morphosyntactic feature structure usually expressed in combination with this exponent.
These three conditions are rendered schematically in the sketch below.
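The following sketch assumes, beyond the earlier Inflecteme illustration, that an inflecteme also exposes its inflection_pattern and stem_pattern as collections of zones; these attribute names are our own, not part of the formalism of [67]:

```python
def is_canonical(inflecteme, expected):
    """Check the three canonicity conditions listed above."""
    # condition 1: a single inflection zone and a single stem zone
    single_zones = (len(inflecteme.inflection_pattern) == 1
                    and len(inflecteme.stem_pattern) == 1)
    # condition 2: exactly one realisation per expected feature structure
    one_form_per_cell = all(len(inflecteme.realise(fs)) == 1 for fs in expected)
    # condition 3: no form/function mismatch, i.e. an identity transfer rule
    no_mismatch = all(inflecteme.transfer_rule(fs) == fs for fs in expected)
    return single_zones and one_form_per_cell and no_mismatch
```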
5 Measuring the Complexity of Various Descriptions of French Verbal Inflection

We have shown how non-canonical inflectional phenomena can be encoded in the model of inflectional morphology described in [67], using new notions such as inflection and stem zones. They can be viewed as generalisations of Bonami and Boyé's [12] stem spaces (and before them [52]), which, in turn, correspond to stem patterns in this model. With such a formalism, various competing analyses of the same data can be designed, implemented, and therefore quantitatively evaluated with a suitable complexity measure. Not only does this provide a way to compare such analyses w.r.t. their complexity, but it is also a way to get insights into the relevance of these new notions, by examining whether they are used in analyses that have a lower complexity.

To answer these questions, we have developed and implemented a formalism capable of representing the model described in Section 4. The basis for our formalism is the morphological layer of the Alexina lexical formalism [56] used by several morphological (and, for some, syntactic) lexica. We have extended this formalism in order to allow it to deal with inflection zones, transfer rules, patterns and stem patterns. Next, we have encoded various competing morphological descriptions of French verbal inflection in this formalism, in order to assess the relevance of these newly introduced notions and to quantitatively compare these descriptions by means of the notion of complexity.

5.1 Descriptions of French Verbal Inflection

French verbal inflection is interesting in many well-known respects, some of which have been described above. First, it is a rich system that generates forms corresponding to up to 40 different morphosyntactic features. Second, and this is of particular relevance when trying to assess the complexity of morphological descriptions, it is traditionally described as having one regular and productive inflection class, the class of so-called
first-group verbs (verbs in -er), one irregular inflection class, that of third-group verbs, and the inflection class of second-group verbs (verbs like finir), which is sometimes considered regular, as in traditional grammars, and sometimes irregular. Analyses differ about the real productivity and regularity of this class [40,18,15], which is a first possible source of discrepancy between different accounts of French verbal inflection.

Among first-group verbs, as described in Section 3.6, verbs in -ayer exhibit (regular) overabundance. In [14], the authors consider them polyparadigmatic. This is not fully satisfactory given the fact that both supposed paradigms would share the same forms for half of the cells. Another way to represent this situation is to define two stems, one in -ay- and one in -ai-, and two inflection zones: one, ζ1, that will be used only by the -ay- stem, and one, ζ2, that will be used by both stems. Therefore, there would be three subpatterns within the (specific) inflection pattern for -ayer verbs: ζ1+-ay-, ζ2+-ai- and ζ2+-ay- (see the code sketch at the end of this subsection).

Modeling second-group verbs can also be achieved in different ways. Using Bonami and Boyé's [12] twelve-stem approach, these verbs can explicitly specify a secondary stem in -iss- in the lexicon, along with the base stem in -i (fini- vs. finiss- for finir 'finish'). The traditional (and widespread) way to represent this inflection class is to consider that it uses suffixes that begin in -ss- in certain cells. Obviously, this is not very satisfying. But as it happens, the cells for which second-group verbs use their secondary stem are exactly those which are covered by the zone ζ1 defined in the previous paragraph. Therefore, if one defines a unique inflection class for first- and second-group verbs, the same ζ1 and ζ2 can be used here as well, together with the following stem pattern: starting from the base stem in -i, the secondary stem can be obtained through the addition of ss, while the inflection pattern is defined by two subpatterns, namely ζ1+-iss- and ζ2+-i-. Note that this corroborates the empirically grounded analysis in [65].

As for third-group verbs, the only two approaches that we have considered are the traditional one, using many inflection classes, and the twelve-stem approach by Bonami and Boyé [12]. Representing the latter approach in our model can easily be achieved by modeling (default) stem dependencies within a stem pattern and specifying for each verb (only) those stems that differ from what can be regularly obtained using the defined stem pattern.

Starting from these considerations, we have developed four different descriptions of French inflection in the new version of Alexina that implements our morphological model, in order to measure their respective complexity.
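The following toy rendering shows how the three -ayer subpatterns produce both balaye-type and balaie-type forms. The zone contents are reduced to the present indicative, and all names are our own simplifications, not the actual Alexina encoding:

```python
# Exponents per cell, split into the two zones discussed above.
ZONE_1 = {"IND.PRS.P1.PL": "ons", "IND.PRS.P2.PL": "ez"}   # used with the -ay- stem only
ZONE_2 = {"IND.PRS.P1.SG": "e", "IND.PRS.P2.SG": "es",
          "IND.PRS.P3.SG": "e", "IND.PRS.P3.PL": "ent"}    # used with both stems

def ayer_present(citation_form):
    """Inflection pattern for -ayer verbs: z1+-ay-, z2+-ay-, z2+-ai-."""
    root = citation_form[:-4]                  # "balayer" -> "bal"
    stems = {"ay": root + "ay", "ai": root + "ai"}
    subpatterns = [(ZONE_1, "ay"), (ZONE_2, "ay"), (ZONE_2, "ai")]
    paradigm = {}
    for zone, stem in subpatterns:
        for cell, exponent in zone.items():
            paradigm.setdefault(cell, set()).add(stems[stem] + exponent)
    return paradigm

# ayer_present("balayer")["IND.PRS.P1.SG"] == {"balaye", "balaie"}  # overabundant
# ayer_present("balayer")["IND.PRS.P1.PL"] == {"balayons"}          # single form
```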
5.2 Quantifying and Measuring Morphological Complexity

The title of this section is borrowed from [7], whose first sections provide a brief but complete and detailed account of recent work on this topic.
Various metrics for measuring linguistic complexity can be found in the literature. The simplest ones simply count the occurrences of a handcrafted set of linguistic properties: size of various inventories (e.g., phonemes, categories, morph(eme) types, etc.) [48]. However, such approaches are intrinsically arbitrary: both the set of properties which are chosen and the criteria underlying the way these properties are described are very hard to define in a principled way (e.g., what would be a suitable objective and language-independent way to build an inventory of categories for any language?).

Alternative ways of measuring complexity rely on definitions of complexity that come from information-theoretic considerations. Two distinct definitions have been used in recent work, which apply to any kind of message and not only to linguistic descriptions or models: information entropy (or Shannon complexity), whose main drawback is that it requires encoding the message as a sequence of independent and identically distributed random variables according to a certain probabilistic model, which is difficult in practice; and algorithmic entropy (or Kolmogorov complexity), which is a more general and objective measure of the amount of information in a message, but which is not directly computable and has to be approximated.

The Kolmogorov complexity, because it is more general, is more appealing. It relies on the following intuitive idea: a model is more complex than another if it requires a longer message to be described. However, since its computation is not directly possible, one often reduces the problem to computing some kind of entropy within a particular space of possible models, by using an approximation of the Kolmogorov complexity that is defined over this model space: the result is called the description length w.r.t. the model. This is the basis of the paradigm called Minimum Description Length [55]. Therefore, computing an approximation of the Kolmogorov complexity of a linguistic description requires defining a way, as optimal as possible, to encode this description as a string (the "code"), and then a means of computing an approximation of the Kolmogorov complexity of that (coded) string [7].

Moreover, a linguistic model is often structured, contrarily to what studies involving morphological complexity sometimes assume. In particular, assessing the complexity of a representation of a morphological lexicon cannot be reduced to measuring the complexity of a corpus whose forms have been segmented into morphs — which is however the basis of pioneering work in automatic acquisition of morphological information [32]. In our case, we want to measure the complexity of a given description of (a given part of) the morphology of a particular language. This is to be contrasted with cross-linguistic comparative studies on morphological (or linguistic) complexity in general [48,38,7]: we do not want to estimate the complexity of a language, but that of particular descriptions of its morphological component, and, more specifically, of its lexical inflectional system.

The description length DL(m) of an unstructured message m within a model that decomposes it as a sequence of N symbols taken from an alphabet W = {w_1, ..., w_n} can be computed as:

DL(m) = − ∑_i o(w_i) log₂ (o(w_i) / N),

where o(w_i) is the number of occurrences of w_i in m. This description length is equal to N times the entropy of the message.
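This quantity is straightforward to compute; a minimal implementation of the formula (with the message given as a sequence of symbols) could read:

```python
import math
from collections import Counter

def description_length(message):
    """DL(m) = -sum_i o(w_i) * log2(o(w_i) / N), i.e. N times the
    empirical entropy of the symbol distribution of the message."""
    counts = Counter(message)
    n = len(message)
    return -sum(o * math.log2(o / n) for o in counts.values())

# description_length("aabb") == 4.0  (two equiprobable symbols, 1 bit each)
```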
In our case, the code cannot be that simple, as a morphological description is structured. First, as explained earlier, it is decomposed into the morphological lexicon
and the morphological model. In our formalism, we define a lexical entry as a citation form, an inflection pattern, an optional non-default stem pattern and an optional list of non-predictable stems (predictable stems need not be specified). As for the morphological model, it involves patterns and subpatterns, tables, zones and form formation rules, sandhi rules and other factorisation devices (see below for examples). We define sandhi rules as morphographemic and/or morphophonemic rules, already implemented in the Alexina formalism: they are local transformations that apply at the boundary between two morph(eme)s. For instance, in French verbal inflection, a stem ending in -g followed by a suffix in [aou]- is associated with a surface form in which an e is appended to the stem: mang_ons ↔ mange_ons.

We have designed a code that encodes all this structure in a bijective way (it can be non-ambiguously decoded) using symbols from 16 different alphabets (one for letters in citation forms, one for morphosyntactic tags, one for pattern ids, one for structural information within tables, and so on). As shown by preliminary experiments, the use of various alphabets leads to shorter descriptions, as measured by the following generalisation of the above-mentioned formula: if a message m is losslessly decomposed into a sequence of symbols taken from the union of p alphabets W^1, ..., W^p, with W^j = {w_1^j, ..., w_{n_j}^j} (i.e., the alphabet from which a given symbol is taken can be inferred deterministically from its left context), then we define its description length as:

DL(m) = − ∑_{j=1}^{p} ∑_{i=1}^{n_j} o(w_i^j) log₂ ( o(w_i^j) / N_j ),
where N_j is the number of symbols from alphabet W^j in m. Such a metric allows for approximating the complexity of a structured model, and for measuring the contribution of each alphabet to that complexity. This is the way we computed the complexity of various morphological descriptions of French verbal inflection in our model, both for evaluating the relevance of the newly introduced concepts (e.g., inflection zone, inflection pattern) and for comparing these competing morphological descriptions.
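In code, the generalisation only requires grouping symbols by the alphabet they are drawn from; here a message is assumed to be given as (alphabet_id, symbol) pairs:

```python
import math
from collections import Counter, defaultdict

def structured_description_length(message):
    """Multi-alphabet description length: the single-alphabet formula is
    applied per alphabet and the contributions are summed."""
    per_alphabet = defaultdict(list)
    for alphabet_id, symbol in message:
        per_alphabet[alphabet_id].append(symbol)
    total = 0.0
    for symbols in per_alphabet.values():
        n = len(symbols)
        counts = Counter(symbols)
        total -= sum(o * math.log2(o / n) for o in counts.values())
    return total
```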
5.3 Measuring the Complexity of Various Morphological Descriptions of French Verbal Inflection

We have described above a spectrum of possible descriptions that correspond to various ways of balancing richer morphological grammars against richer lexical specifications. We used the lexical information in the lexicon Lefff [56] for our experiments, limiting ourselves to verbs and ignoring multiple entries for the same lexeme (a given lexeme may have several sub-categorisation patterns and several meanings, and therefore several entries in the Lefff). The current version of the Lefff contains 7,820 verbs, among which 6,966 first-group verbs, 315 second-group verbs and 539 third-group verbs.

In our new version of Alexina's morphological layer, the morphological information associated with a lexical entry contains the following elements, illustrated by the example below:
– a citation form, typically the infinitive for French verbs;
– an inflection pattern followed by an optional pattern variant: if two patterns only differ on a few slots, they can be merged, and alternate realisation rules are specified for these slots and lexically triggered by these inflection pattern variants;
– optionally, a stem pattern (a non-specified stem pattern means that the default stem pattern associated with the pattern should be used);
– optionally, a list of stems (a non-specified stem means that the default stem should be used, as defined by the stem formation rules associated with the stem pattern).

For example, an entry such as "bouillir v23r /bouill,,bou" corresponds to an inflecteme with the citation form bouillir, the pattern v (with the pattern variant 23r), the default stem pattern associated with pattern v in the morphological grammar, as well as bouill as stem 1 and bou as stem 3 (all other stems follow the stem pattern).

Let us now briefly describe our four competing descriptions of French verbal inflection. The lexical entries for a small set of inflectemes in each of these descriptions are shown in table 19 for illustration purposes.

At one end of the spectrum, we automatically generated a "flat" morphological description, called FLAT, that uses no stems, no sandhi and no zones, in the following way. The longest common substring shared by all inflected forms of each lexeme is identified, and the remainder of each form is considered a "suffix"; the list of all suffixes is then ordered w.r.t. the corresponding morphosyntactic tags, thus creating a signature. Finally, all lexemes that share the same signature are considered as belonging to the same inflection class, which is trivially built from the signature and the ordered list of tags (a code sketch of this procedure follows table 19). The resulting description has 139 inflection classes. Its description length, measured as explained above, is around 131,400 bits: 9,200 bits in the lexicon (here and in all subsequent figures, the description length of citation forms is not taken into account, as it is the same for all descriptions) and 122,200 bits in the morphological grammar.
Table 19. Lexical entries for a small set of inflectemes in each of our four competing descriptions of French verbal inflection

Citation form   Flat   Orig          New                        BoBo
aimer           v1     v-erstd       v-er                       v1
acheter         v18    v-erstd       v-er                       v1
jeter           v8     v-erdbl       v-eter                     v1 /,jett,,,,,,jett
balayer         v12    v-ayer        v-ayer                     v-ayer1
finir           v2     v-ir2         v-ir2                      v23r
requérir        v42    v-ir3         v-ir3                      v23r /requér,requier,,,,,,requer,requi,requis
cueillir        v51    v-assaillir   v23r /cueill,,,,,,cueill   v23r /cueill,,,,,,cueill
prendre         v24    v-prendre     v-prendre                  v3re
mettre          v17    v-mettre      v-mettre                   v3re /,,met,,,,,,mi,mis
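The FLAT construction described above (longest common substring, suffix signature, one class per signature) can be sketched as follows; treating the remainder of each form as a single replacement slot is our simplification of the procedure:

```python
def flat_signature(paradigm):
    """paradigm: list of (tag, form) pairs for one lexeme.
    Returns a FLAT-style signature: the remainders of all forms once their
    longest common substring is abstracted away, ordered by tag."""
    forms = [form for _, form in paradigm]
    first = forms[0]
    substrings = {first[i:j] for i in range(len(first))
                  for j in range(i + 1, len(first) + 1)}
    common = max((s for s in substrings if all(s in f for f in forms)),
                 key=len, default="")
    return tuple(sorted((tag, form.replace(common, "*", 1))
                        for tag, form in paradigm))

# Lexemes sharing a signature share an inflection class:
# classes = {}
# for lexeme, paradigm in lexicon.items():
#     classes.setdefault(flat_signature(paradigm), []).append(lexeme)
```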
At the other end of the spectrum lies Bonami and Boyé's [12,15] analysis, which uses only one inflection class and twelve stems. We started from a preliminary DATR implementation of this model (Bonami, p.c.). Because this analysis was designed on
phonemes, we had to apply certain transformations to enable the encoding of graphemic inflection, including by introducing sandhi operations. In order to correctly generate all overabundant forms, we extended it in several ways. The result is a description, called BOBO, that contains only one inflection class, several patterns (4 for non-defective verbs, including the "v" and "v-ayer" patterns found in table 19, and a few more for defective ones) and 61 sandhi rules (one of these rules, for example, handles the letter at the end of some of the stems, depending on its environment). This description strongly relies on an important feature of our Alexina implementation of the model described in this paper, already mentioned above: any underspecified piece of information is filled in by defaults (not specifying a given stem in a lexical entry leads to using the stem formation table for generating it; not specifying a stem formation table for a given lexical entry leads to using the default stem formation table associated with its inflection pattern, and, if none is specified, to considering that there is only one stem that applies to all forms, and so on). BOBO's description length is around 52,000 bits: 46,600 bits in the lexicon, which is particularly high and is caused by all the explicitly specified stems, and 5,400 bits in the morphological grammar, which is very low, as expected.

Between these two extremes, the original description ORIG used by the Lefff, which heavily relies on sandhis but uses a lot of inflection classes for third-group forms, has a description length of 83,000 bits (8,100 bits in the lexicon, 74,800 bits in the grammar). More interestingly, as mentioned above, using the notion of inflection zone and relying on a reasonable number of sandhi rules, we were able to develop a more satisfactory morphological description for French verbs, named NEW, which uses 20 inflection classes (including one for first-group verbs without overabundance, one for first-group verbs in -ayer and one for second-group verbs). The corresponding description length, 35,800 bits, is lower than that of BOBO. It corresponds to 20,100 bits in the lexicon (about twice as much as in FLAT, but less than half of BOBO) and 15,700 bits in the morphological grammar (roughly three times more than in BOBO, but about an eighth of FLAT).
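As an illustration of what such a sandhi rule does, here is a toy version of the g/ge rule mentioned in Section 5.2 (mang_ons ↔ mange_ons); the function name and signature are ours:

```python
def apply_g_sandhi(stem, suffix):
    """Local transformation at the morph boundary: insert e between a
    stem-final g and a suffix starting with a, o or u."""
    if stem.endswith("g") and suffix[:1] in ("a", "o", "u"):
        return stem + "e" + suffix
    return stem + suffix

# apply_g_sandhi("mang", "ons") == "mangeons"
```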
Table 20. Description length of various accounts of French verbal morphology

Description name                                       Flat      Orig     New      BoBo
Length of the morphological grammar (bits)             122,200   74,800   15,700   5,400
Length of the lexicon (bits)                           9,200     8,100    20,100   46,600
Total length of the morphological description (bits)   131,400   83,000   35,800   52,000
All these figures for our four descriptions, ordered according to the above-mentioned continuum, are summarised in table 20 and displayed graphically in figure 1. They make it visible that using the notion of inflection zone, thus generalising the notion of stem space (which in our model corresponds to the notion of stem pattern), leads to accounts of French verbal morphology that constitute a shorter coding of the same information than the three other descriptions, both traditional ones and more original and recent ones [15]. Note that this conclusion would have been different if the description length of the lexicon had not been taken into account.
Fig. 1. Visualisation of the description length of various accounts of French verbal morphology
However, as the balance between including more information in the lexicon and more in the morphological model depends on the morphological description, it would make little sense to evaluate the description length of the morphological model only.
6 Conclusion

In this paper, we have addressed the question of measuring the complexity of morphological descriptions using the information-theoretic concept of description length. We have applied our method to four competing descriptions of French verbal inflection. Since descriptive complexity arises from inflectional irregularities, we have couched our descriptions in the formal inferential realisational model developed in [67]. This model, which relies on new notions such as inflection zones and stem zones, allows for modeling a wide range of non-canonical inflectional phenomena, such as suppletion, deponency, heteroclisis, defectiveness and overabundance. We have developed four descriptions of French verbal inflection in this model and implemented them in the Alexina [56] morphological framework. We have also designed an information-theoretic way to assess the complexity of a morphological description in this model. Our work shows that using information-theoretic concepts to assess description complexity is indeed feasible and relevant for comparing competing descriptions. Moreover, quantitative results on our four different descriptions have shown that the traditional way of describing French verbal inflection using many inflection classes, as well as a more recent and radically different proposal [15], can both be outperformed in
terms of low complexity by using notions such as inflection zones, stem patterns and inflection patterns, in order to find a better balance between the amount of morphological information that is encoded in the lexicon and in the morphological rules.
References

1. Anderson, S.R.: A-morphous Morphology. Cambridge University Press, Cambridge (1992)
2. Aronoff, M.: Morphology by Itself. MIT Press, Cambridge (1994)
3. Baerman, M.: Deponency in Serbo-Croatian. Online Database (2006)
4. Baerman, M.: Morphological Typology of Deponency. In: Baerman, M., Corbett, G.G., Brown, D., Hippisley, A. (eds.) Deponency and Morphological Mismatches, vol. 145, pp. 1–19. The British Academy, Oxford University Press (2007)
5. Baerman, M., Corbett, G.G., Brown, D.: Defective Paradigms: Missing Forms and What They Tell Us. Proceedings of the British Academy 145. Oxford University Press, Oxford (2010)
6. Baerman, M., Corbett, G.G., Brown, D., Hippisley, A. (eds.): Deponency and Morphological Mismatches. Oxford University Press, Oxford (2007)
7. Bane, M.: Quantifying and measuring morphological complexity. In: Chang, C.B., Haynie, H.J. (eds.) Proceedings of the 26th West Coast Conference on Formal Linguistics, Somerville, USA (2008)
8. Baroni, M., Matiasek, J., Trost, H.: Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In: Proceedings of the ACL Workshop on Morphological and Phonological Learning, pp. 48–57 (2002)
9. Beesley, K.R., Karttunen, L.: Finite State Morphology. CSLI, Stanford (2003)
10. Bernhard, D.: Apprentissage non supervisé de familles morphologiques par classification ascendante hiérarchique. In: Proceedings of TALN 2007, Toulouse, France, pp. 367–376 (2007)
11. Bonami, O., Boyé, G.: Suppletion and dependency in inflectional morphology. In: Eynde, F.V., Hellan, L., Beerman, D. (eds.) Proceedings of the HPSG 2001 Conference. CSLI Publications, Stanford (2002)
12. Bonami, O., Boyé, G.: Supplétion et classes flexionnelles dans la conjugaison du français. Langages 152, 102–126 (2003)
13. Bonami, O., Boyé, G.: Deriving inflectional irregularity. In: Proceedings of the 13th International Conference on HPSG, pp. 39–59. CSLI Publications, Stanford (2006)
14. Bonami, O., Boyé, G.: La morphologie flexionnelle est-elle une fonction? In: Choi-Jonin, I., Duval, M., Soutet, O. (eds.) Typologie et comparatisme. Hommages offerts à Alain Lemaréchal, pp. 21–35. Peeters, Leuven, Belgium (2010)
15. Bonami, O., Boyé, G., Giraudo, H., Voga, M.: Quels verbes sont réguliers en français? In: Actes du premier Congrès Mondial de Linguistique Française, pp. 1511–1523 (2008)
16. van den Bosch, A., Daelemans, W.: Memory-based morphological analysis. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 285–292 (1999)
17. Boyé, G.: Régularité et classes flexionnelles dans la conjugaison du français. In: Roché, M., Boyé, G., Hathout, N., Lignon, S., Plénat, M. (eds.) Des unités morphologiques au lexique. Hermes Science (2011)
18. Boyé, G.: Problèmes de morpho-phonologie verbale en français, espagnol et italien. Ph.D. thesis, Université Paris 7 (2000)
19. Boyé, G.: Suppletion. In: Brown, K. (ed.) Encyclopedia of Language and Linguistics, 2nd edn., vol. 12, pp. 297–299. Elsevier, Oxford (2006)
20. Brown, D., Chumakina, M., Corbett, G.G., Popova, G., Spencer, A.: Defining 'periphrasis': key notions (2011); under editorial review
21. Cartoni, B.: Lexical morphology in machine translation: A feasibility study. In: Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), Athens, Greece, pp. 130–138 (March 2009)
22. Chomsky, N., Halle, M.: The Sound Pattern of English. Harper and Row (1968)
23. Corbett, G.G.: Agreement: the range of the phenomenon and the principles of the Surrey Database of Agreement. Transactions of the Philological Society 101, 155–202 (2003)
24. Corbett, G.G.: Canonical typology, suppletion and possible words. Language 83, 8–42 (2007)
25. Corbett, G.G.: Deponency, Syncretism, and What Lies Between. In: Baerman, M., Corbett, G.G., Brown, D., Hippisley, A. (eds.) Deponency and Morphological Mismatches, vol. 145, pp. 21–43. The British Academy, Oxford University Press (2007)
26. Corbett, G.G., Fraser, N.: Network Morphology: a DATR account of Russian nominal inflection. Journal of Linguistics 29, 113–142 (1993)
27. Creutz, M., Lagus, K.: Unsupervised discovery of morphemes. In: Proceedings of the Workshop on Morphological and Phonological Learning of ACL 2002, pp. 21–30 (2002)
28. Demberg, V.: A language-independent unsupervised model for morphological segmentation. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, Prague, Czech Republic, pp. 920–927 (June 2007)
29. Ernout, A., Thomas, F.: Syntaxe Latine, 2nd edn. Klincksieck, Paris (1953)
30. Fradin, B., Kerleroux, F.: Troubles with lexemes. In: Booij, G., DeCesaris, J., Ralli, A., Scalise, S. (eds.) Selected Papers from the Third Mediterranean Morphology Meeting. Topics in Morphology, pp. 177–196. IULA-Universitat Pompeu Fabra, Barcelona (2003)
31. Gaussier, E.: Unsupervised learning of derivational morphology from inflectional lexicons. In: Proceedings of the Workshop on Unsupervised Methods in Natural Language Processing, University of Maryland (1999)
32. Goldsmith, J.: Unsupervised learning of the morphology of a natural language. Computational Linguistics 27(2), 153–198 (2001)
33. Halle, M., Marantz, A.: Distributed morphology and the pieces of inflection. In: Hale, K., Keyser, S.J. (eds.) The View from Building 20, pp. 111–176. MIT Press, Cambridge (1993)
34. Harris, Z.S.: From phoneme to morpheme. Language 31(2), 190–222 (1955)
35. Hathout, N.: From WordNet to CELEX: acquiring morphological links from dictionaries of synonyms. In: Proceedings of the Third International Conference on Language Resources and Evaluation, pp. 1478–1484. Las Palmas de Gran Canaria, Spain (2002)
36. Hippisley, A.: Declarative Deponency: A Network Morphology Account of Morphological Mismatches. In: Baerman, M., Corbett, G.G., Brown, D., Hippisley, A. (eds.) Deponency and Morphological Mismatches, vol. 145, pp. 145–173. The British Academy, Oxford University Press (2007)
37. Jacquemin, C.: Guessing morphology from terms and corpora. In: Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 156–165 (1997)
38. Juola, P.: Assessing linguistic complexity. In: Miestamo, M., Sinnemäki, K., Karlsson, F. (eds.) Language Complexity: Typology, Contact, Change. John Benjamins, Amsterdam (2008)
39. Keshava, S.: A simpler, intuitive approach to morpheme induction. In: PASCAL Challenge Workshop on Unsupervised Segmentation of Words into Morphemes, pp. 31–35 (2006)
40. Kilani-Schoch, M., Dressler, W.U.: Morphologie naturelle et flexion du verbe français. Gunter Narr Verlag, Tübingen (2005)
41. Kiparsky, P.: Blocking and periphrasis in inflectional paradigms. In: Yearbook of Morphology 2004, pp. 113–135. Springer, Dordrecht (2005)
42. Koskenniemi, K.: A general computational model for word-form recognition and production. In: Proceedings of the 22nd Annual Meeting of the Association for Computational Linguistics, pp. 178–181 (1984)
43. Lepage, Y.: Solving analogies on words: an algorithm. In: Proceedings of the 17th International Conference on Computational Linguistics, pp. 728–734 (1998)
44. Lieber, R.: Deconstructing Morphology: Word Formation in Syntactic Theory. University of Chicago Press, Chicago (1992)
45. Lovis, C., Michel, P.A., Baud, R., Scherrer, J.R.: Word segmentation processing: A way to exponentially extend medical dictionaries. In: Greenes, R.A., Peterson, H.E., Protti, D.J. (eds.) Proceedings of the 8th World Congress on Medical Informatics, pp. 28–32 (1995)
46. Matthews, P.H.: Morphology. Cambridge University Press, Cambridge (1974)
47. McCarus, E.N.: A Kurdish Grammar: descriptive analysis of the Kurdish of Sulaimaniya, Iraq. Ph.D. thesis, American Council of Learned Societies, New York, USA (1958)
48. McWhorter, J.: The world's simplest grammars are creole grammars. Linguistic Typology 5, 125–166 (2001)
49. Moreau, F., Claveau, V., Sébillot, P.: Automatic morphological query expansion using analogy-based machine learning. In: Amati, G., Carpineto, C., Romano, G. (eds.) ECIR 2007. LNCS, vol. 4425, pp. 222–233. Springer, Heidelberg (2007)
50. Namer, F.: Morphologie, lexique et TAL: l'analyseur DériF. In: TIC et Sciences Cognitives. Hermes Science Publishing, London (2009)
51. Pinker, S.: Words and Rules. Basic Books, New York (1999)
52. Pirrelli, V., Battista, M.: The paradigmatic dimension of stem allomorphy in Italian verb inflection. Italian Journal of Linguistics, 307–380 (2000)
53. Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
54. Pratt, A.W., Pacak, M.G.: Automated processing of medical English. In: Proceedings of the 1969 Conference on Computational Linguistics, pp. 1–23 (1969)
55. Rissanen, J.: Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory 30(4), 629–636 (1984)
56. Sagot, B.: The Lefff, a freely available, accurate and large-coverage lexicon for French. In: Proceedings of the 7th Language Resources and Evaluation Conference, Valletta, Malta (2010)
57. Snyder, B., Barzilay, R.: Unsupervised multilingual learning for morphological segmentation. In: Proceedings of ACL 2008, Columbus, USA, pp. 737–745 (June 2008)
58. Spiegler, S., Golenia, B., Flach, P.: Promodes: A probabilistic generative model for word decomposition. In: Peters, C., Di Nunzio, G.M., Kurimo, M., Mostefa, D., Penas, A., Roda, G. (eds.) CLEF 2009. LNCS, vol. 6241. Springer, Heidelberg (2010)
59. Stroppa, N., Yvon, F.: Du quatrième de proportion comme principe inductif: une proposition et son application à l'apprentissage de la morphologie. Traitement Automatique des Langues 47, 33–59 (2006)
60. Stump, G.T.: Inflectional Morphology: A Theory of Paradigm Structure. Cambridge University Press, Cambridge (2001)
61. Stump, G.T.: Heteroclisis and paradigm linkage. Language 82, 279–322 (2006)
62. Tepper, M., Xia, F.: Inducing morphemes using light knowledge. Journal of ACM Transactions on Asian Language Information Processing (TALIP) 9(3), 1–38 (2010)
63. Thackston, W.M.: Sorani Kurdish: A Reference Grammar with Selected Readings (2006) (published online)
64. Thornton, A.M.: Towards a typology of overabundance. Presented at the Décembrettes 7, Toulouse, France (December 2010)
65. Tribout, D.: Les conversions de nom à verbe et de verbe à nom en français. Ph.D. thesis, Université Paris Diderot – Paris 7 (2010)
66. Walther, G.: A derivational account for Sorani Kurdish passives. Presentation at the 4th International Conference on Iranian Linguistics (ICIL4), June 17–19, Uppsala, Sweden (2011)
67. Walther, G.: An inferential realisational model for inflectional morphology. Linguistica 52: Internal and External Boundaries of Morphology (2011) (accepted)
68. Walther, G.: Latin passive morphology revisited. Presentation at the 2011 Meeting of the Linguistic Association of Great Britain (LAGB 2011), September 7–10, Manchester, UK (2011)
69. Xanthos, A.: Apprentissage automatique de la morphologie — Le cas des structures racine-schème. Sciences pour la Communication 48. Peter Lang (2008)
70. Zauner, A.: Praktická príručka slovenského pravopisu. Vydavateľstvo Osveta, Martin, Slovakia (1973)
71. Zweigenbaum, P., Grabar, N.: Liens morphologiques et structuration de terminologie. In: Actes de IC 2000: Ingénierie des Connaissances, pp. 325–334 (2000)
72. Zwicky, A.M.: How to describe inflection. In: Niepokuj, M., Clay, M.V., Nikiforidou, V., Feder, D. (eds.) Proceedings of the Eleventh Annual Meeting of the Berkeley Linguistics Society, pp. 372–386. Berkeley Linguistics Society (1985)
A User-Oriented Approach to Evaluation and Documentation of a Morphological Analyzer

Gertrud Faaß

Institut für maschinelle Sprachverarbeitung, Universität Stuttgart, Azenbergstraße 12, D-70174 Stuttgart, Germany
Abstract. This article describes a user-oriented approach to evaluating and extensively documenting a morphological analyzer with a view to the normative descriptions of ISO and EAGLES. While current state-of-the-art work in this field often describes task-based evaluation, our users (supposedly rather NLP non-experts, anonymously using the tool as part of a webservice) expect an extensive documentation of the tool itself, of the testsuite that was used to validate it, and of the results of the validation process. ISO and EAGLES offer a good starting point when attempting to find the attributes that are to be evaluated. The documentation introduced in this article describes the analyzer in a way comparable to others by defining its features as attribute-value pairs (encoded in DocBook XML). Furthermore, the evaluation itself and its results are described. All documentation and the created testsuites are available online and free to use.

Keywords: German, documentation, evaluation, validation, verification, eHumanities, ISO 9126, EAGLES, morphological analyzer.
1 Introduction

User-oriented evaluation methodologies for NLP systems have been developed since the early 1980s, when Larry R. Harris [11] (quoted in [27, p. 39]) stated the following:

Those of us on this panel and other researchers in the field, simply don't have the right to determine whether a system is practical. Only the users of such a system can make that determination. Only a user can decide whether the NL capability constitutes sufficient added value to be deemed practical. Only a user can decide if the system's frequency of inappropriate response is sufficiently low to be deemed practical. Only a user can decide whether the overall NL interaction, taken in toto, offers enough benefits over alternative formal interactions to be deemed practical.

Since EAGLES [6] took up the challenges and solutions described by ISO 9126 (which is concerned with the formal evaluation of software in general), a number of important issues have been described. However, concerning tools that annotate linguistic information, such as morphological analyzers or parsers, these issues seem to be ignored in some cases.
Such systems today are typically evaluated by comparing their output to a gold standard; their quality is measured in these terms, and different systems are compared to each other on how well they match what was expected. The yearly Morpho Challenge (e.g., [16] or [17]) or CoNLL1 are examples of such evaluations. The criteria for developing such gold standards, on the other hand, seem to be neglected nowadays, for an interested user usually does not find documentation describing the (linguistic) principles and methods by which they have been produced. Therefore – at least for NLP systems performing part-of-speech tagging, morphological analysis, parsing or other annotation tasks – Thompson's requirement that "the standards of NL systems performance will be set by the users" [27, p. 42] has not yet become a reality2. So far, this practice has not hindered the use of these systems, because they were usually utilized by NLP experts. However, other challenges for the usability of such tools are on the horizon: for a few years, there have been initiatives (e.g., CLARIN3) to make NLP tools available to non-experts who plan to use NLP tools and methodology to do empirical research in their fields, e.g. in philology or history (see the eAqua project4). Some such users might already be familiar with NLP tools; however, there are historians and philologists who cannot be expected to learn about BLEU scores and other statistical measures or to interpret the results of task-based evaluations such as the CoNLL shared tasks. Such users rather require detailed documentation about the issues they are interested in, such as the quality with which their input text can be analysed (on the basis of their requirements) and the granularity at which the analysis can take place. Other questions may concern whether the functionality of the tool can be manipulated by the user, or simply how to run it. When a sufficient description is offered, the decision whether a tool fulfils specific, yet unknown, needs is at least facilitated. Whenever several similar tools are available to such a user, their descriptions should be kept similar as well, so that comparison is possible. It should also not be necessary to download and install the software; therefore, online processing (also of rather large amounts of data, as offered, e.g., by the service "WebLicht", see [13]) should additionally be made possible5. This article describes a user-oriented methodology for the design and documentation of a testsuite for morphological analysis as a case in point. It also depicts a validation procedure for the respective software and a user-friendly documentation thereof. We do not assume that a morphological analyzer of (modern) German is the tool such researchers are most interested in; however, the described efforts may serve as an example of how the requirements described above can be taken into account before evaluating NLP software that is to be made available to NLP non-experts. In addition,
We explicitly exclude dialogue and language generation systems. The requirements mentioned here have been discussed in detail by researchers and lecturers of the humanities (inter alia linguists, historians and philologists) at the CLARIN BBAW Berlin workshops "eHumanities und Sprachressourcen – die Benutzerperspektive(n)" on 17 January 2011 and "Sprachressourcen in der Lehre" on 18 January 2011, both in Berlin and both organized by the University of Gießen.
this article thereby also presents an approach to the evaluation of a rule-based morphological analyzer. The following sections review earlier developments in evaluation in some detail, describing the traditional construction of test material and the requirements for its documentation, in order to give the background (section 2.1) and to position our approach with respect to the state of the art. We then introduce SMOR [24] in section 5 as a case in point and describe how this morphological analyzer has been evaluated and documented (from section 6 onwards). Lastly, open issues are discussed.
2 Background

2.1 ISO 9126 and EAGLES EAG-EWG-PR.2

The starting point for our review is the ISO 9126 standard [14, Version 2.1.1], from which EAG-EWG-PR.2 [6] develops an extended version (2.1.2) specifically for NLP software. The ISO 9126 standard (2.1.1) refers to the evaluation of software products in general. Figure 1 (adapted from the UsabilityNet6 pages) gives an overview of the year 2000 version of this standard, ISO 9126-1. The quality model shown in figure 1 is divided into six characteristics; EAGLES' EAG-EWG-PR.2 concentrates on two of them, Functionality and Usability. Functionality is defined as the capability to "satisfy stated and implied needs" [6, p. 11], of which the "implied needs" are to be identified and defined in environments other than contractual ones7. Usability is to be evaluated from the perspectives of a set of users. This set consists of the direct and indirect users of the software, i.e. operators and the receivers of outputs, but also managers, who "may be more interested in the overall quality rather than a specific quality characteristic" [6, p. 11].8 Usability also covers aspects like efficiency and effectiveness. All characteristics of evaluation may be broken down into quality attributes, which may be quantified. This leads to several sets of attribute-value pairs, some describing the developers' expectations, others the expectations of different users. For example, the stated needs of the functionality attribute could define suitability issues like "covering inflection and/or derivation and/or compounding"; accuracy issues could be a minimum recall of 98% when analysing word forms from a modern newspaper text; implied needs could include a minimum recall of 80% when analysing word forms from a web corpus, etc.

2.2 Comparable Evaluations of NLP Systems

Especially for NLP systems, user needs usually have to be implied, as only the systems' characteristics are described in their documentation, in terms of the functions that their developers have assumed to be necessary (a task-based definition of a system).
6 A European Union project providing usability and user-centred design resources to practitioners, managers and EU projects.
7 A reference to ISO 8402:1986, note 1, is given at this point; however, this standard (on quality vocabulary) has been withdrawn. The latest revised version is ISO 9000:2005 on quality management systems – Fundamentals and vocabulary.
8 Nigel Bevan [5, abstract] refers to these qualities as "internal quality (static properties of the code), external quality (behaviour of the software when it is executed) and quality in use (the extent to which the software meets the needs of the user)".
Fig. 1. Overview of ISO 9126-1: quality in use (effectiveness, productivity, safety, satisfaction); functionality (accuracy, suitability, interoperability, security); usability (understandability, learnability, operability, attractiveness); reliability (maturity, fault tolerance, recoverability, availability); efficiency (time behavior, resource utilization); maintainability (analyzability, changeability, stability, testability); portability (adaptability, installability, co-existence, replaceability)
In other words, (potential) users of such systems may have to be interviewed before their attributes can be defined. The scopes of operation can differ between basically comparable systems: for example, a mere inflectional analysis performed by one morphological system might indeed be what the developers wanted to achieve, while other systems are supposed to deliver word structures analysed at a finer granularity. To enable such comparisons, the respective attribute-value pairs (e.g. "granularity" = "inflection", "derivation", "composition") are to be defined9.
9 The resulting different analyses can, however, still be compared automatically, see [26].
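As an illustration of such attribute-value comparisons, the sketch below encodes two hypothetical system descriptions and one hypothetical user profile; all attribute names, values and thresholds are invented for this example and are not taken from EAGLES or from any actual system.

```python
# Hypothetical attribute-value pairs for two systems and one user profile.
system_a = {"granularity": {"inflection"},
            "min_recall_newspaper": 0.98}
system_b = {"granularity": {"inflection", "derivation", "composition"},
            "min_recall_newspaper": 0.95}
user_needs = {"granularity": {"inflection", "derivation"},
              "min_recall_newspaper": 0.90}

def satisfies(system, needs):
    """Check a system description against user needs, attribute by attribute."""
    return (needs["granularity"] <= system["granularity"]       # subset test
            and system["min_recall_newspaper"] >= needs["min_recall_newspaper"])

print(satisfies(system_a, user_needs))  # False: no derivational analysis offered
print(satisfies(system_b, user_needs))  # True
```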
description will consist of attribute-value pairs containing probable users’, or user groups’ wishes and needs. The two perspectives may conflict at some stage. The range of legal values that may result from testing a system should be clearly defined beforehand – and independently of a single system. Testing methods may range from fully automated tests to manual inspection of (at least partial) processing results/outputs. One should be aware of the fact that some of the evaluation methods proposed by EAGLES cannot be performed automatically. However, when taking into account that the basic idea behind such a semi-automatic evaluation is that of parametrisation, an objective judgement of an NLP system is indeed possible. The Final Report suggests evaluation of systems from spelling checkers to Machine translation systems [6, p. 9ff], therefore covering a rather wide range of typical NLP tools.
3 Developments

This section focuses on some of the developments after the recommendations described above had been published.

3.1 Terminology Developments

Since the EAGLES final report, a more specific terminology for evaluation has been developed, helping to refine its criteria. Concerning the term Evaluation itself, Underwood's definition [28], referring to [25,20], is still valid: "... the user and the context of use are central in evaluating an NLP resource. Thus a resource is evaluated with respect to how it can be successfully used to fulfil specific tasks within a specific setup". Concerning the "functionality" aspect of evaluation, on the other hand, the terms Verification and Validation are now in use, defined by [10,3] as follows:

Verification – the process of ensuring 1) that the intelligent system conforms to specifications, and 2) that its knowledge base is consistent and complete within itself.

Validation – the process of ensuring that the outputs of the intelligent system are equivalent to those of human experts when given the same inputs.

3.2 Application of the Terms Verification, Validation and Evaluation

An attempt to "update" the EAGLES final report accordingly could hence read as follows:

– Verification describes several tasks, some of which will be performed by the tool developer: testing the system according to its design criteria and documenting the criteria and the respective tests – and their results – for interested users (see figure 1)10:
• Accuracy can be tested on the basis of a testsuite, i.e. a set of "gold" analyses, for which a number of design criteria should be taken into account (see section 6.1);

10 A description of security issues would go beyond the scope of the present article.
• Suitability describes the appropriateness of a set of functions for specified tasks (as described by ISO 9126:1991, A.2.1.1);
• Interoperability should document accepted input formats and possible outputs (see section 7.1), conformity to international (e.g. ISO) standards, and formats for interchange.
– Validation describes the process of developing guidelines, valid for both the tool and the human expert, and the testing of the equivalence of their outputs (testing the "stated needs"; such guidelines are described in section 7.3).
– Evaluation serves to describe and to verify to what extent the users' requirements (see section 7.4) are met by the software.

3.3 Re-usability and Sustainability

Additionally, building NLP resources is expensive; therefore, aspects of re-usability and sustainability come into play when designing their documentation: nowadays, any resource developed, be it an NLP tool, a corpus or even a testsuite, should be documented as richly as possible to make sure that future users of the resource will know exactly its design, its strengths and its weaknesses, which will enable them to judge whether the resource is usable for their purposes (which are currently unknown to the developers). Another aspect comes into play when offering tools as part of web services: whenever such a tool is updated, the previous version should still be made available, especially for users who need to re-produce outputs (for example, if the tool is used for teaching purposes). Any such resource developed today should moreover be made available on the web whenever possible, and its metadata should be harvested (e.g., by the CLARIN Virtual Language Observatory (VLO)11) and sustained on independent servers together with its documentation.
4 Validation and Evaluation Methodologies

It must be stated clearly that an article on the evaluation of a symbolic morphological analyzer cannot cover all kinds of methodologies; instead, one can only pick some of those that seem relevant for the purpose. Methods that focus on metrics allowing for the automatic comparison of morphological analyzers are neglected here (as one of the latest developments, see Spiegler and Monson [26]). For other, more general aspects of NLP evaluation, see Spärck-Jones and Galliers [25]; for a critical review of the latest developments in HLT evaluation, see Belz [4]. Section 4.1 describes TSNLP [18], which develops a methodology to generate test suites for parsers: a testsuite for morphological analysis could be generated in a similar way. The methods of Underwood (as described in [28]) are mentioned in section 4.2, examining validation methods for electronic lexicons that aim at satisfying the needs of a range of users whose requirements are not known. Lastly, section 4.3 describes evaluation at work – how important criteria in a task-oriented evaluation of morphological analyzers can successfully be applied.
4.1 Test Suites for NLP (TSNLP) – A Methodology to Evaluate Syntactic Analyzers

The EAGLES report [6, p. 49] and also Spärck-Jones and Galliers [25] mention inter alia the TSNLP project [18], in which a methodology and a tool to generate test suites for syntactic analyzers were developed. Some of the statements on the content of testsuites examined by [18] are also of interest for evaluating tools other than parsers and are hence quoted here:

– Size and domain of vocabulary and phenomena are to be defined (controlled test set);
– tests should progress from simple to more complex phenomena (progressivity);
– the test suite should not only contain well-formed but also ill-formed items; the depth of coverage of both is to be predefined (systematicity).

4.2 A Validation Methodology for NLP Lexicons

SMOR has only recently been introduced as a web service; therefore, experiences of users new to NLP have not yet been reported. So how can users' requirements be met if they are not known yet? Underwood [28] develops a methodology for validating NLP lexicons. Referring to Spärck-Jones and Galliers [25] and in line with the EAGLES standard on evaluation, she states that validation and evaluation have in common "that the user and the context of use are central in evaluating an NLP resource". From the system's perspective, many different variants of lexicons exist, not only in terms of their size and subject domain, but also in terms of the depth of description of their contents. Hence validation should examine whether the lexicon contains the information that it claims to contain. Turning to the user's perspective on NLP lexicon evaluation, Underwood [28] argues that there are too many unknown factors when trying to determine it: "the general validation methodology cannot prescribe hard and fast user requirements which must be met." Therefore, any validation should also provide information to potential users, who can then select the most appropriate lexicon. We hence interpret [28] to mean that extensive documentation of SMOR may at least give (potential) users the means to decide whether the tool meets their requirements.
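Looking back at the TSNLP criteria listed in section 4.1 (controlled test set, progressivity, systematicity), the sketch below shows what testsuite entries honouring these criteria might look like. The German items, phenomenon labels and complexity levels are invented for illustration and are not taken from TSNLP or from the testsuite described later in this article.

```python
# Hypothetical testsuite entries: (item, phenomenon, complexity, well_formed).
# Systematicity: ill-formed items are included by design; progressivity:
# the suite is processed in order of increasing complexity.
testsuite = [
    ("Haus",    "simplex noun",                   1, True),
    ("Häuser",  "umlaut plural",                  2, True),
    ("ging",    "irregular past of gehen",        2, True),
    ("gehte",   "over-regularised past of gehen", 2, False),
    ("Haustür", "noun-noun compound",             3, True),
]

for item, phenomenon, level, ok in sorted(testsuite, key=lambda e: e[2]):
    status = "well-formed" if ok else "ill-formed"
    print(f"level {level}: {item!r} – {phenomenon} ({status})")
```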
4.3 An Evaluation of Morphological Analyzers for a Real-World Application

An example of an evaluation according to the criteria described above can be found in [19]: Mahlow and Piotrowski describe their criteria for the selection of morphological analyzers and their experiences with the availability and installation of the tools that were chosen according to these criteria. The system requirements, the experimental setup, the processing of the tests and the evaluation results are described in such a way that other users gain a good knowledge of the evaluated tools.
5 The Stuttgart Morphological Analyzer (SMOR)

5.1 The Tool and Its Knowledge Basis

SMOR [24] is a finite-state morphology that was implemented with the SFST tools12 (see [23]), covering inflection, derivation, and compounding. The morphological analyzer for German utilizes the IMSLex lexicon13 (see [9]). In its present state (May 2011), it contains a rather comprehensive list of 47,671 German base stems, 528 compounding and 1,691 derivation stems14, as well as 321 prefixes and 133 suffixes. Each morphological unit is labeled with features, which are described in [22]. The features of stem entries specify the part of speech, the stem type (base, derivation, or compounding stem), the origin (native, foreign, and several subtypes of neoclassical stems), as well as the inflection class in the case of base stems. Entries are utilized via selection restrictions. Suffix entries encode the part of speech, stem type (derivation or compounding), and origin of the derivational basis. The word form resulting from suffixation is described by its part of speech, stem type, origin, word-form class (simplex, prefix derivation, suffix derivation), and inflection class (in the case of a base stem). Morphophonological rules (here, for space reasons, we refer to [21] for more details) are applied to generate the correct surface forms. The tagset currently contains about 572 tags (including tags only contained in the lexicon, i.e. not appearing in analyses).
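To make the feature labelling just described more tangible, the sketch below renders two stem entries as simple feature structures. This is purely illustrative: the field names mirror the prose description (part of speech, stem type, origin, inflection class), not the actual IMSLex encoding, and the inflection class name is invented.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StemEntry:
    surface: str
    pos: str                    # part of speech
    stem_type: str              # "base", "derivation", or "compounding"
    origin: str                 # "native", "foreign", or a neoclassical subtype
    inflection_class: Optional[str] = None  # only specified for base stems

entries = [
    StemEntry("Haus", pos="NN", stem_type="base", origin="native",
              inflection_class="NNeut_es_Haeuser"),  # invented class name
    StemEntry("infiz", pos="V", stem_type="derivation", origin="neoclassical"),
]
for entry in entries:
    print(entry)
```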
5.2 WebLicht – A Prototypical Webservice for Open Use

SMOR is offered for online use as part of the webservice "WebLicht", supplied by the German D-SPIN project15, the German section of CLARIN16. This webservice is conceived to be used by researchers and lecturers in the (e-)Humanities, making NLP tools and resources available to them. So far, the tools are solely documented from a developer's perspective, with the typical metadata on accuracy and suitability in a published article. Our project aims at evaluating and documenting the software for linguists and especially philologists who are not accustomed to such tools but plan to make use of them in their research and teaching. It is planned to put up user questionnaires on the web service pages with the goal of adapting the documentation and tool(s) to user needs.

14 Additional stems are generated automatically.
15 See [1].

6 Methods

6.1 Designing a Testsuite

There are a number of questions to be taken into account when developing a testsuite, and all answers are usually trade-offs between the approach of representativeness and the objective of keeping the necessary effort low.
The list of word forms to be used for testing a morphological analyzer should on the one hand be selected randomly; on the other hand, one must make sure that a number of different challenges in the morphological analysis of German will be tested (regular and irregular forms, neoclassical forms, syncretisms as results of word-formation processes, umlaut and ablaut17 derivations, stem changes, etc.). The project team therefore decided to select word forms that appear at least with medium frequency in a big corpus. Another design decision was made taking into account that SMOR has been developed over a number of years and that it can nowadays be expected to correctly analyse all words belonging to the closed, i.e. non-productive, classes, such as function words. For a new tool, a testsuite should also contain some of these forms; for SMOR, evaluating the three productive classes of common nouns, verbs and adjectives was deemed sufficient. Lastly, an important design decision was based on Thompson [27] – "negative results are as important as positive ones" – and on the design criterion of systematicity proposed by TSNLP (see section 4.1): the selection of the testsuite was done fully at random. 1,000 word forms per class were collected with a script, only by their parts of speech and with the minimum requirement that each appear at least with medium frequency, from the 880-million-token corpus SdeWaC, a partially cleaned web corpus based on the deWaC corpus [2]. SdeWaC contains 1,985,291 noun candidates, 11,142,701 (finite main) verb candidates, and 303,197 adjective candidates at the calculated medium frequencies (6 occurrences for nouns, 11 for verbs, and 8 for adjectives), see [8]. By selecting the word forms at random, we cannot call our testsuite a controlled test set as proposed by TSNLP; however, by extensively documenting each form afterwards using linguistic categories, a controlled test set containing specific features can be selected and tested separately. As the features documented are of different complexity, it is also possible to select entries of the testsuite progressively (see section 7.1). Concerning the actual morphological analysis of some words, linguists can formulate a wide variety of perspectives. This is especially the case for German, with its rich morphology: here, designing a testsuite, i.e. deciding on default ("gold") analyses, can only be the result of discussion, surveys and compromise. The word Auskunft 'disclosure', for example, has etymologically developed from a nominalization of the particle verb (her-)auskommen '[to] come out'. However, the nominalization process deriving a noun consisting of particle+kunft from the verb consisting of particle+kommen is not productive anymore, though a variety of such nouns exist (Herkunft 'origin', Unterkunft 'accommodation', Zusammenkunft 'gathering', to name just the most frequent ones in SdeWaC). Nowadays, a German native speaker would not see Auskunft as a nominalization, but rather as an opaque common noun. The noun Ankunft 'arrival', on the other hand, is usually identified as a nominalization of the particle verb ankommen '[to] arrive'; therefore, a derivational analysis of this word can be expected.
17 Some verbs, when being nominalized, have their stems changed, e.g. betreiben '[to] operate' can be nominalized as Betrieb 'company'. In the documentation, all stem changes other than umlaut were labelled ablaut, e.g. the nominalization Gesandte 'ambassador', derived from the past participle form of the verb senden '[to] send'.
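The random sampling step described in section 6.1 can be pictured as follows. The input format, the POS labels and the function itself are assumptions made for this sketch; the minimum-frequency thresholds, however, are the calculated medium frequencies reported in the text (6 for nouns, 11 for finite main verbs, 8 for adjectives).

```python
import random

# Calculated medium frequencies from the text; the POS labels are assumed.
MIN_FREQ = {"noun": 6, "verb": 11, "adjective": 8}

def sample_candidates(freq_list, pos, n=1000, seed=0):
    """freq_list: iterable of (word_form, pos_label, corpus_frequency)."""
    pool = [w for w, p, f in freq_list if p == pos and f >= MIN_FREQ[pos]]
    random.seed(seed)
    return random.sample(pool, min(n, len(pool)))

# Toy stand-in for the SdeWaC frequency list:
toy = [("Auskunft", "noun", 812), ("Ankunft", "noun", 640), ("Xyz", "noun", 2)]
print(sample_candidates(toy, "noun", n=2))  # "Xyz" is filtered out (freq < 6)
```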
It should be clear to the designers of the testsuite and to its users that such decisions are usually made on a "common ground" basis and have no general validity whatsoever. Therefore, it was decided to label word forms that could have been finely analysed in terms of derivational and compounding processes, but were taken into the testsuite only with a coarse analysis, as opake Analyse bevorzugt 'opaque analysis preferred': such labels enable users to find these analyses in the testsuite and to adapt the testsuite entries for these word forms to their own linguistic perspective.

6.2 The Design of Linguistic Category Tables to Describe Nouns, Verbs and Adjectives

Figure 2 on page 56 shows a screenshot of the documentation, describing the chosen noun properties and their possible values. The category Allgemeines 'general information' contains an entry describing the property Ablaut. Secondly, it describes whether the respective noun or – in the case of a compound – the head of the noun is a neoclassical form (neokl. Wortbildung (Kopf)) and whether the noun contains part(s) that is/are to be analysed as opaque (opake Analyse bevorzugt). The second main category, Wortbildung 'word formation', describes whether the noun is a compound and/or the result of a derivation. For compound nouns, the number of stems and linking elements are listed. In the case of compounding, the non-head can be neoclassical, too; these are entered in the column neokl. Nichtkopf. The sub-category derivation contains detailed prefixing and suffixing information. As the testsuite may also be used for testing another morphological analyzer that produces hierarchically structured word forms, a first assumption of such a hierarchy is given for every word form (column Annahme über den Wortbildungsvorgang). Verbs are described in four categories: word formation (German, neoclassical or foreign word formation), derivation (e.g. whether the word-form basis was of another part of speech than verb, or whether the word form is a particle or a prefix verb, etc.), and inflectional information, e.g. whether the verb is a past participle form (Partizip 2); lastly, the (rare) cases of composition can be marked. The adjective descriptions contain information similar to that described for nouns; however, categories describing comparison forms are added. As adjectives are often derived from verbs and from nouns, these base forms are not described any further in the adjective category table, but are linked with the respective entries in the noun and verb category tables. Some verbs and some adjectives found in the respective lists turned out to be syncretic forms, e.g. past participle forms like verglichen 'compared', or derivations like verliebt 'in love'. For these, a separate category table (on the basis of the verbs' table) and a separate testsuite were generated. In this testsuite, analyses for both categories, verb and adjective, are contained. The category tables for all word forms that are part of the testsuite can be downloaded in different formats18.
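The category tables can be thought of as rows of attribute-value pairs per word form. The two hypothetical rows below use the German column names introduced above and encode the Auskunft/Ankunft contrast from section 6.1; the identifiers and values are illustrative and not copied from the actual tables.

```python
# Hypothetical category-table rows in the spirit of figure 2.
rows = [
    {"Nr.": "NN-001", "Wortform": "Ankunft",
     "Ablaut": True,                       # kommen -> -kunft
     "neokl. Wortbildung (Kopf)": False,
     "opake Analyse bevorzugt": False,     # analysed as a nominalization
     "Annahme über den Wortbildungsvorgang": "an+kommen -> Ankunft"},
    {"Nr.": "NN-002", "Wortform": "Auskunft",
     "Ablaut": True,
     "neokl. Wortbildung (Kopf)": False,
     "opake Analyse bevorzugt": True},     # treated as an opaque common noun
]

for row in rows:
    kind = "opaque" if row["opake Analyse bevorzugt"] else "derivational"
    print(f'{row["Nr."]} {row["Wortform"]}: {kind} analysis preferred')
```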
Fig. 2. Noun categories described
7 Results and Documentation

7.1 Designing a Template for Documentation

We make use of DocBook XML19 because XML structures in general fulfil the requirements of EAGLES: the attribute-value pairs utilized for evaluation and documentation can easily be managed and organised hierarchically (see the examples in [7]). The DocBook schema was designed for writing documentation and is widely used. With DocBook XML, it is possible to utilize existing scripts to produce HTML pages that can be put onto the web immediately. A further advantage is that some of the structures, e.g. the category tables describing the word forms contained in the testsuite, can be produced as .csv files and transformed automatically to DocBook format with shell scripts. The documentation produced for our purposes is divided into three parts: the first part documents the software itself, the second a testsuite on which a frozen version of the tool was validated, and the third the results of this validation effort (see also [7]).

Documentation of the Software. The documentation of the software contains information on the background of its development, and it links to publications and (related) projects. Such information, together with information on availability and licence issues, is planned to be contained in the foreseen documentation of NaLiDa20; therefore, to avoid duplicate work, only a link to these documents will be included in the future. For an interested user, however, other issues are also of significance: the online documentation should contain descriptions of possible outputs in the form of typical examples. All labels that the tool can assign to word forms are explained. To be in line with the EAGLES final report, which requires feature structures for all fields of evaluation, the documentation is to be interpreted as attribute-value pairs of the form "case X leads to analysis/analyses Y". Table 1, taken from the online documentation, describes the word form infizieren '[to] infect' as a case in point, demonstrating how a neoclassical verb is analysed by SMOR. The analyses contain the base stem (infiz), which is categorized as verbal (V), and a derivational suffix (ier<SUFF>). This suffix is verbal, therefore the word form as a whole is identified as a verb, which is marked by a plus (+V). The remaining fields in table 1 show inflectional information: the word form is syncretic, as it describes the 1st and the 3rd person plural present tense (1-Pl-Pres, 3-Pl-Pres), each as indicative (Ind) and subjunctive (Subj) form; the word form is also identified as an infinitive (Inf). The second part of the table (headed Flexion plus Lemma-Information) describes the alternative analysis that SMOR offers: here, only the inflectional information is described and, additionally, a lemma is given21.

21 Note that currently (May 2011), the alternative testsuite format has not yet been put online. However, the respective work is in progress; check for updates.
Table 1. Example analyses of SMOR: infizieren

Komposition, Derivation und Flexion
Basis   Derivationsaffix   Tag (+Wortart)   Person   Numerus   Tempus   Modus
infiz   ier<SUFF>          <+V>             <1>      <Pl>      <Pres>   <Subj>
infiz   ier<SUFF>          <+V>             <1>      <Pl>      <Pres>   <Ind>
infiz   ier<SUFF>          <+V>             <Inf>
infiz   ier<SUFF>          <+V>             <3>      <Pl>      <Pres>   <Subj>
infiz   ier<SUFF>          <+V>             <3>      <Pl>      <Pres>   <Ind>

Flexion plus Lemma-Information
infizieren <+V><1><Pl><Pres><Subj>
infizieren <+V><1><Pl><Pres><Ind>
infizieren <+V><Inf>
infizieren <+V><3><Pl><Pres><Subj>
infizieren <+V><3><Pl><Pres><Ind>
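The analysis strings shown in table 1 follow a simple surface notation – morph segments interleaved with tags in angle brackets – which can be taken apart mechanically. The helper below is written against this notation as printed in the paper, not against SMOR's actual programming interface.

```python
import re

def parse_analysis(analysis: str):
    """Split an analysis string into surface segments and <...> tags."""
    tokens = re.findall(r"<[^<>]+>|[^<>]+", analysis)
    segments = [t for t in tokens if not t.startswith("<")]
    tags = [t.strip("<>") for t in tokens if t.startswith("<")]
    return segments, tags

print(parse_analysis("infizieren<+V><1><Pl><Pres><Ind>"))
# (['infizieren'], ['+V', '1', 'Pl', 'Pres', 'Ind'])
print(parse_analysis("ier<SUFF>"))
# (['ier'], ['SUFF'])
```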
Another important issue is informing the user that there are other methods of morphological analysis than the one used by the system currently documented: SMOR, for example, delivers a flat sequence of morphemes, while other systems, e.g. Canoo22, show hierarchical word structures. SMOR uses a descriptive approach, while it is indeed possible to also describe word forms from an etymological perspective. Lastly, SMOR (as part of the WebLicht services by D-SPIN23) offers a full analysis, i.e. derivational, compounding and inflectional information, or, as an alternative, lemma and inflectional information. Other systems might only offer the latter, etc. We foresee categories/output formats not offered by the tool documented here in order to ease the re-use of the documentation template for tools performing similar tasks. In the online documentation, the features and their possible values are described in "Tabelle 1.3" (see table 2); the bold face entries show the features of SMOR: its Art der Analyse 'type of analysis' is descriptive and not etymological, and it outputs a list of elements instead of their structure. The tool can perform analysis on two levels: one describing composition, derivation and inflection in different granularities, the other resulting in lemma and inflectional information only.

Documentation of the Testsuite. In section 6.1, we have defined the criteria utilized for the design of the testsuite. Especially with respect to morphological analyses, the views of linguists are rather heterogeneous; therefore, when publishing a testsuite, its documentation must describe these criteria and reflect the guidelines that the human evaluators have developed and used, see, e.g., for nouns, the chapter Nomina: Annotationen auf Wortebene 'nouns: annotations on word level'. Figure 3 on page 60 shows the first page of the documentation for nouns, according to the category table shown in figure 2, which is described in section 6.2 above.
Table 2. Tabelle 1.3 Software: Art der Analyse

Attribut                   Wert1                                 Wert2
Art der Analyse            deskriptiv                            etymologisch
Art der Ausgabe            Liste                                 Struktur
Granularität der Ausgabe   Komposition, Derivation und Flexion   Flexion plus Lemma-Information
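Section 7.1 mentions that the .csv category tables are converted to DocBook with shell scripts; the sketch below illustrates the same idea in Python using only the standard library. The DocBook elements (informaltable, tgroup, tbody, row, entry) follow the CALS table model of the DocBook schema, but the exact markup produced by the project's own scripts may differ.

```python
import xml.etree.ElementTree as ET

header = ("Attribut", "Wert1", "Wert2")
rows = [
    ("Art der Analyse", "deskriptiv", "etymologisch"),
    ("Art der Ausgabe", "Liste", "Struktur"),
    ("Granularität der Ausgabe",
     "Komposition, Derivation und Flexion",
     "Flexion plus Lemma-Information"),
]

# Build a minimal DocBook CALS table from the attribute-value rows.
table = ET.Element("informaltable")
tgroup = ET.SubElement(table, "tgroup", cols=str(len(header)))
tbody = ET.SubElement(tgroup, "tbody")
for record in [header] + rows:
    row = ET.SubElement(tbody, "row")
    for cell in record:
        ET.SubElement(row, "entry").text = cell

print(ET.tostring(table, encoding="unicode"))
```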
The first column (Nr.) gives the unique number of the word form; under this number, the analyses can be found in the testsuite, of which figure 4 shows some examples.

Documentation of the Validation Results. In this chapter of the documentation, the metrics used for the validation of the tool are explained. Concerning false negatives and positives, we expect users to want to know about the shortcomings of the tool, too. Especially new users of such tools, after having read a conference article, may expect "perfection" and will turn away rather disappointed after finding that 98.5% accuracy and 99% recall still mean that, in an arbitrary test set of 10,000 words, a few hundred words will be analysed wrongly or not at all. Here, it was deemed more informative to also list the word forms for which the tool failed to produce a correct analysis – or any analysis at all. Such honesty enables users to recognize similar cases in which the tool can be expected to fail – in other words, its realistic application possibilities within their framework.

7.2 Calculating Recall and Precision

As word forms are the input to morphological analyzers and as recall is intended to measure coverage, we deemed it best to calculate recall on the basis of two attributes: True Positives (word forms that were to be analysed and that were analysed by the system) and False Negatives (word forms that were to be analysed but that were not analysed by the system). A first run with a frozen version of SMOR, however, revealed that part-of-speech tagging performed on a web corpus is worse than usual: such taggers are trained on "clean" data. SMOR did – rightfully – not analyse several hundred forms that had been labeled as "verbs", "nouns" or "adjectives"; these contained typos or were foreign words. Especially the label "NN" (common noun) proved to be problematic, as many named entities had been labeled as such. There were also cases where verbs had been labeled as nouns or vice versa (these were deleted from the lists as well). Of the total of 3,000 word forms, the following candidates remained as testsuite candidates to be documented in full:

– 726 nouns,
– 520 verbs,
– 763 adjectives,
– 315 verb and adjective syncretisms (40 forms appeared in the adjectives' list, 275 in the verbs' list).
Fig. 3. Description of the testsuite for nouns (excerpt)
Fig. 4. Description of the testsuite for nouns (excerpt)
Table 3. All SMOR analyses of the word Verkaufsgespräch

No.   Analysis
1     verkaufen<SUFF>Gespräch<+NN><Sg>
2     verkaufen<SUFF>Gespräch<+NN><Sg>
3     verkaufen<SUFF>Gespräch<+NN><Sg>
4     verkaufen<SUFF>Gespräch<+NN><Sg>
5     verkaufen<SUFF>Gespräch<+NN><Sg>
6     verkaufen<SUFF>Gespräch<+NN><Sg>
7     VerkaufGespräch<+NN><Sg>
8     VerkaufGespräch<+NN><Sg>
9     VerkaufGespräch<+NN><Sg>

Table 4. All SMOR analyses of the word Morgentau

No.   Analysis
1     MorgenTau<+NN><Masc><Sg>
2     MorgenTau<+NN><Masc><Sg>
3     MorgenTau<+NN><Masc><Sg>
4     MorgenTau<+NN><Sg>
5     MorgenTau<+NN><Sg>
6     MorgenTau<+NN><Sg>
The human evaluators had a difficult task: on the one hand, SMOR's typical, rather overgenerating way of producing correct analyses of different granularity is a feature that the developer had planned, i.e. in calculating precision, these analyses could not be categorized as wrong (see the tables above for examples) – on the other hand, the guidelines that had been developed stated that only analyses of the finest granularity were to be accepted. Therefore, in calculating the accuracy, all analyses that were deemed correct and required from a developer's point of view were counted as correct; however, only the ones with the finest granularity were taken into the testsuite. Table 3 shows the tool's output for word no. 19, Verkaufsgespräch 'sales conversation': these analyses were all deemed correct by the assessors; analyses 1 to 6 were categorised as KD (meaning compounding and derivation) and taken over into the testsuite as the finest possible analysis, and analyses 7 to 9 as K for compounding only. Keeping and marking these coarser analyses enables us to also build testsuites containing compounding only at a later stage, in case this is deemed necessary. Another case is shown in table 4, containing the SMOR analyses for word no. NN-047, Morgentau 'morning dew'. Analyses no. 1 to no. 3 were deemed correct by the assessors (and taken into the testsuite) because here the head noun is correctly identified as the masculine Tau 'dew', while analyses no. 4 to no. 6 determine the head as the neuter noun Tau 'rope' and are therefore deemed wrong. The latter were categorized as -1 and counted as false positives when calculating precision.

Recall/Precision. Table 5 shows the results for recall on the basis of word forms, calculated after the complete sets had been categorized and after the respective testsuites had been produced.
Table 5. Recall on the basis of word forms

POS          Total   Not processed   Not processed   Deleted   Processed                    Recall
                     (true)          (false)                   (testsuite)
Nouns        1,000   219             13              55        726                          98.2%
Verbs        1,000   149             2               56        795 (520 V + 275 V+ADJ)      99.7%
Adjectives   1,000   146             32              51        803 (763 ADJ + 40 ADJ+V)     96.0%
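Reading table 5 so that the testsuite column equals true positives plus false negatives (the false negatives later received manually created analyses, as described below) is our interpretation of the column descriptions in the text; under that reading, the reported recall values can be recomputed as follows.

```python
# Recall = TP / (TP + FN), recomputed from the counts in table 5.
table5 = {
    "Nouns":      {"testsuite": 726, "false_negatives": 13},
    "Verbs":      {"testsuite": 795, "false_negatives": 2},
    "Adjectives": {"testsuite": 803, "false_negatives": 32},
}
for pos, c in table5.items():
    tp = c["testsuite"] - c["false_negatives"]
    recall = tp / c["testsuite"]
    print(f"{pos}: recall = {tp}/{c['testsuite']} = {recall:.1%}")
# Nouns: 713/726 = 98.2%, Verbs: 793/795 = 99.7%, Adjectives: 771/803 = 96.0%
```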
Table 6. Precision on the basis of analyses

POS           Total analyses   True positives   False positives   Precision
NN            6,174            4,414            1,759             71.49%
V             2,270            2,147            123               94.58%
V+ADJ/ADJ+V   3,129            2,878            251               91.98%
ADJ           11,239           8,200            3,039             79.84%
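Analogously, precision can be recomputed from table 6 as true positives divided by the total number of analyses. The first three rows reproduce the printed percentages; for the ADJ row, the printed counts yield 72.96% rather than the printed 79.84%, so one of the ADJ figures appears to have suffered in transcription from the source.

```python
# Precision = TP / total analyses, recomputed from table 6.
table6 = {
    "NN":          {"total": 6_174,  "tp": 4_414},
    "V":           {"total": 2_270,  "tp": 2_147},
    "V+ADJ/ADJ+V": {"total": 3_129,  "tp": 2_878},
    "ADJ":         {"total": 11_239, "tp": 8_200},
}
for pos, c in table6.items():
    print(f"{pos}: precision = {c['tp'] / c['total']:.2%}")
# NN: 71.49%, V: 94.58%, V+ADJ/ADJ+V: 91.98%, ADJ: 72.96% (table prints 79.84%)
```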
In table 5, column 3 contains the number of word forms that were not processed and were deleted from the testsuites: these word forms were found to be typos, foreign words, abbreviations, etc. Column 4 shows the number of false negatives; for those, analyses were created manually and added to the final testsuites. In the list of the processed word forms, there were also a number of entries that did not become part of the respective testsuite (e.g. because they were of other parts of speech), see column 5. Column 6 lists the number of word forms that remained in the testsuites (note that the online documentation24 lists all the respective word forms). Table 6 shows the precision values calculated on the basis of the analyses SMOR had produced. Here, we differentiate between the verbs, the adjectives, and the syncretic forms belonging to both parts of speech (POS). The high number of false positives can be related directly to the high number of stems in the analyzer: SMOR is intended by its designer to find as many analyses as possible, hence it usually overgenerates.
24 See the section Testsuite Dokumentation: "True Negatives" – Erläuterungen und Listen.
7.3 Developing Guidelines

As stated above, developing guidelines for the morphological analysis of German word forms is rather the result of discussions and perspectives than of facts to rely on. Especially in such a case, the documentation of the resulting guidelines must be extensive – and their developers should be open to discussion. Therefore, the online documentation asks users for feedback.

7.4 Assumed User Requirements

So far, only expert users have utilized SMOR; therefore, we are in the same situation as described in section 4.2. Evaluation of SMOR currently rather means ensuring that "the resource in question is what it claims to be and providing sufficiently detailed information ... to enable potential users to decide for themselves whether they wish to acquire it" [28, p. 130]. However, talks with potential users revealed that we can expect at least two user groups: one wanting to use SMOR as an instrument for deep, finely grained analysis of word forms, and another wanting to make use of SMOR as part of a processing chain where usually only lemma and inflectional information are relevant. Further user requirements will be defined and implemented following the results of an online survey planned for 2011.
8 Conclusions and Future Work

This article describes a methodology and an implementation of a user-oriented evaluation of a morphological analyzer as a case in point. The methodology was developed mainly according to the principles described by [14,6,18] and [28]. These lead to extensive documentation of the tool, the tests, and the testing results. DocBook XML is utilized as the description format because it enables the evaluator to define evaluation features as attribute-value pairs, which can be structured hierarchically. DocBook XML can easily be transformed to .pdf and to .html, which allows the documentation to be browsed online. All resulting documents, the testsuites and the tables describing the testsuite entries according to linguistic principles are freely available for download on the web25, some of them as comma-separated files (.csv) allowing for further automated processing, e.g. the generation of testsuites containing word forms that show specific linguistic features. Still, several tasks remain: an online survey to be conducted this year is expected to show who is using the tool and to inform us about users' requirements not yet foreseen. These requirements will have to be implemented and documented.

Acknowledgments. We are very grateful to Ulrich Heid (University of Hildesheim, formerly University of Stuttgart) and Helmut Schmid (University of Stuttgart), who both gave important input and made many very useful remarks throughout the duration of the project. During the first months, Fabienne Cap took an active part in the discussions, too, and we thank her for her important contributions. We also cordially thank our students Alexandra Kolb, Natali Mavrović and Ronny Jauch, who contributed to the work on the testsuite and its design with much enthusiasm and perseverance.
References

1. Bankhardt, C.: D-SPIN – Eine Infrastruktur für Deutsche Sprachressourcen. Sprachreport 25(1), 30–31 (2009)
2. Baroni, M., Kilgarriff, A.: Large linguistically-processed web corpora for multiple languages. In: Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pp. 87–90 (2006)
3. Barr, V.B., Klavans, J.L.: Verification and Validation of Language Processing Systems: Is it Evaluation? In: ACL 2001 Workshop on Evaluation Methodologies for Language and Dialogue Systems, pp. 34–40 (2001)
4. Belz, A.: That's Nice... What Can You Do With It? Computational Linguistics 35(1), 111–118 (2009)
5. Bevan, N.: Quality in use: Meeting user needs for quality. Journal of Systems and Software 49(1), 89–96 (1999)
6. EAGLES: Evaluation of Natural Language Processing Systems, EAG-EWG-PR.2, final report (1996)
7. Faaß, G., Heid, U.: Nachhaltige Dokumentation virtueller Forschungsumgebungen. In: Tagungsband: 12. Internationales Symposium der Informationswissenschaft (ISI 2011), Hildesheim, Germany, March 9–11 (2011)
8. Faaß, G., Heid, U., Schmid, H.: Design and application of a Gold Standard for morphological analysis: SMOR in validation. In: Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010), pp. 803–810 (2010)
9. Fitschen, A.: Ein Computerlinguistisches Lexikon als komplexes System (PhD Dissertation). AIMS – Arbeitspapiere des Instituts für Maschinelle Sprachverarbeitung, vol. 10. Lehrstuhl für Computerlinguistik, Universität Stuttgart, Stuttgart (2004)
10. Gonzales, A., Barr, V.: Validation and verification of intelligent systems – what are they and how are they different? Journal of Experimental and Theoretical Artificial Intelligence 12(4), 407–420 (2000)
11. Harris, L.R.: Prospects of Practical Natural Language Systems. In: Proceedings of the 18th Annual Meeting of the Association for Computational Linguistics, p. 129 (1980)
12. Hausser, R. (ed.): Linguistische Verifikation. Dokumentation zur Ersten Morpholympics 1994. Niemeyer, Tübingen (1996)
13. Hinrichs, M., Zastrow, T., Hinrichs, E.: WebLicht: Web-based LRT Services in a Distributed eScience Infrastructure. In: Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010), pp. 489–493 (2010)
14. International Standard ISO/IEC 9126: Information technology – Software product evaluation – Quality characteristics and guidelines for their use. ISO, Geneva (1991)
15. King, M., Underwood, N.: Evaluating symbiotic systems: the challenge. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), pp. 2475–2478 (2006)
16. Kurimo, M., Varjokallio, M.: Unsupervised morpheme analysis evaluation by a comparison to a linguistic gold standard – Morpho Challenge 2008. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Peñas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706. Springer, Heidelberg (2009)
17. Kurimo, M., Virpioja, S., Turunen, V.T., Blackwood, G.W., Byrne, W.: Overview and results of Morpho Challenge 2009. In: Peters, C., Di Nunzio, G.M., Kurimo, M., Mostefa, D., Penas, A., Roda, G. (eds.) CLEF 2009. LNCS, vol. 6241, pp. 578–597. Springer, Heidelberg (2010)
18. Lehmann, S., Oepen, S., Regnier-Prost, S., Netter, K., Lux, V., Klein, J., Falkedal, K., Fouvry, F., Estival, D., Dauphin, E., Compagnion, H., Baur, J., Balkan, L., Arnold, D.: TSNLP – Test Suites for Natural Language Processing. In: Proceedings of the 16th International Conference on Computational Linguistics, vol. 2, pp. 711–716 (1996)
66
G. Faaß
19. Mahlow, C., Piotrowski, M.: A Target-Driven Evaluation of Morphological Components for German. In: Searching Answers – Festschrift in Honour of Michael Hess on the Occasion of His 60th Birtday, pp. 85–99. MV-Verlag, Münster (2009) 20. Manzi, S., King, M., Douglas, S.: Working towards User-oriented Evaluation. In: Proceedings of the International Conference on Natural Language Processing and Industrial Applications (NLP+IA 1996), pp. 155–160 (1996) 21. Schiller, A.: Deutsche Flexions- und Kompositionsmorphologie mit PC-KIMMO. In: Hausser, R. (ed.) Linguistische Verifikation. Dokumentation zur Ersten Morpholympics, pp. 37–52. Niemeyer, Tübingen (1996) 22. Schiller, A., Teufel, S., Stöckert, C., Thielen, C.: Vorläufige Guidelines für das Tagging deutscher Textcorpora mit STTS. Technical report, Universität Stuttgart, Institut für maschinelle Sprachverarbeitung, and Seminar für Sprachwissenschaft, Universität Tübingen (1995) 23. Schmid, H.: A programming language for finite state transducers. In: Yli-Jyrä, A., Karttunen, L., Karhumäki, J. (eds.) FSMNLP 2005. LNCS (LNAI), vol. 4002, pp. 308–309. Springer, Heidelberg (2006) 24. Schmid, H., Fitschen, A., Heid, U.: A German Computational Morphology Covering Derivation, Composition, and Inflection. In: Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), pp. 1263–1266 (2004) 25. Sparck Jones, K., Galliers, J.R.: Evaluating Natural Language Processing Systems. LNCS (LNAI), vol. 1083. Springer, Heidelberg (1996) 26. Spiegler, S., Monson, C.: EMMA: A novel Evaluation Metric for Morphological Analysis. In: Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pp. 1029–1037 (2010) 27. Thompson, B.H.: Evaluation of Natural Language Interfaces to Data Base Systems. In: Proceedings of the 19th Annual Meeting of the Association for Compuational Linguistics (ACL 1981), pp. 39–42 (1981) 28. Underwood, N.L.: Issues in Designing a Flexible Validation Methodology for NLP Lexica. In: Rubio, A., Gallardo, N., Castro, R., Tejada, A. (eds.) Proceedings of the First International Conference on Language Resources and Evaluation, vol. 1, pp. 129–134 (1998)
HFST—Framework for Compiling and Applying Morphologies

Krister Lindén, Erik Axelson, Sam Hardwick, Tommi A. Pirinen, and Miikka Silfverberg

University of Helsinki, Department of Modern Languages, Unioninkatu 40 A, FI-00014 Helsingin yliopisto, Finland
Abstract. HFST (Helsinki Finite-State Technology) is a framework for compiling and applying linguistic descriptions with finite-state methods. HFST currently connects some of the most important finite-state tools for creating morphologies and spellers into one open-source platform and supports extending and improving the descriptions with weights to accommodate the modeling of statistical information. HFST offers a path from language descriptions to efficient language applications in key environments and operating systems. HFST also provides an opportunity to exchange transducers between different software providers in order to get the best out of each finite-state library.

Keywords: Finite-state libraries, finite-state morphology, natural language applications.
1 Introduction

Including language technology in ordinary software applications can be both laborious and expensive if every language needs its own piece of software in addition to its lexicons and grammars. Standard interfaces to language modules have long been an effort both promoted and hampered by large organizations, each having their own standard. However, with finite-state technology it is possible to go further and encode a range of language modules as finite-state transducers with a unified access interface. To create the transducers, we still need lexicons and grammars for each language. Lexicons and grammars exist for close to a hundred languages, with various degrees of coverage and elaboration. There are ongoing efforts to collect and list available sources in various software and data registries, e.g., VLO and META-SHARE, as well as more specific efforts for listing open-source finite-state descriptions on the HFST–Helsinki Finite-State Technology web site.
Finite-state transducer (FST) technology has been well known for several decades, and many packages of FST calculus exist. Some of them are available as open source. If they lose support, changing to another may easily mean redeveloping the language description.

The primary goals of HFST are to unify the field and create a framework for developing, compiling and applying morphologies; to create convergence and cooperation within the community that develops finite-state calculus and tools; to create a neutral platform where different implementations of the finite-state calculus can coexist and compete with each other; and to create a critical mass of research for improving the basic algorithms of the calculus and the compilation algorithms. HFST does this by providing an interface to an increasing number of software libraries for processing finite-state transducers in specialized ways. As far as possible, we have tried to avoid implementing yet another finite-state calculus. We have rather utilized existing free open-source implementations, e.g., SFST by Helmut Schmid [22] and foma by Måns Huldén [10] for transducers without weights, as well as OpenFst by M. Riley, J. Schalkwyk, W. Skut, C. Allauzen and M. Mohri [2] for weighted transducers. A structural layout of how this compatibility is accomplished in HFST is given in section 2.

A second set of goals for HFST is to create and collect readily available open-source morphologies in order to provide a platform for basic high-performing natural language processing tools, to stimulate the production of free open-source software for compiling dictionaries, grammars and rules into FSTs, and to stimulate the production of language resources (e.g., dictionaries, grammars, rules) to be compiled into FSTs. HFST offers compatible open-source tools for compiling such language descriptions as well as storing them in formats that can be exchanged and further processed by the different finite-state libraries available in HFST; see section 3.

An early adopter of the open-source paradigm was the SFST–Stuttgart Finite-State programming language (SFST-PL), which has also attracted developers of full-scale lexicons for various languages, e.g., German, Finnish, Turkish, or Italian. The benefits of HFST are demonstrated by using the foma library to implement SFST-PL, which reduces the compile time to a fraction of the original. For further details, see section 4.

Over time, the Xerox commercial finite-state environment with compilers like TwolC, LexC and XFST has become popular, and many academically developed language descriptions are available for these tools. This set of tools is now supported by HFST as a set of open-source tools, with an additional tool for composing parallel two-level rule sets with a lexicon. This is outlined in section 5.

We provide some insight into our compact and high-speed runtime transducers, which allow processing speeds of more than 100,000 tokens per second using roughly one percent of the size of a corresponding uncompacted file, in section 6. Some of the open-source application areas where HFST is already in use, e.g., spell-checking through Voikko for OpenOffice, machine translation preprocessing via Apertium, and part-of-speech tagging using parallel weighted transducers, are outlined in section 7.

HFST supports both commercial and open-source applications. The tools and libraries of HFST are compatible with the GNU GPL license, which means that any FST
produced with HFST tools will remain under the licensing conditions of the input lexicons and rule sets. The HFST runtime format and runtime library are additionally released under the Apache license. This means that the HFST tools and the HFST runtime library can be used both for open source and proprietary projects. A further discussion of the impact of the HFST environment and some of the open-source morphologies currently available is provided in section 8.
2 Structural Layout

Finite-state transducer libraries such as SFST [22], foma [10] and OpenFst [2] all provide different ways of creating transducers; e.g., foma emulates the XFST formalism and SFST contains a compiler for its own regular expression formalism. The lack of a common formalism makes it difficult to compare the performance of the different libraries reliably, since they cannot be used for computing the same tasks. One of the original goals of the HFST project was to provide a framework in which it is possible to compare the performance of different finite-state transducer libraries on the same tasks. HFST 2.0 provided limited support for this by joining SFST and OpenFst under one interface. Regrettably, adding new libraries was cumbersome in HFST 2.0. To remedy this, HFST 3.0 was designed to make it practical to add both complete and partial implementations of transducer libraries and to use these for compiling LexC lexicons as well as TwolC, XFST and SFST-PL grammars.

Related to the goal of comparing performance is the goal of combining algorithms from different transducer libraries in one task. This is now possible in HFST 3.0, where transducers can be converted between the different underlying libraries. It is thus possible to utilize well-implemented algorithms from a variety of libraries in order to achieve faster compilation times. It is also possible to test the effect of a single transducer operation on the total compilation time of a task by switching between operations from different specialized libraries, i.e., it is possible to create specialized data structures for certain operations in a separate library and use them for some special-purpose operation without the need to re-implement a full set of well-researched basic transducer operations.

We first present the general structure of the HFST transducer class library, and then we outline the procedure for adding a new or specialized library. We then mention the coding principles for exception handling in HFST and how new libraries can be linked and tested.

2.1 General Layout of HFST

Transducers in HFST are objects of the class HfstTransducer. The class supports all the ordinary transducer operations like disjunction, composition and determinization. HfstTransducer encapsulates the different transducer libraries under the HFST interface, i.e., the same code will work for all transducer libraries that are part of HFST. For example, the function in figure 1 works equally well for transducers whose underlying implementation is an SFST or a foma transducer.
Fig. 1. A function that computes a language, assigns it to the transducer center, and returns a reference to center (code listing not reproduced)
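The listing in figure 1 is not preserved in this copy; a minimal sketch of such a function against the public HfstTransducer API might look as follows. The function name and the particular language [a:b]* are illustrative assumptions, not the original example:

    #include "HfstTransducer.h"

    using namespace hfst;

    // Builds the language [a:b]* and assigns it to `center`, independently
    // of the back-end library that `center` was created with.
    HfstTransducer& compute_center(HfstTransducer& center)
    {
      ImplementationType type = center.get_type(); // keep the caller's back-end
      HfstTransducer ab("a", "b", type);           // one transition a:b
      ab.repeat_star().minimize();                 // [a:b]*
      center = ab;
      return center;
    }

Because all operations go through the HfstTransducer interface, the same function compiles and runs unchanged whether type resolves to the SFST, foma or OpenFst back-end.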
Internally, HfstTransducer objects contain pointers to specific transducer library implementations. These are wrappers around the actual transducer libraries. Currently there are four wrappers for three libraries: SFST, OpenFst and foma. The tropical-weight and log-weight semirings in OpenFst have separate wrappers. An HfstTransducer can be initialized using any of the available transducer libraries, and it is possible to convert between the libraries at runtime.

A wrapper for a library consists of input and output stream classes for reading and storing binary transducers, and a transducer class. The wrappers encapsulate implementation-specific details of the different transducer libraries, thus giving the HFST interface a unified way to manipulate objects from the different transducer libraries. For example, the SFST wrapper comprises an input stream class, an output stream class and a transducer class. Internally, each HfstTransducer object points to one of these implementations, and when an interface function is called, the corresponding function of the underlying implementation gets called.

In binary operations like concatenation, there is a risk of a transducer type mismatch, since the objects involved may have different types. In case they do, an exception is thrown. This exception can be caught, and the transducers can be converted to a common type; the concatenation can then safely take place, as sketched below. However, in HFST 3.0 we have made a conscious choice not to convert transducers automatically in binary operations, because this might lose information, e.g., when converting from weighted to unweighted formats.

2.2 Adding New Libraries to HFST

The task of adding a new transducer library to HFST can be broken into three subtasks:

1. Building the wrapper classes, i.e., the input stream, output stream and transducer classes for the new library.
2. Adding conversion functions between objects of the new library and the HFST internal format HfstTransitionGraph.
3. Adding the necessary declarations to the master interface files in order to use the interface functions properly.

Unless the new library has an alphabet implementation which associates string symbols with symbol numbers, such an alphabet also has to be created.
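Returning to the type mismatch discussed above, a minimal sketch of catching the exception and converting to a common type might look as follows; the exception class and enum value are assumed from the HFST 3.0 public API:

    #include "HfstTransducer.h"

    using namespace hfst;

    // Concatenate two transducers that may live in different back-end libraries.
    HfstTransducer concatenate_safely(HfstTransducer a, HfstTransducer b)
    {
      try {
        return a.concatenate(b);
      } catch (HfstTransducerTypeMismatchException& e) {
        // Convert both operands to a common type and retry. Converting to a
        // weighted type (here OpenFst with tropical weights) avoids losing
        // weight information, which an unweighted target could not represent.
        a.convert(TROPICAL_OPENFST_TYPE);
        b.convert(TROPICAL_OPENFST_TYPE);
        return a.concatenate(b);
      }
    }

Note that the operands are taken by value, so the conversion stays local to the function and never silently changes the caller's transducers, in keeping with the no-automatic-conversion policy described above.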
There is also support in the current HFST interface for transducers over weight semirings that cannot be represented as floating-point numbers, although this may require implementing some new functions.

2.3 Coding Principles

Exceptional situations occur in computer programs when the user does something unexpected or there is a bug in the code. Examples of user-originating situations in a finite-state library include:

1. The user tries to read a binary transducer from a file that contains a text document or does not exist.
2. A transducer in the AT&T text format has a typo on one line, so the line cannot be parsed.
3. The user calls a function without checking its preconditions, e.g., tries to extract all paths from a cyclic transducer.

Throwing an exception on such occasions gives the user a possibility to catch the exception and recover from the situation. In HFST version 2.0, exceptional situations were handled by printing a short message on the standard error stream and exiting with an error code. In HFST version 3.0, exceptions are classes that have a descriptive name and contain an optional error message. For the above scenarios, HFST throws the following exceptions:

1. A NotTransducerStreamException or a StreamNotReadableException, with the name of the file or stream in the error message.
2. A NotValidAttFormatException, with the line that could not be parsed in the error message.
3. A TransducerIsCyclicException.

The user could react to the exceptions in the following ways:

1. Check that the file exists and contains transducers, and try again with the correct file.
2. Fix the typo in the text format.
3. Call another function that limits the number of paths extracted from the transducer.

Exceptions are also used internally in the HFST library for reporting to a calling function that something unexpected happened. The calling function can handle the situation itself or inform the user and suggest what they should do, which means that the execution of the program may continue or terminate gracefully. An HfstFatalException is thrown when it is unlikely that the user can handle the exception. The user should instead report the exception and its circumstances to the HFST developers, because the exception is essentially a bug that must be fixed. Some assertions are also used for internal checks. When an assertion fails, the user should similarly report the failure as a bug.
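A hedged sketch of reacting to the first scenario, assuming the stream class HfstInputStream and the exception names listed above:

    #include <iostream>
    #include <vector>
    #include "HfstTransducer.h"
    #include "HfstInputStream.h"

    using namespace hfst;

    // Read all transducers from a binary HFST file, recovering from bad input.
    std::vector<HfstTransducer> read_transducers(const std::string& filename)
    {
      std::vector<HfstTransducer> result;
      try {
        HfstInputStream in(filename);
        while (!in.is_eof())
          result.push_back(HfstTransducer(in));  // one transducer at a time
        in.close();
      } catch (NotTransducerStreamException& e) {
        std::cerr << filename << " does not contain binary transducers\n";
      } catch (StreamNotReadableException& e) {
        std::cerr << filename << " cannot be read\n";
      }
      return result;
    }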
2.4 Dynamic Linking to Underlying Libraries

There has been considerable progress in achieving the HFST goal of acting as a compatibility layer between different representations of finite-state transducers and, more importantly, the operations and formalisms (e.g., LexC, TwolC, XFST, SFST-PL) that have been implemented for them. HFST is now independent of any particular library and requires no custom extensions to the libraries it uses. Previously, HFST relied on custom extensions to the libraries it supported, namely OpenFst and SFST, which made it necessary to statically link them into the HFST library. This was obviously also rather restrictive in terms of new versions and different use cases, e.g., experimenting with local changes to the underlying libraries. HFST 3.0 supports conditional compilation of all its elements that provide interfaces to underlying libraries, and dynamic linking is done to whichever libraries the user configures HFST 3.0 to use.

Due to these recent improvements, HFST 3.0 can also be built without any external libraries, in which case the code supports only the operations for a simple internal representation and optimized lookup (cf. section 6), i.e., building from a text representation and fast lookup. If the user's own library is included, the code will also support compilation into the optimized-lookup format.
3 Data Formats

Each library is also expected to be able to handle its own external binary data format. So that the external binary data of a specific library can be handled correctly with the HFST command-line tools, the external binary data files are prepended with a header. Below, we outline the structure of this header.

Each of the underlying libraries of HFST is expected to have its own internal data format. As mentioned in the previous section, when adding a new library, conversion functions to and from the library-specific internal format need to be provided. We outline the procedure for this, which is normally linear in time. In addition, we may need to create transducers which deal with symbols that have not yet been specified in their alphabet. For this reason, we also need functions for harmonizing the alphabets between two or more transducers of the same format. We give the preconditions for this operation and refer the interested reader to the literature.

3.1 Transducer Binary Format

An HFST transducer in binary format consists of an HFST header followed by the back-end implementation in binary format. In version 3.0, the header format is less error-prone than in the previous versions, as it gives more information both for users of HFST, when seen on a screen or in a text editor, and for the HFST library itself. The current header format is somewhat similar to foma's, where pieces of information are separated by newline characters to make them more readable. In HFST version 2.0,
we represented the properties of a transducer in a two-byte bit vector akin to the OpenFst header format. The type of the transducer and the existence of an optional alphabet in the transducer were also encoded with two characters at the beginning of the binary transducer, which required familiarity with the specifications to interpret. The new header format makes it easier to react to unexpected situations and inform the user, if necessary. When we read an HFST binary transducer, we first check whether the identifier ‘HFST’ is found. If not, we know that the user has given an incorrect file type and can throw an appropriate exception. Next we recognize the implementation type of the transducer. If the back-end transducer library is not linked to HFST, we can handle the situation by throwing another exception. Then we recognize the version of the header in order to process the rest of the header and the back-end implementation correctly. If the user has requested a verbose mode for a tool that is reading the transducer, it is also possible to print the name of each transducer before or after reading it.

3.2 Conversion between Different Back-End Formats

In HFST version 3.0, the conversion between the different back-end formats, i.e., SFST, foma, and OpenFst with the tropical or logarithmic semiring, is carried out through HFST's own transducer format, HfstTransitionGraph. The HFST internal format is a simple transition graph data type that consists of states (unsigned integers) and transitions between those states. We have chosen to implement HfstTransitionGraph for two reasons. Firstly, it serves as an intermediate transducer format in conversions, thus reducing the number of conversion functions from N × (N − 1) to 2 × N, where N is the number of different transducer back-end formats. Secondly, it is easy to implement functions for HfstTransitionGraph that allow the user to construct transducers from scratch and iterate through their states and transitions. Implementing such features for an existing transducer library can sometimes require modifications of the library if the library is designed to be used on a higher level of abstraction; e.g., in SFST and foma, the functions that operate on states and transitions were protected and not well documented.

HfstTransitionGraph is a class template with template parameters C and W. C defines the type of transition data that a transition uses, and W the weight type that is used in transitions and final states. HfstTransitionGraph contains two maps. One maps each state to a set of the state's transitions, which are of the type HfstTransition<C>. The other maps each final state to its final weight, which is of type W; class C must use the weight type W. A state's transition HfstTransition<C> contains a target state and a transition data field of type class C. Strictly speaking, HfstTransitionGraph is not a transducer but a more generalized transition graph that can contain many kinds of data in its transitions. Currently, the HFST library offers the specializations HfstBasicTransducer and HfstBasicTransition for HfstTransitionGraph and HfstTransition. These specializations are designed for weighted transducers: the weight class W is a float, and the transition data class
contains an input string, an output string and a weight of type float. The specializations HfstBasicTransducer and HfstBasicTransition are used when converting between the different transducer back-end formats. The class template HfstTransitionGraph is designed so that it can easily be extended to different kinds of transition data types. For example, if the HFST tools are used in text-to-speech or speech-to-text conversion, the weights may be more complex and the symbol type of the transitions will probably be something other than strings.

3.3 Alphabet

The alphabet of a transducer consists of all symbols (strings) that are known to that transducer. The alphabet includes all symbols that occur or have occurred in the transitions of the transducer, unless explicitly removed from the alphabet. If we apply a binary operation (e.g., disjunction or composition) to transducers A and B, the resulting transducer's alphabet will include all symbols that were in the alphabets of A and B.

In HFST version 2.0, alphabets were designed to be external to the transducer, and the interface did not offer any convenient way for the user to access a transducer's internal representation of the alphabet. It was up to the back-end implementation to take care of the alphabet of an individual transducer. In SFST the transducers always have an explicit alphabet, but in OpenFst their use is optional.

In HFST version 3.0, we need to be aware of transducer-specific alphabets because two new special symbols are included: unknown and identity. These special symbols are part of the Xerox Finite-State Tool (XFST) formalism [5], and they are also implemented in foma [10]. The unknown and identity symbols are useful when we want to refer to all symbols that are not currently known to a transducer but of which the transducer can later become aware. Supporting the unknown and identity symbols in all HFST back-end implementations has enabled us to provide an XFST compiler that can be used with all back-end implementations. In this way, we can offer users of HFST a new formalism for regular expressions in addition to the one available in SFST. As the extension is already implemented in foma, we only needed to consider SFST and OpenFst in HFST 3.0.

Aside from keeping track of all symbols known to an individual transducer, we also have to expand each transition involving the unknown and identity symbols into a set of transitions every time we apply a binary operation to two transducers. This is because the transducer becomes aware of new symbols, which are no longer unknown and thus no longer covered by the unknown or identity symbols. Fortunately, this expansion can be done before the operation itself (and for composition, before and after the operation), i.e., it is not necessary to make changes in the operations of the back-end transducer libraries. The library operations can and will handle the special symbols just like any ordinary symbols. First we iterate through the alphabets of both transducers and find out which symbols in the alphabet of one transducer are not found in the alphabet of the other, and vice versa. Then we add, beside each transition involving the unknown or identity symbols, a set of transitions in which these symbols are replaced with all symbols that the transducer just became aware of. For more information on how to expand the special symbols, see [10] and [5].
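To make the conversion path of section 3.2 concrete, the following hedged sketch builds a small transducer from scratch in the internal format and then wraps it in a back-end implementation; the constructor and method names follow the HfstBasicTransducer API as documented for HFST 3.0:

    #include "HfstTransducer.h"

    using namespace hfst;
    using hfst::implementations::HfstBasicTransducer;
    using hfst::implementations::HfstBasicTransition;

    int main()
    {
      HfstBasicTransducer graph;   // starts with a single initial state 0
      graph.add_state(1);          // create state number 1
      // One transition mapping the surface string "cat" to the analysis "cat+N".
      graph.add_transition(0, HfstBasicTransition(1, "cat", "cat+N", 0.0f));
      graph.set_final_weight(1, 0.0f);

      // Wrap the graph in a back-end implementation; the conversion is
      // linear in the size of the graph, as noted in section 3.2.
      HfstTransducer analyzer(graph, TROPICAL_OPENFST_TYPE);
      return 0;
    }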
It is also possible to switch off the handling of the special symbols if we know for certain that they are not used in the transducers. In this way, we can optimize performance, for instance, for the tool hfst-calculate, which processes the SFST-PL formalism; SFST-PL does not support the unknown or identity symbols.
4 SFST Programming Language Compatibility

The performance of HFST has improved from version 2.0 to 3.0. We compiled two finite-state morphologies in the SFST-PL format with HFST versions 2.0 and 3.0. The morphologies were OMorFi [17] for Finnish and Morphisto [27] for German. Table 1 shows the compilation times for both morphologies with different back-end implementations under both versions of HFST. Note that the foma implementation was not available in version 2.0.

Table 1. Compilation times for the Finnish and German morphologies with different HFST versions. The times are expressed in minutes and seconds.

Back-End   HFST   Finnish   German
SFST       2.0    25:16     107:47
           3.0    5:02      6:39
OpenFst    2.0    7:54      6:23
           3.0    6:51      6:28
foma       2.0    —         —
           3.0    1:49      1:29
We can clearly see that the compilation time has improved dramatically for the SFST implementation. This is mainly because the new version of SFST, 1.4.2, uses Hopcroft's minimization algorithm [9] instead of Brzozowski's [6]. We noticed that the Brzozowski minimization algorithm hampered performance already when we were testing HFST version 2.0; OpenFst was clearly faster because it used the Hopcroft algorithm. Based on this observation, Helmut Schmid improved SFST by writing a minimization function using the Hopcroft algorithm.

When comparing the compilation times for OpenFst, we see that the Finnish morphology compiles faster but the German one slightly slower on HFST version 3.0 than on version 2.0. This is because there are two main factors that contribute to the difference in performance. Firstly, we are currently using OpenFst version 1.2.7, which is faster than the previous versions. Secondly, in HFST version 3.0 we no longer use a global number-to-symbol encoding for all transducers during one session; every time we perform a binary operation on two transducers, we harmonize the encodings of the transducers. Nevertheless, it seems that the newer, more efficient version of OpenFst mostly compensates for the additional effort caused by harmonization.

We did not have the foma implementation of SFST-PL available in HFST version 2.0, but it is evident that it is much faster than the other implementations in either
version of HFST. Foma does not use a global symbol-to-number encoding in its transducers either, but it still performs well. This is evidence that symbol harmonization is not a big factor in the compilation times of morphologies.
5 Xerox Compatibility

Among the goals of the HFST framework has always been to retain legacy support for the Xerox line of tools for building finite-state morphologies [5]. Optimally, the tools should be familiar to end users converting from the Xerox tools. For this purpose we have aimed to create clones of the most important Xerox tools as accurately as possible. Previous open-source implementations of Xerox tool clones have included LexC and TwolC [13] as well as LexC and XFST [10]; in HFST 3.0 we have combined these contributions into one uniform package capable of handling the full line of Xerox tools for morphology. For the most part, end users will require no other familiarization than changing program names in order to start using the HFST tools for their Xerox-style language description needs.

The implementation of the XFST scripting language makes heavy use of the new foma back-end, which already had good coverage of XFST features. The only additions in HFST are the ones required for interoperability between other back-ends and HFST internals. Particular care was taken not to duplicate the work already present in foma and its tools. Similarly, HFST's LexC parsing engine was replaced by the faster LexC parser in foma, again with HFST interoperability tweaks straddling the gaps.

For practical examples of specific previously implemented Xerox-style finite-state language descriptions, we provide a wiki-based web page. Another repository of such language descriptions is located in the University of Tromsø's subversion repository. Of these, the morphologies for the Sámi languages and Greenlandic are regularly used in regression and stress tests of the HFST tools.

For specific functionalities, the Xerox tools perform various kinds of special processing of finite-state transducers beyond the range of standard finite-state algorithms. A prominent example of this is the handling of special symbols such as flag diacritics [4], which would require support from the underlying libraries for many finite-state operations to work as they do in the Xerox tools under the corresponding settings. The HFST tools provide support for such options, and provide fall-back processing where the back-end libraries lack support for the required operations. Fall-back support commonly involves converting the back-end library's internal transducers to the HFST internal format, calculating the operation, and converting the transducer back to the back-end library format.

5.1 Intersecting Composition

Intersecting composition is used for applying a grammar of two-level rules to a two-level lexicon. The result of the operation is, e.g., a morphological analyzer mapping word forms to analyses. Compiling the analyzer using conventional methods requires
computing the intersection of the rule transducers. This may lead to a prohibitively large intermediate result. Intersecting composition avoids computing the entire intersection of the rules, thus reducing both the memory and the time requirements. The operation was introduced by Karttunen [11] and later extended to weighted transducers in HFST 2.0 by Silfverberg and Lindén [23].

The intersecting composition operation was already implemented in HFST 2.0, but we used techniques adopted from OpenFst [2] to improve the implementation, and the current implementation is significantly faster than the old one. The current implementation computes a lazy pairwise intersection of the rule transducers. The lexicon can be composed with this intersection using a standard composition algorithm.

Previous Implementation. The implementation of intersecting composition in HFST 2.0 can be characterized as the composition of the lexicon transducer L with a structure P containing all rule transducers. For simplicity, we assume that the lexicon and rule transducers are deterministic. Outwardly, the structure P resembles an ordinary transducer with states and transitions. The states of P internally correspond to vectors of rule transducer states, which we call state configurations. The vectors have as many indexes as there are rules, and each rule corresponds to a unique index, at which its state is stored. For example, the start state of P corresponds to the vector containing the start states of the rules.

Initially, only the start state of P is computed. More states in P are computed according to the transitions in L. For example, the lexicon L might have a transition with some output symbol a in its initial state. In order to compute the intersecting composition, it would be necessary to create transitions and corresponding target states in P for all symbol pairs a:b such that each of the rules has a transition with the symbol pair a:b from its initial state. Outwardly, P would have one target state t for the transition with pair a:b from its initial state. Internally, t would correspond to the configuration of the target states of the transitions with symbol pair a:b in each of the individual rules.

There is no caching of transitions in the states of P. Thus the transitions in states have to be recomputed every time the algorithm visits a given configuration of rule states. This requires more work than if the transitions were cached, since there usually exist state configurations which are visited very frequently. Even if the transitions in states were cached, this implementation would still be suboptimal, since a new state configuration always requires re-examining the transitions in all rules. This is true even if only one rule state differs from a previous configuration.

Current Implementation. We note that phonological two-level rules usually track sound changes in fairly specific contexts. This means that when composing two-level rule transducers with a lexicon, the rules occupy a limited state set during the majority of the composition. The current implementation of intersecting composition capitalizes on this property. Instead of a parallel lazy intersection as in HFST 2.0, we recursively build an intersection of the rules by intersecting them lazily pairwise: the first and second rule are intersected, the result is intersected with the third rule, and so on. The HFST 3.0 implementation caches the transitions of a given state pair, so there is no need to recompute them when the state is revisited.
Table 2. Runtimes for intersecting composition of the Finnish and Northern Sámi morphological analyzers in HFST 2.0 and HFST 3.0

Language      HFST 2.0   HFST 3.0
North Sámi    364.2 s    63.4 s
Finnish       4.1 s      2.6 s
This leads to a significant improvement in performance; see table 2. The improvement results mainly from caching transitions, but also from the fact that computing the transitions in a previously unseen state configuration does not require recomputing the transitions in all of the rules. For example, if rule number n moves to a new state with symbol pair a:b, but the rest of the rules remain in a familiar state configuration, we only need to recompute the transitions of the rules having a greater index than n. This derives from the fact that we have already cached the target state of the subset of rules 1 to n − 1 in their lazy intersection structure. Like the old implementation, the current implementation of intersecting composition handles Xerox-style flag diacritics and special symbols.
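In the HFST 3.0 API this operation is exposed on HfstTransducer; a minimal hedged sketch, assuming a LexC-compiled lexicon and a vector of compiled two-level rule transducers, is:

    #include "HfstTransducer.h"

    using namespace hfst;

    // Apply a set of parallel two-level rules to a lexicon without ever
    // materializing the full intersection of the rules.
    HfstTransducer build_analyzer(HfstTransducer lexicon,
                                  const HfstTransducerVector& rules)
    {
      lexicon.compose_intersect(rules); // lazy pairwise intersection of rules
      lexicon.minimize();
      return lexicon;
    }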
6 Runtime and Optimized Lookup

Optimized-lookup is an HFST-specific binary format for finite-state transducers providing fast lookup. First documented in [24], its implementation has evolved somewhat to meet the requirements of specific applications. The format has also found new applications in on-the-fly operations on transducers, e.g., composing lookup for spell-checking and correction; see section 7.2.

6.1 Implementation and Integration in HFST 2.0 and HFST 3.0

Optimized-lookup was supported in HFST 2.0 by two standalone utilities: one for compiling transducers into the format, and hfst-optimized-lookup for non-tokenizing lookup. This was partly in service of the goal of giving optimized-lookup the widest possible range of uses; hfst-optimized-lookup was released under the Apache License [3], whereas HFST 2.0 proper was released under the potentially more restrictive GNU Lesser General Public License [7]. Provision was later made for both unweighted and weighted (with log weights) transducers, and for flag diacritics [5]; see also section 6.3.

Other Implementations and Applications. Demonstrations of the lookup facility were also produced in Java and Python, two popular and accessible programming languages, in the hope of facilitating and spreading use of the format. This effort bore fruit in the incorporation of the Java code in a project for anonymizing identities in legal documents at Aalto University in Helsinki. The format saw additional uses and implementations over the course of furthering research goals and maintaining HFST 2.0. Hyphenators and spell-checking transducers were primarily used in this format for its speed, and in 2010 a Google Summer of
Code project by Brian Croom produced hfst-proc, a tokenizing lookup application for optimized-lookup, which was put to use in various text stream processing scripts (e.g., for analysis, generation and hyphenation).

HFST 3.0. Originally, compilation to the format was only possible from the SFST and OpenFst formats, and as uses and applications proliferated, it became desirable to provide API access to optimized-lookup in the HFST library. In HFST 3.0 this has been accomplished by implementing compilation from the HFST internal transducer format, allowing for a great degree of integration with the HFST 3.0 supported tools.

6.2 Index Table Compaction

The crucial idea behind optimized-lookup is Liang compaction, as described in [24]. It allows for the representation of a transducer as a lookup table, with entries for each symbol in the alphabet for each state in the transducer, without growing to the prohibitive sizes such a design would imply, i.e., a multiple of the number of states and the number of symbols for the state indexing table alone. Liang, in his PhD thesis on hyphenation [12], did not specify a generalized compaction scheme, only the requirements for its correctness. In realistic transducers, finding the optimal compaction is in any case computationally infeasible, and consequently some approaches to producing a “good enough” compaction have been attempted.

For the purposes of this article, the task of compacting the index table may be summarized as follows. Given N arrays s_1, ..., s_N (one for each state), each of length L (the size of the alphabet), the entries of which are 0 (representing an unused entry) or 1 (representing a used entry), calculate a list of starting indexes I_1, I_2, ..., I_N such that a result array with entries

    R_i = \sum_{p,q} s_p(q) [I_p + q = i]    (1)
will also have entries 0 or 1. The brackets [ and ] are Iverson brackets: the value of the bracketed expression is 1 if the condition is true and 0 if it is false. The result array will contain all N arrays superimposed on each other in such a way that each entry in the result array is used by no more than one of the original arrays. An optimally compacted index table corresponds to the shortest result array R.

A simple strategy is to iterate through the arrays in some order, assigning the lowest possible starting index to each one. For transducers of an appreciable size this process can become slow, as the result array will typically have some zeros in practically all its regions, so a large number of possibilities have to be checked. This problem can be mitigated by applying a head filter: disregarding the largest region R(1 ... r) of the result array whose density of 1 entries is greater than some predefined limit. A limit of 1.0 corresponds to having no filter at all; in practice, limits in the region 0.8–0.9 have proved reasonable; see table 3.
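The first-fit strategy with a head filter can be stated compactly in code. The following self-contained sketch is our illustration of equation (1), not the HFST implementation itself:

    #include <cstddef>
    #include <vector>

    // First-fit compaction in the spirit of equation (1): place each state's
    // 0/1 usage array at the lowest starting index where its 1-entries collide
    // with nothing already placed. `limit` implements the head filter: the
    // search skips a prefix of the result array once that prefix's density of
    // 1-entries exceeds the limit.
    std::vector<std::size_t> compact_index_table(
        const std::vector<std::vector<char> >& s, double limit)
    {
      std::vector<char> result;                 // the superimposed array R
      std::vector<std::size_t> index(s.size()); // starting indexes I_1 ... I_N
      std::size_t head = 0, ones_in_head = 0;

      for (std::size_t p = 0; p < s.size(); ++p) {
        for (std::size_t start = head; ; ++start) {
          if (result.size() < start + s[p].size())
            result.resize(start + s[p].size(), 0);
          bool fits = true;
          for (std::size_t q = 0; q < s[p].size(); ++q)
            if (s[p][q] && result[start + q]) { fits = false; break; }
          if (fits) {
            for (std::size_t q = 0; q < s[p].size(); ++q)
              if (s[p][q]) result[start + q] = 1;
            index[p] = start;
            break;
          }
        }
        // Head filter: extend the skipped prefix while it stays denser than
        // the limit, so later states never rescan the most crowded region.
        while (head < result.size() &&
               double(ones_in_head + result[head]) / double(head + 1) > limit)
          ones_in_head += result[head++];
      }
      return index;
    }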
Table 3. Index table sizes for various values of the head filter limit, using in-order traversal, as applied to the Morphalou project's French morphology released on the HFST site on 2010-04-14

Head filter limit   Number of index entries
(No compaction)     4,541,879
0.0                 1,107,321
0.1                 408,843
0.3                 206,152
0.5                 170,789
0.7                 155,703
0.8                 148,740
0.9                 140,795
1.0                 135,285
, utility with an eye to efficiency, prompting some refinements to the HFST implementation of the format itself. Flag diacritics are parsed prior to lookup and during it they restrict the lookup search tree. This is critical for speed, as the alternative of calculating all the outputs first and then removing outputs with conflicting flag diacritics can, in the case of transducers with liberal use of flags, involve several times the work. For the purposes of traversing transitions, flag diacritics are essentially a special case of the epsilon symbol. If the configuration of flags that have been traversed up to a certain point allow it, each transition with a flag diacritic is traversed without reading an input symbol. With this in mind, it was desirable to avoid checking for transitions with each flag symbol in each state. The optimized-lookup format was therefore amended to treat flag diacritics as epsilon for purposes of constructing the index table and to list their transitions out of the normal order, after the epsilon transitions. Thus it is possible to check only those flag transitions that are present in a given state. This allows further reductions in the size of the index table; as the indexes for flag diacritics are no longer in use, it is possible to reduce the effective size of the input 9 10
alphabet by the number of different flag diacritic operations. These improvements may appear minor, but are not insignificant in transducers that make substantial use of flag diacritics. In the previously discussed version of OmorFi, we achieved a reduction of 12.4% in index table size.
7 Application Areas

After the initial release of the HFST platform, it has been used in several end-product applications. The two most prominent uses are as part of the rule-based machine translation platform Apertium and in the spell-checking library Voikko. Both of these linguistic applications benefit hugely from the fact that there were previous language descriptions available written in the Xerox finite-state morphology formalism, and integrating HFST into these applications gave the developers of those language descriptions a direct conversion path to two application types that had not been available in the Xerox framework. Since both applications, as well as the HFST framework, are free/libre open-source software, the integration of the existing language descriptions into the projects was possible.

One of the most pressing reasons for extending finite-state support to these applications is the lack of language support, and of a theoretically well-motivated open-source option for language support, for morphologically more complex languages in the above-mentioned applications. For example, in the field of spell-checking, the theoretical upper bound for hunspell (the de facto standard in the open-source market) is a mere 4 affixes. For polysynthetic languages like Greenlandic, it simply is not possible to precompute enough affix and stem combinations, as has been done with, e.g., Hungarian. Xerox-style finite-state morphology demonstrably supports at least Hungarian and a wide variety of other morphologically varied languages [5]. Another rationale for extending finite-state methods to these application areas is that the efficiency and expressiveness of finite-state automata are well known and have been researched, e.g., in [1], which makes them a good choice for various text-processing tasks.

7.1 Apertium Interoperability—Corpus Processing Tools and I/O Formats

In Apertium, the finite-state automata are used to perform morphological analysis and generation, both for parsing running text and for generating the translations after performing a mid-shallow rule-based transfer. The HFST software is only one possible morphological analyzer, so the crucial part for inclusion was to get the HFST analyzer to work like the competition. This included two functions: reliable tokenization based on the dictionary data, and support for the Apertium I/O formats. The corpus processing functionality is contained in a tool called hfst-proc, also included in the HFST toolkit. The name is influenced by similar corpus processing tools in other toolkits, specifically cg-proc from VISL CG3 and lt-proc from Apertium itself.
For tokenization, the FST-based dictionaries are useful, since analysis and lookup can both be performed by basic FST traversal. The specific implementation of analysis and tokenization with a single FST traversal was implemented as a Google Summer of Code project, based on previous studies on the topic [8]. The basic programming logic of the automata traversal for the longest match is trivially extended by the processing of flag diacritics and weights.

The I/O format requirements for the Apertium platform are based on the need to translate existing documents containing all kinds of markup and rich text formats, such as HTML for web pages or MediaWiki codes from Wikipedia. To achieve this, Apertium uses text encoding and decoding mechanisms and an interchange format called the Apertium stream format, whose input and output were implemented in the HFST corpus processing tools.

7.2 Voikko and HFST-Based Spell-Checker Formulation

The application of finite-state morphologies in spell-checking applications is also based on new developments in finite-state algorithms. The application framework including the HFST spell-checkers is the Voikko library, which provides spell-checkers for OpenOffice.org/LibreOffice, the GNOME desktop (via Enchant), Mac OS X (via SpellService) and the Mozilla application suite.

The finite-state formulation of a spell-checking system was developed based on previous research. This research has shown that a finite-state natural language description is usable as a spell-checker either with specialized fuzzy traversal algorithms [16,10] or by using a special (weighted) two-tape automaton as an error model and regular finite-state composition to map misspelled words to their possible corrections [21,20]. The basic finding is that typical finite-state language descriptions are usable as spell-checking dictionaries with minor to no modifications. Furthermore, it has been shown that existing non-finite-state spell-checking dictionaries from hunspell and myspell can be converted into finite-state form [19], providing full backwards compatibility with traditional spell-checking systems. We have also optimized the application of error models when suggesting corrections by applying a three-way composition of the dictionary, the error model and the misspelled word in one operation. This significantly reduces the space and time requirements by leaving many of the impossible intermediate results uncalculated.

7.3 Statistical Part-of-Speech Tagging Using Weighted Finite-State Transducers

HFST 3.0 has also been applied to part-of-speech tagging of the Finnish, Swedish and English Europarl corpora [25] as well as the Wall Street Journal corpus [26]. Silfverberg and Lindén implemented first-order Hidden Markov Models (HMMs) as sets of parallel weighted finite-state transducers using the HFST 3.0 tools. Like standard HMMs, their tagger used tag sequences; in addition, they included lemmas in their models. Part-of-speech tagging was accomplished by intersecting composition of a sentence automaton with the set of tagger transducers.
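The three-way composition of section 7.2 can be approximated with ordinary two-way compositions through the public API. In this hedged sketch, error_model and dictionary are assumed to be precompiled transducers, and the input word is turned into a path automaton with HfstTokenizer:

    #include <string>
    #include "HfstTransducer.h"
    #include "HfstTokenizer.h"

    using namespace hfst;

    // Suggest corrections for one misspelled word. Production spellers fuse
    // the two compositions into a single three-way operation; this sketch
    // only illustrates the data flow. Assumes a weighted (tropical) back-end
    // so that n_best can rank candidates.
    HfstTransducer suggest(const std::string& word,
                           const HfstTransducer& error_model,
                           const HfstTransducer& dictionary)
    {
      HfstTokenizer tokenizer;
      HfstTransducer input(word, tokenizer, error_model.get_type());
      input.compose(error_model);  // strings reachable from the input by edits
      input.compose(dictionary);   // keep only strings the dictionary accepts
      input.n_best(5);             // the five cheapest correction candidates
      return input;
    }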
8 Discussion

In parallel with HFST, there is the OMor project for creating and/or compiling open-source morphological analyzers for Finnish, Swedish, French, German and English. OMorFi [18] is a large-scale Finnish open-source morphological transducer lexicon based on the words of a dictionary with inflectional codes, as well as patterns for compounding and derivation. It is available both for SFST-PL and as a Xerox-style two-level morphology. The Divvun project in Norway has created two-level morphological analyzers for Northern and Lule Sámi using the Xerox tools. In addition, there are several projects that have developed SFST-PL lexicons for German (e.g., Morphisto [27]), Italian, Turkish (e.g., TRmorph), etc. Also, close to a hundred HFST spellers [19] with an improved error correction mechanism already exist, having been compiled from Hunspell sources and large corpora. The effort to collect existing morphological descriptions for various languages is ongoing.

A concrete outcome of the current HFST environment is that it is now possible to take a lexicon developed with, e.g., XFST or Hunspell and weight it with material from a specialized corpus in order to create a tailored speller for a given domain. This can all be accomplished in the course of an afternoon, after which the speller can be used in, e.g., OpenOffice. As future work, we are investigating how to extend the morphological analyzers into finite-state implementations of constraint grammars and other dependency-related tagger and grammar formalisms, using both statistical and rule-based approaches with weighted finite-state transducers.
9 Conclusion

In this article, we have described the structural layout of HFST–the Helsinki Finite-State Technology library, how it connects some existing finite-state libraries, and how it can accommodate additional libraries for finite-state algorithms and applications. Facilitating data exchange between finite-state implementations is important for processing and enhancing language descriptions created with different tools and formalisms. We focused on two finite-state programming language environments, the SFST Programming Language and the Xerox tools for creating morphological descriptions, and showed that the HFST toolkit can cover a wider range of morphologies and applications than the original finite-state environments. We have also demonstrated that the cross-usage of finite-state programming-language front-ends and finite-state library back-ends can provide significant reductions in processing time.

Acknowledgments. We would like to thank the contributors of the open-source community for their support and excellent test applications. Special thanks go to Sjur Moshagen of Divvun, Francis Tyers of Apertium, and Harri Pitkänen of Voikko. We are also grateful to FIN-CLARIN for making HFST possible.
References

1. Aho, A.V., Lam, M.S., Sethi, R., Ullman, J.D.: Compilers: Principles, Techniques, & Tools with Gradiance, 2nd edn. Addison-Wesley, Reading (2007)
2. Allauzen, C., Riley, M., Schalkwyk, J., Skut, W., Mohri, M.: OpenFst: A general and efficient weighted finite-state transducer library. In: Holub, J., Žd'árek, J. (eds.) CIAA 2007. LNCS, vol. 4783, pp. 11–23. Springer, Heidelberg (2007)
3. Apache Software Foundation: Apache License, Version 2.0
4. Beesley, K.R.: Constraining separated morphotactic dependencies in finite-state grammars. In: Karttunen, L., Oflazer, K. (eds.) Proceedings of the International Workshop on Finite State Methods in Natural Language Processing, pp. 118–127. Association for Computational Linguistics, Morristown (1998)
5. Beesley, K.R., Karttunen, L.: Finite State Morphology. CSLI Publications, Stanford (2003)
6. Brzozowski, J.A.: Derivatives of regular expressions. J. ACM 11, 481–494 (1964)
7. Free Software Foundation: GNU Lesser General Public License, Version 3
8. Garrido-Alenda, A., Forcada, M.L., Carrasco, R.C.: Incremental construction and maintenance of morphological analysers based on augmented letter transducers (2002)
9. Hopcroft, J.E.: An n log n algorithm for minimizing states in a finite automaton. Tech. rep., Stanford University, Stanford, CA, USA (1971)
10. Huldén, M.: Fast approximate string matching with finite automata. Procesamiento del Lenguaje Natural 43, 57–64 (2009)
11. Karttunen, L.: Constructing lexical transducers. In: Proceedings of the 15th International Conference on Computational Linguistics, Coling 1994, pp. 406–411. ACL, Morristown (1994)
12. Liang, F.M.: Word hyphenation by computer. Ph.D. thesis, Stanford University (1983)
13. Lindén, K., Silfverberg, M., Pirinen, T.: HFST tools for morphology—an efficient open-source package for construction of morphological analyzers. In: Mahlow, Piotrowski (eds.) [14], pp. 28–47
14. Mahlow, C., Piotrowski, M. (eds.): SFCM 2009. CCIS, vol. 41. Springer, Heidelberg (2009)
15. Proceedings of the 18th Nordic Conference of Computational Linguistics, Nodalida 2011, Riga, May 11-13 (2011)
16. Oflazer, K.: Error-tolerant finite-state recognition with applications to morphological analysis and spelling correction. Computational Linguistics 22(1), 73–89 (1996)
17. Pirinen, T.: Suomen kielen äärellistilainen automaattinen morfologinen analyysi avoimen lähdekoodin menetelmin. Master's thesis, Helsingin yliopisto (2008)
18. Pirinen, T.: Modularisation of Finnish finite-state language description–towards wide collaboration in open source development of a morphological analyser. In: Nodalida 2011 [15]
19. Pirinen, T.A., Lindén, K.: Building and using existing hunspell dictionaries and TeX hyphenators as finite-state automata. In: Proceedings of Computational Linguistics – Applications, Wisła, Poland, pp. 25–32 (2010)
20. Pirinen, T.A., Lindén, K.: Finite-state spell-checking with weighted language and error models. In: Proceedings of the Seventh SaLTMiL Workshop on Creation and Use of Basic Lexical Resources for Less-resourced Languages, Valletta, Malta, pp. 13–18 (2010)
21. Savary, A.: Typographical nearest-neighbor search in a finite-state lexicon and its application to spelling correction. In: Watson, B.W., Wood, D. (eds.) CIAA 2001. LNCS, vol. 2494, pp. 251–260. Springer, Heidelberg (2003)
22. Schmid, H.: A programming language for finite state transducers. In: Yli-Jyrä, A., Karttunen, L., Karhumäki, J. (eds.) FSMNLP 2005. LNCS (LNAI), vol. 4002, pp. 308–309. Springer, Heidelberg (2006)
23. Silfverberg, M., Lindén, K.: Conflict resolution using weighted rules in HFST-TWOLC. In: Proceedings of the 17th Nordic Conference of Computational Linguistics, Nodalida 2009, NEALT, pp. 174–181 (2009)
24. Silfverberg, M., Lindén, K.: HFST runtime format—a compacted transducer format allowing for fast lookup. In: Watson, B., Courie, D., Cleophas, L., Rautenbach, P. (eds.) FSMNLP (July 13, 2009)
25. Silfverberg, M., Lindén, K.: Part-of-speech tagging using parallel weighted finite-state transducers. In: Loftsson, H., Rögnvaldsson, E., Helgadóttir, S. (eds.) IceTAL 2010. LNCS, vol. 6233, pp. 369–380. Springer, Heidelberg (2010)
26. Silfverberg, M., Lindén, K.: Combining statistical models for POS tagging using finite-state calculus. In: Nodalida 2011 [15]
27. Zielinski, A., Simon, C.: Morphisto: Service-oriented open source morphology for German. In: Mahlow, Piotrowski (eds.) [14], pp. 64–75
Morphology to the Rescue Redux: Resolving Borrowings and Code-Mixing in Machine Translation

Esmé Manandise¹ and Claudia Gdaniec²

¹ IBM Thomas J. Watson Research Center, Yorktown Heights, New York, NY 10598, USA
² South Westphalia University of Applied Sciences, 59494 Soest, Germany
Abstract. In the IBM LMT machine translation system, derivational morphological rules recognize and analyze words that are not found in its source lexicons, and generate default transfers for these unlisted words. Unfound words with no inflectional or derivational affixes are by default nouns. These rules are now expanded to provide lexical coverage of a particular set of words created on the fly in emails by bilingual Spanish-English speakers. What characterizes the approach is the generation of additional default parts of speech, and the use of morphological, semantic, and syntactic features from both source and target lexicons for analysis and transfer. A built-in rule-based strategy to handle language borrowing and code-mixing allows for the recognition of words with variable and unpredictable frequency of occurrence, which would remain otherwise unfound, thus affecting the accuracy of parsing and the quality of translation output. Keywords: Unfound words, rule-based morphology, derivational morphology, parsing, code-mixing, code-switching, borrowing, scoring, unsupervised email machine translation, languages in contact, Spanish-English.
1 Motivation

Using IBM's WebSphere® Translation Server software, nonprofit organizations and schools (teachers, administrators, parents, and students) access ¡TradúceloAhora! (TranslateNow!) to translate unedited emails bi-directionally (English-to-Spanish and Spanish-to-English). The main objective of IBM's ¡TradúceloAhora! email system is to offer Spanish speakers with limited or no English skills the possibility to communicate with teachers and school administrators. The ¡TradúceloAhora! email system does not impose content or quality restrictions on emails to be submitted for translation. Neither does it require that the emails be exclusively in one language in order to ensure translation quality. In this highly unconstrained context, many bilingual speakers communicate just about anything, mixing, intentionally or unintentionally, both Spanish and English in their emails. Users of ¡TradúceloAhora! expect the output to be intelligible to recipients regardless of the degree of language mixing and the overall grammatical quality of the input email.
2 Problem

Most emails are written on the fly. Intentional and unintentional liberties are taken with spelling, accent omission, punctuation, abbreviations, terminology, syntax, and visual expressivity (strings of repeated punctuation signs and symbols). These non-standard uses of language present interesting challenges to automatic MT systems, which perform reasonably well when the input is composed along prescriptive grammatical standards. In the context of two languages in contact, regardless of the level of proficiency in either of the two languages [1], bilingual email users present the unsupervised machine translation of emails with three additional special phenomena,1 as shown in Table 1.

Table 1. Language borrowing, code-mixing, and code-switching

Phenomenon       Example                                                          English Translation
Borrowing        Fui a startear el carro                                          I went to start the car
Code-mixing      Tengo que babysit my sister                                      I have to babysit my sister
Code-switching   Estaba leyendo un libro. And suddenly he just got up and left    He was reading a book. And suddenly he just got up and left
These three phenomena are for the most part spontaneous. Factors like fatigue, short-term memory, playfulness, linguistic ignorance, personal preferences, shared experience with email addressee(s), or domain-specificity of the email content encourage bilingual email users to switch back and forth from one language to the other, and to borrow words from one language and adapt them to the other language. Unless a borrowing is well established in a bilingual community, it is difficult to predict which words bilingual email users will borrow as a whole and which words they will adapt to one language according to its word-formation rules. In addition to the adaptation to word-formation rules, writers also adapt the spelling to their native orthography. The new Spanish noun for English "roofer", for example, can be spelled "roofero" (with 22,400 occurrences in Google on May 10th, 2011) or "rufero" (with 18,300 occurrences in Google on the same date). At this point in the project, we are not dealing with such spelling variations.

It is difficult to predict when and where bilingual users will switch from one language to the other [2]. As the following Spanish email example shows, there is a lot of unpredictable variability:

(1) Fue a startear el carro, pero no arranco. (He went to start the car, but it did not start.)

1 (1) Borrowing (adopting a word from one language and adapting it to the morphological rules of the other language); (2) code-mixing or intrasentential code-switching (switching from one language to the other within sentence boundaries); and (3) intersentential code-switching (switching from one language to the other between sentence boundaries).
Large formal written corpora like Europarl [3] show that borrowing, code-mixing, and code-switching happen; they provide us with token examples, but the incidence of such phenomena is low. However, in the context of bilingual speakers communicating via email, it is our experience that language interference is frequent.
3 Rationale

In the LMT MT system [4, 5, 6], source-language processes and rules of word formation analyze words that are not found in the source-language lexicon. They assign morpho-syntactic, semantic, and syntactic features; in the transformational component, rules generate target words [7, 8]. Unfound words are derived from one of the following:

– A lexical base listed in the source lexicon
– Morphologically related words listed in the source lexicon
– A dummy base

The morphological rules applied to unfound words are those of the source language, without reference to the target language. Our linguistic intuition tells us that the phenomena of language in contact are not unruly. In particular, borrowings show evidence of productive word-formation processes with affixation. The principle of compositionality assumed for both the analysis and transfer of unfound words in a monolingual setup can be extended to cover unfound words in a bilingual setup. In this context, source-language word-formation rules together with source and target lexicons provide semantic and morpho-syntactic information to help resolve unfound words. In order to avoid re-analyses of the same unfound words in subsequent encounters [9, 10], the originally unfound words are written to a lexical addendum in the relevant lexical format with the morphological, syntactic, and semantic information relevant to LMT [8]. Our strategy, which in effect exploits an existing strategy for unfound words in a monolingual setup, is being implemented for the Spanish-English language pair.2 We present here our preliminary conclusions.
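The write-back to the lexical addendum amounts to a small cache in front of the analyzer. The following Python sketch is purely illustrative — the function names and the addendum representation are our own inventions, not LMT's actual formats:

```python
# Illustrative sketch (not LMT code): once an unfound word has been resolved,
# its analysis is written to an addendum so later occurrences skip re-analysis.

addendum = {}  # hypothetical addendum: surface form -> stored analysis

def analyze_unfound(word, resolve):
    """Return the stored analysis if present; otherwise resolve and store it."""
    if word in addendum:
        return addendum[word]      # subsequent encounter: no re-analysis
    analysis = resolve(word)       # full morphological resolution
    addendum[word] = analysis      # write-back to the lexical addendum
    return analysis

# Toy resolver that tags everything as a borrowed infinitive:
resolver = lambda w: (w, "verb", "vinf", "borrowed/notfnd")
print(analyze_unfound("startear", resolver))   # resolved and stored
print(analyze_unfound("startear", resolver))   # served from the addendum
```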
4 Data Collection

To the best of our knowledge, there exists no corpus in the public domain with a significant number of representative segments of Spanish-English borrowings, code-mixing, and code-switching. Any such corpus, similar in size to the Europarl corpus [3], would allow more detailed linguistic or corpus-based analyses of these phenomena. In 2008, a data set of 8,000 words (with 1,516 different word forms) fed the development of a part-of-speech tagger [11].

2 In response to inquiries by users of ¡TradúceloAhora! about the use of what is commonly called Spanglish, a pilot version using a small set of Spanish affixes was tested in June 2007.
The data we use for the analysis, testing, and evaluation comes from two sources: emails submitted to ¡TradúceloAhora! and texts collected from public chat web sites originating in the United States (about 141,200 words). Our rule-based approach to the language-in-contact phenomena, in particular borrowings, benefits from linguistic generalizations made on the basis of data found in our corpus. By assuming similarities between the morphological behavior of affixes attested on borrowings in the corpus and that of other source-language affixes not found in the corpus, we expand the scope of the analysis and increase the coverage of borrowings.
5 Related Work

Of the three language-in-contact phenomena, code-switching has been the favorite field of inquiry in Natural Language Processing (NLP) [13, 14, 15]. Interest in code-mixing is rising, and novel NLP strategies are being implemented [12, 16]. However, NLP, and MT in particular, have paid virtually no attention to the language-in-contact phenomenon of borrowing.
6 Brief Description of the LMT Morphological Analyzer

The LMT Morphological Analyzer (MA) is a non-deterministic analyzer written in C [7, 8]. It consists of three steps:

– Affix stripping and base spelling adjustments/stem changes
– Lexical lookup
– Affix operations

In step 1, the input word is subjected to language-specific inflectional and derivational operations. The output of the analysis consists of a list of word structures made up of possible base words and affix lists. Regardless of the type of affix, the analyzer attempts to match substrings of the input word against context-sensitive rules. Several different substrings can be isolated in a word in an iterative process.

In step 2, the possible base words are looked up in the lexicons. If found, they are returned to be processed in the third step.

In step 3, rules apply to the affixes to determine whether the combination of a base word and the affixes yields a valid word of the language. For inflectional affixes, the operations assign morpho-syntactic features. For derivational affixes, they can assign a part of speech (POS), morpho-syntactic and semantic features, as well as syntactic arguments. The rules create a bracketed word structure, which is passed on for later transfer analysis in the LMT transformational component.
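As a rough picture of this control flow, the toy Python sketch below walks the three steps for tranquilamente; the lexicon contents, rule table, and spelling adjustment are invented for the example and bear no relation to LMT's actual C data structures:

```python
# Toy three-step analysis: affix stripping (with a spelling adjustment),
# lexical lookup, and affix operations. All data here is invented.

SOURCE_LEXICON = {"tranquilo": {"pos": "adj"}}
SUFFIX_RULES = {"mente": {"pos": "adv", "feature": "manner"}}

def strip_affixes(word):
    """Step 1: propose (base, suffixes) splits against known suffixes."""
    candidates = [(word, [])]                       # the word itself as a base
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix):
            base = word[: -len(suffix)]
            candidates.append((base, [suffix]))
            if base.endswith("a"):                  # toy spelling adjustment
                candidates.append((base[:-1] + "o", [suffix]))
    return candidates

def analyze(word):
    results = []
    for base, suffixes in strip_affixes(word):
        entry = SOURCE_LEXICON.get(base)            # step 2: lexical lookup
        if entry is None:
            continue
        # Step 3: affix operations yield a bracketed structure and a POS.
        pos = SUFFIX_RULES[suffixes[-1]]["pos"] if suffixes else entry["pos"]
        results.append(([base] + suffixes, pos))
    return results

print(analyze("tranquilamente"))    # [(['tranquilo', 'mente'], 'adv')]
```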
7 Analysis and Transfer of Borrowings

7.1 Principles and Definitions

Our approach assumes the principle of compositionality for both the analysis and the transfer of borrowings. Borrowings consist of source-language affixes applied to a lexical base not listed in the source-language lexicon.
7.2 Goals

The analysis of borrowings must

– Isolate a base
– Isolate affixes
– Create complex, bracketed word structures
– Determine a part of speech together with morpho-syntactic features and, where possible, semantic and syntactic features
– Flag the word as borrowed

The transfer of borrowings must

– Create a target-consistent string or subtree
– Integrate the transfer and its modifiers correctly into the target tree
8 Borrowings

Derivational and inflectional morphological rules apply to borrowings. As mentioned earlier, these words consist of known source-language affixes applied to bases listed in the target-language lexicon rather than the source-language lexicon. However, the analysis of borrowings is the same as that of derived words with bases listed in the source-language lexicon. Consider the words tranquilamente and startear, where tranquilamente is derived from a base listed in the source-language lexicon and startear from a base listed in the target-language lexicon. The words are analyzed with the following structures:

(2) tranquilamente ([tranquilo + mente] adverb suffixed manner)
(3) startear ([start + ear] verb infinitive suffixed (penalty 10) borrowed/notfnd)

Since the success of an analysis depends on a successful match in a lexicon, step 2 of the morphological analysis checks the target-language lexicon for a base when no match has been made in the source-language lexicon. Step 2 of the morphological analysis also returns a dummy base from the source lexicon in some or all cases of derivational affixes. The analysis based on a dummy base is made available because the word not listed in the source lexicon may, in fact, be borrowed from a language other than the target language; it might be misspelled, be a newly created source word, or be a legitimate source word simply omitted from the source-language lexicon. Which POS is returned depends in part on the affix(es). Of course, there are affixes that are POS-ambiguous. In step 3, there are restrictions and penalties on borrowed and dummy analyses to make sure they do not win over real derivations. Semantic and morpho-syntactic features may be added. One of the purposes in checking the target-language lexicon and in using a dummy base is to improve the parse when the POS of the unlisted word would be by default that of a noun, when in fact it is not.

The target generation consists either in replacing the borrowings and dummies with the original word but without its source-language affix(es) or in using word-generation rules in the transformation component.
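The lookup cascade with its penalties can be summarised in a few lines of Python. The lexicon entries are invented, and the penalty values 10 and 20 echo the borrowed and dummy analyses shown in this paper; the code is only a sketch of the idea, not the implementation:

```python
# Sketch of the extended step-2 lookup: source lexicon first, then the
# target lexicon (borrowing), then a dummy base. Penalties keep borrowed
# and dummy analyses from winning over real source-language derivations.

SOURCE_LEXICON = {"tranquilo": {"pos": "adj"}}
TARGET_LEXICON = {"start": {"pos": "verb"}}

def lookup(base):
    if base in SOURCE_LEXICON:
        return dict(SOURCE_LEXICON[base], origin="source", penalty=0)
    if base in TARGET_LEXICON:                          # borrowed base
        return dict(TARGET_LEXICON[base], origin="borrowed", penalty=10)
    return {"pos": "unknown", "origin": "dummy", "penalty": 20}

for base in ("tranquilo", "start", "xyzzy"):
    print(base, "->", lookup(base))
```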
8.1 Examples of Borrowings

Borrowed verbs, for instance, are made into regular Spanish verbs with the ending -ear (and less frequently with -ar). This is a simple but common type of borrowing. Assume the borrowed verb startear (to start). Step 1 returns the following bases and known affixes:

(4) startear; starte + -ar; start + -ear
Step 2 does not find startear, starte, or start in the source lexicon. Now the target lexicon is checked. start is found and the base lexical information is retrieved. After step 3 applies affix operations (penalties, rewards, semantic and morpho-syntactic features) on the base start, the following word structure is returned:

(5) startear ([start + ear] verb infinitive suffixed (penalty 10) borrowed/notfnd)
Let us consider the more complex made-up word anticooldad (anticoolness). Step 1 returns the prefix anti- and the suffix -dad, and the following bases with known affixes:

(6) anticooldad; anticool + -dad; anti- + cooldad; anti- + cool + -dad
Each of the isolated bases (anticooldad, anticool, cooldad, and cool) will be looked up in the source lexicon. In addition, anticooldad, anticool, cooldad, and cool are looked up in the target lexicon. Only cool is found. Step 3 applies affix operations on the target base cool and generates the following structure:

(7) anticooldad ([anti + cool + dad] noun prefixed suffixed borrowed/notfnd)
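The enumeration of candidate bases in (6) is easy to picture in code. The sketch below reduces the affix inventory to the two affixes of this example and is only meant to illustrate the iterative substring isolation:

```python
# Iteratively strip a known prefix and suffix to collect candidate bases
# for "anticooldad" (toy inventory: just the affixes of example (6)).

PREFIXES = ["anti"]
SUFFIXES = ["dad"]

def candidate_bases(word):
    bases = {word}
    for prefix in PREFIXES:
        if word.startswith(prefix):
            bases.add(word[len(prefix):])        # cooldad
    for suffix in SUFFIXES:
        for base in list(bases):
            if base.endswith(suffix):
                bases.add(base[: -len(suffix)])  # anticool, cool
    return bases

print(sorted(candidate_bases("anticooldad")))
# ['anticool', 'anticooldad', 'cool', 'cooldad']
```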
Noun formation rules in the LMT transformation component can now be applied on the output. Instead of outputting the input word, language-specific rules manipulate the input word to generate a more appropriate output. Startear is output as start and anticooldad as anti-coolness.

8.2 Impact

To appreciate the impact of this type of morphological analysis, compare the parses (9) and (10) below. Without appropriate derivational analysis of the borrowed verb, the parse for input sentence (8) is incomplete because startear is unfound and is treated by default as a noun. The analysis of startear with the lookup in the target-language lexicon and with known source-language affixes completes the parse. Among other things, parse completion permits target handling of pronoun referents, as shown in (10), where the pronoun it refers correctly to car.
(8) Miguel fue a startear su carro, pero no arrancó.

(9) Parse without borrowing strategy

Incomplete parse:
Syntactic analysis no. 1   Evaluation = 0.750000...
----------------------------------------------------------------------
o----- top incomplete(10) incomplete
| .--- subj(n) miguel1(1) noun propn sg m h
'-+--- u ir1(2,1,u,3) verb vfin vpast vsg vpers3 vind vsubj
|  '--- comp(p) a1(3,4) prep pprefv motionp (motionp pprefv)
|    '- objprep(n) startear(4) noun propn sg (notfnd)
| .--- ndet su1(5) det sg m possdet ingdet
'----- u carro1(6) noun cn sg m (st_vehicle m)
'----- u pero1(7) conj
| .--- vadv no1(8) adv ppadv neg
'----- u arrancar1(9,u,u) verb vfin vpast vsg vpers3 vind
----------------------------------------------------------------------
Miguel went to startear his car but he did not start.
(10) Parse with borrowing strategy

Syntactic analysis no. 1   Evaluation = 17.531110...
----------------------------------------------------------------------
.------ subj(n) miguel1(1) noun propn sg m h
.-+------- lconj ir1(2,1,3,u) verb vfin vpast vsg vpers3
| '------- comp(pinf) a1(3,4) prep pprefv infobj motionp
|   '----- objprep(binf) (sfx startear)(4,u,6) verb vinf (borrowed/notfnd)
| | .- ndet su1(5) det sg m possdet
| '--- obj(n) carro1(6) noun cn sg m (st_vehicle)
o--------- top pero1(7) verb vfin vpres vpast vsg
| .------- vadv no1(8) adv ppadv neg
'--------- rconj arrancar1(9,u,u) verb vfin vpres vpast vsg
----------------------------------------------------------------------
Miguel went to start his car but it did not start.
9 Code-Mixing as Underived Words

In addition to borrowings consisting of target-language words with source-language affixation, target words can find their way into a source language in two ways:

– Without any derivational and/or inflectional source or target language affixation, e.g., babysit, cool
– With derivational and/or inflectional target language affixation, e.g., trucks, candies, easily

At this stage of our development, code-mixing is analyzed as a case of zero morphology. In step 1, the morphological analyzer analyzes an input word and returns a list of word structures consisting of possible bases and affixes. It always considers the input word a possible un-inflected base and lists it at the top of the list of possible bases. A dummy is also created.
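Schematically, the candidate list of step 1 can be thought of as below; this is a toy rendering of the ordering just described (input word first, dummy last), not actual analyzer code:

```python
# Step 1 for code-mixing as zero morphology: the input word itself heads the
# candidate list, and a dummy base is always appended as a fallback.

def candidate_structures(word, affix_splits=()):
    structures = [(word, [])]           # whole word as un-inflected base
    structures.extend(affix_splits)     # any (base, affixes) splits found
    structures.append(("DUMMY", []))    # fallback for fully unknown words
    return structures

print(candidate_structures("babysit"))
# [('babysit', []), ('DUMMY', [])]
```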
Step 2 attempts to match an input word as a possible un-inflected base in the source lexicon. If there is no match, the target lexicon is checked for a possible candidate. If the match is successful, the target analysis is retrieved. Step 3 of MA assigns morpho-syntactic and semantic features to the base. As with borrowings, a dummy base associated with various POS is created to allow for input words not found in either source or target lexicons.

9.1 Code-Mixing Examples

Consider the following sentence:

(11) Nan tiene que babysit a su hermanita en la noche.

There are two unfound words in (11), namely, Nan and babysit. After a failed attempt at matching these two words in the Spanish source lexicon, step 2 of MA checks the target lexicon. Nan is not found, but babysit is, and the verb analysis is carried over. Treating code-mixing as a special case of derivational morphology of zero-affixation, where the word has a base in the target but not in the source lexicon, also greatly improves parsing. With a correct parse, transformation rules can then be applied to create an appropriate output. Compare the parses (12) and (13) of sentence (11) and their respective output:

(12) Parse without code-mixing strategy

Syntactic analysis no. 1   Evaluation = 0.870000...
------------------------------------------------------------------------
o------- top incomplete(11) incomplete
'------- u Nan(1) noun propn sg (notfnd)
'------- u =tener que1(2,4,u) verb vfin vpres vsg vpers3 vind
'----- subj(n) babysit(4) noun propn sg (notfnd)
'----- vprep a1(5,7) prep pprefv motionp
| | .- ndet su1(6) det sg m f possdet
| '----- objprep(n) (sfx hermano ita)(7) noun cn sg f h (diminutive)
'----- vprep en1(8,10) prep timepp
| .- ndet la1(9) det sg f def (f def)
'--- objprep(n) noche1(10) noun cn sg advnoun tm f (tm f)
------------------------------------------------------------------------
Nan babysit has to to his little sister in the evening.
(13) Parse with code-mixing strategy

Syntactic analysis no. 1   Evaluation = 3.722200...
------------------------------------------------------------------------
.------- subj(n) Nan(1,u,u) noun sg pl m f (notfnd)
o------- top =tener que1(2,1,4) verb
'------- auxcomp(binf) babysit(4,1,5) verb vinf (code-mixing)
'----- obj(a) a1(5,7) prep pprefv motionp
| | .- ndet su1(6) det sg m f possdet
| '--- objprep(n) (sfx hermano ita)(7) noun cn sg f h (diminutive)
'----- vprep en1(8,10) prep timepp
| .- ndet la1(9) det sg f def (f def)
'--- objprep(n) noche1(10) noun cn sg advnoun tm f (tm f)
------------------------------------------------------------------------
Nan has to babysit his little sister in the evening.
9.2 Code-Mixing as an Extension of the Dummy Strategy

In code-mixing, the word from the target language may be affixed or inflected, as the example in (14) shows.

(14) Le pediste candies a mi hermano. (You asked my brother for candy.)

Step 1 returns the possible bases and affixes for candies:

(15) candies; candie + -s; candi + -es; dummy
In step 2 of MA, the lexical lookup will not find candies as a base either in the source or in the target lexicon. However, among the bases returned by step 1 of MA, there will be a dummy base, which is used for unfound words. Step 2 retrieves all the pertinent information (parts of speech, semantic and morpho-syntactic features) associated with the dummy base for the unlisted word. Two of the word structures that step 3 returns for candies are as follows:

(16) candies ([dummy] noun m f sgpl (penalty 10) notfnd/dummy)
(17) candies ([dummy] verb infinitive (penalty 20) notfnd/dummy)

The transformational rules will output the original input word as transfer. The resulting parse for sentence (14) above with its corresponding output is as follows:

(18) Parse with code-mixing strategy

Syntactic analysis no. 1   Evaluation = 0.811000...
-----------------------------------------------------------------------
.----- vdat le1(1) noun pron sg pers3 dat m f h
o----- top pedir1(2,u,3,4,u) verb vfin vpast vsg vpers2 vind
'----- obj(n) candies(3,u,u) noun sgpl m f (notfnd/dummy)
'----- iobj(p) a1(4,6) prep pprefv motionp
| .- ndet mi1(5) det sg m possdet
'--- objprep(n) hermano1(6) noun cn sg m h
-----------------------------------------------------------------------
You asked my brother for candies.
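The two dummy word structures in (16) and (17) can be rendered schematically as follows; the feature lists and penalty values are taken from the examples above, while everything else is an invented illustration:

```python
# Dummy-base analyses for an unfound word: one structure per candidate POS,
# each carrying its own penalty (values as in (16) and (17)).

def dummy_analyses(word):
    return [
        {"word": word, "base": "dummy", "pos": "noun",
         "features": ["m", "f", "sgpl"], "penalty": 10},
        {"word": word, "base": "dummy", "pos": "verb",
         "features": ["infinitive"], "penalty": 20},
    ]

# The parser weighs these penalties during syntactic analysis; the transfer
# simply echoes the original input word ("candies").
for analysis in dummy_analyses("candies"):
    print(analysis)
```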
10 Results

To appreciate the effectiveness of the new strategy, we conducted a simple evaluation. We randomly selected 100 segments (a total of 890 words) from the original corpus.3

3 The original corpus is routinely used to monitor improvements and degradations in response to development. Also, new segments are added to it.
Table 2. Output Evaluation (69 segments)

Evaluators   Better   Similar   Worse
E1           41       23        5
E2           37       24        8

Table 3. Parse Evaluation (69 segments)

Evaluator    Better   Similar   Worse
E3           40       20        9
We translated the segments using the Spanish-English system version with the new strategy and the system version without the mechanisms for handling borrowings and simple code-mixing. Two monolingual English users (E1, E2) rated the quality of the new English translated output as better (+), similar (=), or worse (−) than the previous output. In addition, a developer (E3) compared the well-formedness of the parses generated by the new and old system versions. Of the 100 input segments, 31 had identical translations in the two versions and were eliminated from the output evaluation set.
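For readers who want the tables as percentages, the following snippet recomputes the shares from Tables 2 and 3 (the counts are those reported above; each evaluator rated the 69 non-identical segments):

```python
# Shares of better/similar/worse ratings over the 69 evaluated segments.

ratings = {"E1": (41, 23, 5), "E2": (37, 24, 8), "E3": (40, 20, 9)}

for evaluator, (better, similar, worse) in ratings.items():
    total = better + similar + worse            # 69 for each evaluator
    print(f"{evaluator}: better {better/total:.0%}, "
          f"similar {similar/total:.0%}, worse {worse/total:.0%}")
```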
11 Observations

The percentage of improvements versus degradations is a crude indicator of system quality [17]; however, it suggests that development is heading in a promising direction. The advantages of the new strategy outweigh its shortcomings, as indicated in the lists below.

Advantages

1. Better coverage of unfound words
2. Coverage of dynamic word creation
3. Coverage of mixed-language words
4. Improved parses
5. Improved, often acceptable, output
6. Extension of an existing strategy (monolingual setup) to cover language-in-contact phenomena

Disadvantages

1. Less efficient processing in time and space
2. Possible overgeneration of word structures
3. With additional word structures and POS for the dummy base, additional syntactic ambiguity for the parser
4. Strong constraints needed to rule out analyses through the use of penalties and rewards for both the morphological and syntactic analyses
5. Invalidation of dummy analyses whenever bases are listed in the lexicon
6. Unfound words that would be nouns in the right analysis (the default POS in LMT for all Spanish unfound words used to be a noun) could also be proposed as another POS
7. Transfers of borrowings and code-mixing may not be ideal
12 Future Development

With code-mixing, the words of the target language that are used in the source language may be inflected or have derivational affixation with or without inflection. At the moment, such words are treated as cases of zero affixation. However, target affixation may be crucial for POS disambiguation and, ultimately, for syntactic analysis and transfer assignment. The potential of such target affixation is being addressed and is the focus of future work.
13 Conclusion

We have described how we exploit an existing strategy for handling unfound words in a monolingual source setup to analyze words created from target-language words with and without source-language affixation. We described how semantic and morpho-syntactic features are assigned to these unfound words, and how transfers are generated. Derivational morphological analyses of borrowings and simple code-mixing improve parses and translation because every resolved part of speech and every semantic and morpho-syntactic feature contribute to improving the translation of a sentence or segment. A rule-based approach to borrowings and to code-mixing benefits from linguistic generalizations about the individual languages that come into contact in texts (in our case, in emails), and from making both source and target lexicons available for lookup. The great variability and unpredictable frequency of occurrence of specific borrowings and code-mixing instances make corpus-based analyses of these two phenomena difficult. A rule-based approach to borrowings and code-mixing, in contrast, can rely on existing rules that are needed by the morphological analyzer independently of these two phenomena.

Acknowledgments. We would like to thank Dr. Michael McCord for his comments and suggestions on the topic discussed in this paper.
References

1. Gardner-Chloros, P., Edwards, M.: Assumptions behind grammatical approaches to code-switching: when the blueprint is a red herring. Transactions of the Philological Society 102(1), 103–129 (2004)
2. Solorio, T., Liu, Y.: Learning to Predict Code-switching Points. In: Proceedings of Empirical Methods on Natural Language Processing, pp. 973–981 (2008)
3. Koehn, P.: Europarl: A Parallel Corpus for Statistical Machine Translation. In: Proceedings of MT Summit X, Phuket, Thailand (2005)
4. McCord, M., Wolff, S.: The Lexicon and Morphology for LMT. IBM Research Division Research Report, RC 13403 (1988)
5. McCord, M.C., Bernth, A.: The LMT Transformational System. In: Machine Translation and the Information Soup: Proceedings of the 3rd AMTA Conference, pp. 344–354. Springer, Heidelberg (1998)
6. McCord, M.C.: Slot Grammar: A system for simple construction of practical natural language grammars. In: Studer, R. (ed.) Natural Language and Logic: International Scientific Symposium, pp. 118–145. Springer, Berlin (1990)
7. Gdaniec, C., Manandise, E., McCord, M.: Derivational Morphology to the Rescue: How It Can Help Resolve Unfound Words in MT. In: Hutchins, J. (ed.) Proceedings of MT Summit VIII, Santiago (2001); CD edn.
8. Gdaniec, C., Manandise, E.: Using Word Formation Rules to Extend MT Lexicons. In: Richardson, S.D. (ed.) AMTA 2002. LNCS (LNAI), vol. 2499, pp. 64–73. Springer, Heidelberg (2002)
9. Cartoni, B.: Lexical Morphology in Machine Translation: a Feasibility Study. In: Proceedings of the 12th Conference of the European Chapter of the ACL, pp. 130–138 (2009)
10. Adler, M., Goldberg, Y., Gabay, D., Elhadad, M.: Unsupervised Lexicon-Based Resolution of Unknown Words for Full Morphological Analysis. In: Proceedings of ACL 2008: HLT, pp. 728–736 (2008)
11. Solorio, T., Liu, Y., Medina, B.: Part-of-speech Tagging English-Spanish Code-switched Text. In: Proceedings of Empirical Methods on Natural Language Processing (2008)
12. Franco, J.C., Solorio, T.: Baby-Steps towards Building a Spanglish Language Model. In: Gelbukh, A. (ed.) CICLing 2007. LNCS, vol. 4394, pp. 75–84. Springer, Heidelberg (2007)
13. Goyal, P., Mital, M.R., Mukerjee, A., Raina, A.M., Sharma, D., Vikram, K.: Saarthaka: A Bilingual Parser for Hindi, English and Code-Switching Structures. In: Proceedings of EACL 2003, European Chapter of the Association for Computational Linguistics, Budapest, pp. 15–24 (2003)
14. Joshi, A.: Processing of Sentences with Intra-sentential Code-switching. In: Horecky, J. (ed.) COLING 1982. North-Holland Publishing Company / Academia, Amsterdam (1982)
15. Sinha, R.M.K., Thakur, A.: Machine Translation of Bi-lingual Hindi-English (Hinglish) Text. In: Proceedings of the 10th Conference on Machine Translation, Phuket, Thailand, pp. 149–156 (2005)
16. Alex, B., Dubey, A., Keller, F.: Using Foreign Inclusion Detection to Improve Parsing Performance. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 151–160 (2007)
17. Rinsche, A.: Towards a MT Evaluation Methodology. In: Proceedings of the Fifth International Conference on Theoretical and Methodological Issues in Machine Translation, Kyoto, Japan, July 14-16 (1993)
A Lexical Database for Modern Standard Arabic Interoperable with a Finite State Morphological Transducer

Mohammed Attia, Pavel Pecina, Antonio Toral, Lamia Tounsi, and Josef van Genabith

School of Computing, Dublin City University, Dublin, Ireland
Abstract. Current Arabic lexicons, whether computational or otherwise, make no distinction between entries from Modern Standard Arabic (MSA) and Classical Arabic (CA), and tend to include obsolete words that are not attested in current usage. We address this problem by building a large-scale, corpus-based lexical database that is representative of MSA. We use an MSA corpus of 1,089,111,204 words, a pre-annotation tool, machine learning techniques, and knowledge-based templatic matching to automatically acquire and filter lexical knowledge about morpho-syntactic attributes and inflection paradigms. Our lexical database is scalable, interoperable and suitable for constructing a morphological analyser, regardless of the design approach and programming language used. The database is formatted according to the international ISO standard in lexical resource representation, the Lexical Markup Framework (LMF). This lexical database is used in developing an open-source finite-state morphological processing toolkit. We build a web application, AraComLex (Arabic Computer Lexicon), for managing and curating the lexical database.

Keywords: Arabic Lexical Database, Modern Standard Arabic, Arabic morphology, Arabic Morphological Transducer.
1 Introduction

Lexical resources are essential in most Natural Language Processing (NLP) applications such as text summarisation, classification, indexing, information extraction, information retrieval, machine-aided translation and machine translation. A lexicon is a core component of any morphological analyser [1,2,3,4]. The quality and coverage of the lexical database determines the quality and coverage of the morphological analyser, and limitations in the lexicon will cascade through to higher levels of processing. A lexical database intended for NLP purposes differs from traditional dictionaries in that information on inflection and derivation in the former needs to be represented in a formal and fully explicit way. Existing Arabic dictionaries are not corpus-based (as in a COBUILD approach [5]), but rather reflect historical and prescriptive perspectives, making no distinction between
entries from Modern Standard Arabic (MSA) and Classical Arabic (CA). Therefore, they tend to include obsolete words that have no place in current usage. Current computational resources, such as the Buckwalter Arabic Morphological Analyzer (BAMA) [3] and its successor, the Standard Arabic Morphological Analyzer (SAMA) [6], have inherited this drawback from older dictionaries. For example, SAMA contains several thousand entries that are hardly ever encountered by modern Arabic speakers, such as qalaṭ 'to stain', qalfaṭ 'to spoil', istakadda 'to exhaust', and ġamlaǧ 'unstable'. As a consequence, morphological analyses for MSA texts contain many "spurious" interpretations that increase the ambiguity level and complicate text processing.

We address this problem at the lexicographic and computational levels by deriving a specialised MSA lexical resource and generating a Finite State Technology (FST) morphological transducer based on that resource. We start with a manually crafted small MSA seed lexical resource [2] which we take as a model. We extend our lexical resource using SAMA's database. We use web search queries and statistics from a large automatically annotated MSA corpus (containing 1,089,111,204 words) as two separate filters to determine which lexical information in SAMA is truly representative of MSA and which is CA. Words attested in the MSA data are included in the lexicon, while the others are filtered out.

The filtering stage results in a raw MSA lexical resource that does not contain all the information we need in order to build a complete computational lexicon. For example, our small, hand-crafted seed lexicon [2] includes information about the continuation classes (or inflection paradigms) and humanness for nominal entries, and transitivity and passive/imperative transformations for verbs. This information, however, is missing in the SAMA entries. To solve this problem we use machine learning techniques to help add the new features.

We develop a web application, AraComLex, for curating MSA lexical information. AraComLex provides an interface between the human lexicographer and the lexical database, and provides facilities for editing, maintaining and extending the list of entries. AraComLex complies with the LMF standard in naming convention and hierarchical structure. AraComLex is also used to automatically extend the Arabic FST morphological transducer. We test our morphological transducer for coverage and number of analyses per word, and compare the results to the older version of the transducer as well as to SAMA.

This paper is structured as follows. In the introduction we describe the motivation behind our work. We differentiate between MSA, the focus of this research, and CA, which is a historical version of the language. We give a brief history of Arabic lexicography and describe how outdated words are still abundant in current dictionaries. Then we outline the Arabic morphological system to show what layers and tiers are involved in word derivation and inflection. In Section 2, we explain the corpus used in the construction of our lexical database and the standards and technologies employed, mainly Lexical Markup Framework (LMF) and FST. In Section 3, we describe AraComLex, a web application we built for curating our lexical resource. We present the results obtained so far in building and extending the lexical database using a data-driven
filtering method and machine learning techniques. We outline how the lexical database is used in creating an open-source FST morphological analyser and evaluate the results. In Section 4, we point to future work, and finally, Section 5 gives the conclusion.

1.1 Modern Standard Arabic vs. Classical Arabic

Modern Standard Arabic (MSA), the subject of our research, is the language of modern writing, prepared speeches, and the language of the news. It is the language universally understood by Arabic speakers around the world. MSA stands in contrast to both Classical Arabic (CA) and vernacular Arabic dialects. CA is the language which appeared in the Arabian Peninsula centuries before the emergence of Islam and continued to be the standard language until medieval times. CA continues to the present day as the language of religious teaching, poetry, and scholarly literature. MSA is a direct descendant of CA and is used today throughout the Arab World in writing and in formal speaking [7].

MSA is different from Classical Arabic at the lexical, morphological, and syntactic levels [8], [9], [10]. At the lexical level, there is a significant expansion of the lexicon to cater for the needs of modernity. New words are constantly coined or borrowed from foreign languages. The coinage of new words does not necessarily abide by the classical morphological rules of derivation, which frequently leads to contention between modern writers and more traditional philologists. Although MSA conforms to the general rules of CA, MSA shows a tendency for simplification, and modern writers use only a subset of the full range of structures, inflections, and derivations available in CA. For example, Arabic speakers no longer strictly abide by case ending rules, which has led some structures to become obsolete, while some syntactic structures which were marginal in CA have gained more salience in MSA. For example, the object-verb-subject word order, one of the classical structures, is rarely found in MSA, while the relatively marginal subject-verb-object word order in CA is gaining more weight in MSA. This is confirmed by Van Mol [11], who quotes Stetkevych [12] as pointing out that MSA word order has shifted balance, as the subject now precedes the verb more frequently, breaking from the classical default word order of verb-subject-object.

Moreover, to avoid ambiguity and improve readability, there is a tendency to avoid passive verb forms when active readings are also possible, as in the words nuẓẓima 'to be organised' and wuṯṯiqa 'to be documented'. Instead of the passive form, the alternative syntactic construction tamma 'performed/done' + verbal noun is used: tamma tanẓiymuhu 'lit. organising it has been done / it was organised' and tamma tawṯiyquhu 'lit. documenting it has been done / it was documented'.

To our knowledge, apart from Van Mol's [11] study of the variations in complementary particles, no extensive empirical studies have been conducted to check how significant the difference between MSA and CA is at the morphological, lexical, or syntactic level.

1.2 A Brief History of Arabic Lexicography

Kitab al-'Ain by al-Khalil bin Ahmed al-Farahidi (died 789) is the first complete Arabic monolingual dictionary. It was a comprehensive descriptive record of the lexicon of the
contemporary Arabic language at the time. It did not just record the high-level formal language of the Koran, the Prophet's sayings, poetry, and memorable pieces of literature and proverbs, but it also included a truthful account of common words and phrases as used by Bedouins and common people. The other dictionaries that were compiled in the centuries following al-'Ain typically included either refinement, expansion, correction, or organisational improvements of the previous dictionaries. These dictionaries include Tahzib al-Lughah by Abu Mansour al-Azhari (died 980), al-Muheet by al-Sahib bin 'Abbad (died 995), Lisan al-'Arab by ibn Manzour (died 1311), al-Qamous al-Muheet by al-Fairouzabadi (died 1414) and Taj al-Arous by Muhammad Murtada al-Zabidi (died 1791) [13].

Even relatively modern dictionaries such as Muheet al-Muheet (1869) by Butrus al-Bustani and al-Mu'jam al-Waseet (1960) by the Academy of the Arabic Language in Cairo did not start from scratch, nor did they try to overhaul the process of dictionary compilation or make any significant change. Their aim was mostly to preserve the language, refine older dictionaries, and accommodate accepted modern terminology. Some researchers criticise Arabic dictionaries for representing a fossilised version of the language, with each new one reflecting the content of the preceding dictionaries [14]. To our knowledge, these remarks remain true today.

Noteworthy work in bilingual Arabic lexicography was done by Arabists, the most notable among them Edward William Lane in the nineteenth century and Hans Wehr in the twentieth century. Edward William Lane's Arabic–English Lexicon (compiled between 1842 and 1876) was strongly indebted, as admitted by Lane himself [15], to previous Arabic monolingual dictionaries, chiefly the Taj al-Arous by Muhammad Murtada al-Zabidi (1732–1791). Lane spent seven years in Egypt acquiring materials for his dictionary and ultimately helped preserve the decaying and mutilated manuscripts he relied on [16].

The most renowned and celebrated Arabic–English dictionary in modern times is Wehr's Dictionary of Modern Written Arabic (first published in 1961). The work started as an Arabic–German dictionary, Arabisches Wörterbuch für die Schriftsprache der Gegenwart, published in 1952, and was later translated to English, revised and extended. The dictionary compilers, Wehr and Cowan, stated that their primary goal was to follow descriptive and scientific principles by including only words and expressions that were attested in the corpus they collected [17]:

"From its inception, this dictionary has been compiled on scientific descriptive principles. It contains only words and expressions which were found in context during the course of wide reading in literature of every kind or which, on the basis of other evidence, can be shown to be unquestionably a part of the present-day vocabulary."

This was an ambitious goal indeed, but was the application up to the stated standard? We find three main defects that, in practice, defeated the declared purpose of the dictionary. The first is in data collection, the second is in the use of secondary sources, and the third is in their approach to idiosyncratic classicisms. Data collection was conducted between 1940 and 1948, and the data included 45,000 slips containing citations from
Arabic sources. These sources consisted of selected works by poets, literary critics, and writers immersed in classical literature and renowned for their grandiloquent language, such as Taha Husain, Muhammad Husain Haikal, Taufiq al-Hakim, Mahmoud Taimur, al-Manfalauti, Jubran Khalil Jubran, and Amin ar-Raihani (as well as some newspapers, periodicals, and specialised handbooks). These writers appeared at a time known in the history of Arabic literature as the period of Nahda, which means revival or Renaissance. A distinctive feature of many writers in this period was that they tried to emulate the famous literary works of the pre-Islamic era and the flourishing literature of the early centuries after Islam. This makes the data obviously skewed by favouring literary, imaginative language.

The second defect is that the dictionary compilers used some of the then-available Arabic–French and Arabic–English dictionaries as "secondary sources". Items in the secondary sources for which there were no attestations in the primary sources, i.e., corpus data, were left to the judgement of an Arabic native speaker collaborator in such a way that words known to him, or already included in older dictionaries, were incorporated. The use of secondary sources in this way was a serious fault because of the subjectivity of decisions, and this was enough to damage the reliability of Wehr's dictionary as a true representation of the contemporary language.

The third drawback was the dictionary compilers' approach to what they defined as the problem of classicisms, or rare literary words. Despite their full understanding of the nature of these archaic forms, the decision was to include them in the dictionary, even though it was sometimes evident that they "no longer form a part of the living lexicon and are used only by a small group of well-read literary connoisseurs" [17]. The inclusion of these rarities inevitably affected the representativeness of the dictionary and marked a significant bias towards literary forms.

Not too far from the domain of lexicography, two Arabic word count studies appeared in 1940 and 1959 but did not receive the attention they deserved from Arabic lexicographers, perhaps because the two works were intended for pedagogical purposes, to aid in vocabulary selection for primers and graded readers. The first was Moshe Brill's work [18], a pioneering systematic study in Arabic word counting. Brill conducted a word count on 136,000 running words from the Arabic daily press, and the results were published as The Basic Word List of the Arabic Daily Newspaper (1940). This word count was used as the basis for a useful Arabic–Hebrew dictionary compiled by two assistants of Brill.

In 1959, Jacob Landau tried to make up for what he perceived as a technical shortcoming in Brill's work: the count covered only the language of the daily press. He complemented Brill's work by conducting a word count on an equal portion of 136,000 running words of Arabic prose, based on 60 twentieth-century Egyptian books on a selection of various topics and domains including fiction, literary criticism, history, biography, political science, religion, social studies, and economics, with some material on the borderline between fiction and social sciences, e.g., travels and historical novels. It seems that Landau went into great detail in collecting this well-balanced corpus, which pre-dates the discipline of corpus linguistics and the first electronic corpus, the Brown Corpus [19].
Landau combined his work with Brill’s work in a book called A Word Count of Modern Arabic Prose [20]. The outcome was the result of two word
counts: Brill's count of press usage, and Landau's count of literary usage. The former included close to 6,000 separate words; the latter over 11,000; and the combined list 12,400 words. Through this frequency study, Landau was able to bring useful insights from the frequency statistics, which basically complied with Zipf's law. He noted that the first 25 words with the highest frequency represented 25% of the total number of running words; the first 100, more than 38%; the first 500, 58.5%; and the first 1,000, 70%. He also found that 1,134 words occurred only once each in the press, and 3,905 words occurred only once in literature, which reflects the abundance of rare words in literary works. An obvious weakness of this study, as admitted by the author himself, was that the number of running words counted (only 272,000) was inadequately small in comparison to the word counts for other languages at the time, such as those for English (25,000,000) and German (11,000,000).

1.3 Current State of Arabic Lexicography

To date, there is no large-scale lexicon (computational or otherwise) for MSA that is truly representative of the language. Al-Sulaiti [21] emphasises that most existing dictionaries are not corpus-based. Ghazali and Braham [14] point out that traditional Arabic dictionaries are based on historical perspectives and that they tend to include obsolete words that are no longer in current use. They stress the need for new dictionaries based on an empirical approach that makes use of contextual analysis of modern language corpora.

The Buckwalter Arabic Morphological Analyzer (BAMA) [3] is widely used in the Arabic NLP research community. It is a de facto standard tool and has been described as the "most respected lexical resource of its kind" [22]. It is designed as a main database of 40,648 lemmas supplemented by three morphological compatibility tables used for controlling affix-stem combinations. Other advantages of BAMA are that it provides information on the root, reconstructs vowel marks, and provides an English glossary. The latest version of BAMA has been renamed SAMA (Standard Arabic Morphological Analyzer), version 3.1 [6].

Unfortunately, there are some drawbacks in the SAMA lexical database that call into question its status as a truthful representation of MSA. We estimate that about 25% of the lexical items included in SAMA are outdated, based on our data-driven filtering method explained in Section 3.2. SAMA suffers from a legacy of heavy reliance on older Arabic dictionaries, particularly Wehr's dictionary [17], in the compilation of its lexical database. Therefore, there is a strong need to compile a lexicon for MSA that follows modern lexicographic conventions [23] in order to make the lexicon a reliable representation of the language.

There are only a few recorded attempts to stir the stagnant waters of Arabic lexicography. Van Mol in 2000 [24] developed an Arabic–Dutch learner's dictionary of 17,000 entries based on corpus data (3,000,000 words), which were used to derive information on contemporary usage, meanings, and collocations. He considered his work the first attempt to build a COBUILD-style dictionary [5]. More recently, Boudelaa
and Marslen-Wilson in 2010 [25] built a lexical database for MSA (based on a corpus of 40 million words) which provides information on token and type frequencies for psycholinguistic purposes. Our work represents a further step to address this critical gap in Arabic lexicography. We use a large corpus of one billion words to automatically create a lexical database for MSA. We follow the LMF naming conventions and hierarchical structures, and we provide complete information on inflection paradigms, root, patterns, humanness (for nouns), and transitivity (for verbs). This lexical database is interoperable with a finite-state morphological transducer.

1.4 Arabic Morphotactics

Arabic morphology is well known for being rich and complex. The reason behind this complexity is the fact that it has a multi-tiered structure where words are originally derived from roots and pass through a series of affixations and clitic attachments until they finally appear as surface forms. Morphotactics refers to the way morphemes combine together to form words [26], [27]. Generally speaking, morphotactics can be concatenative, with morphemes either prefixed or suffixed to stems, or non-concatenative, with stems undergoing internal alterations to convey morpho-syntactic information [28]. Arabic is considered a typical example of a language that employs both concatenative and non-concatenative morphotactics. For example, the verb istamaluw-hā 'they-used-it' and the noun wa-'l-istimālāt 'and-the-uses' both come from the root ʿml. Figure 1 shows the layers and tiers embedded in the Arabic morphological system.

The derivation layer is non-concatenative and opaque in the sense that it is a sort of abstraction that affects the choice of a part of speech (POS), and it does not have a direct explicit surface manifestation. By contrast, the inflection layer is more transparent. It applies concatenative morphotactics by using affixes to express morpho-syntactic features. We note that verbs at this level show what are called 'separated dependencies', which means that some prefixes determine the selection of suffixes.

From the analysis point of view, we note that stemming is conducted in the inflection layer and can be done at two levels: either by stripping off tier 5 alone, producing istamaluw and istimālāt in our examples, or tier 5 along with tier 4, producing istamal and istimāl. It must be noted that automatic stemming that removes clitics and/or affixes will tend to produce forms that do not resemble actual words unless there is a way to ameliorate the effect of alterations.

According to our analysis of a small subset of data (from the Arabic Gigaword Corpus) containing 1,664,181 word tokens, we found that there are 125,282 unique types, or full-form words, among the open classes: nouns, verbs, and adjectives. Stemming at the clitics level (tier 5) produces 42,145 unique stems, that is, a reduction of 66% of the types. Lemmatisation produces 19,499 unique lemmas, that is, a reduction of 54% of the stems and a reduction of 84% of the types. This shows that lemmatisation is very effective in reducing data sparseness for Arabic.

In the derivational layer, Arabic words are formed through the amalgamation of two tiers, namely root and pattern. A root is a sequence of three consonants, and the pattern is a template of vowels with slots into which the consonants of the root are inserted. This process of insertion is called interdigitation [4]. An example is shown in Table 1.
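Before moving on, the type/stem/lemma reduction figures quoted above follow directly from the reported counts; the short Python computation below reproduces them:

```python
# Reproducing the stemming/lemmatisation reductions from the reported counts
# (125,282 types; 42,145 stems; 19,499 lemmas).

types, stems, lemmas = 125_282, 42_145, 19_499

print(f"stems vs. types:  {1 - stems / types:.0%} reduction")    # ~66%
print(f"lemmas vs. stems: {1 - lemmas / stems:.0%} reduction")   # ~54%
print(f"lemmas vs. types: {1 - lemmas / types:.0%} reduction")   # ~84%
```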
Fig. 1. Arabic morphology’s multi-tier structure
2 Methodology

In this section, we explain the techniques and standards we followed in the construction of our lexical resource.

2.1 Using Heuristics and Statistics from a Large Corpus

For the construction of a lexicon for MSA, we take advantage of large and rich resources that have not been exploited in similar tasks before. We use a corpus of 1,089,111,204 words, consisting of 925,461,707 words from the Arabic Gigaword corpus, fourth edition [29], in addition to 163,649,497 words from news articles we collected from the Al-Jazeera web site.3 One concern about this corpus is that it might not be as well balanced as could be desired, as it is taken from only one domain, namely the news domain. However, to the best of our knowledge, this is the only large-scale corpus available for Arabic to date. Moreover, newspapers and websites tend to cover a variety of topics in addition to news. For example, the Al-Jazeera website covers, besides news, topics such as science, sports, art and culture, book reviews, economics, and health.
3 Collected in January 2010.
Table 1. Root and Pattern Interdigitation

Root: drs

Pattern         POS   Stem
R1aR2aR3a       V     darasa 'study'
R1aR2R2aR3a     V     darrasa 'teach'
R1āR2iR3        N     dāris 'student'
muR1aR2R2iR3    N     mudarris 'teacher'
We pre-annotate the corpus using MADA [30,31,32], a state-of-the-art tool for morphological processing. MADA combines SAMA and SVM classifiers to choose the best morphological analysis for a word in context, performing tokenisation, lemmatisation, diacritisation, POS tagging, and disambiguation. MADA is reported to achieve high accuracy (above 90%) for tokenisation and POS tagging tested on the Arabic Penn Treebank, but no evaluation of lemmatisation is reported. We use MADA and a data-driven filtering approach, described in Section 3.2, to identify core MSA lexical entries. For the annotated data, we collect statistics on lemma features and use machine learning techniques, also described in Section 3.2, in order to extend a manually constructed seed lexicon. We use machine learning specifically to predict new features that are not provided by either SAMA or MADA, such as continuation classes, humanness, and transitivity.

2.2 Using State-of-the-Art Standards for Lexical Resource Representation

Over the past decade, there has been a growing tendency to standardise lexical resources by specifying the architecture and the component parts of the lexical model. There is also a need to specify how these components are interconnected and how the lexical resource as a whole exchanges information with other NLP applications. Lexical Markup Framework (LMF) [33,34] has emerged as an ISO standard that provides the specifications of the lexical database not just for a particular language but, presumably, for any language. LMF provides encoding formats, naming conventions, and a hierarchical structure of the components of lexical resources to ensure consistency. LMF was published officially as an international standard in 2008 and is now considered the state of the art in NLP lexical resource management. The purpose of LMF is to facilitate the exchange of lexical information between different lexical resources on the one hand and between lexical resources and NLP applications on the other. LMF takes into account the particular needs of languages with rich and complex morphology, such as Arabic. Figure 2 shows the Arabic root management in LMF and how verbs and nouns are linked through the common root.

2.3 Using Finite State Technology

One of our objectives for constructing the lexical resource is to build a morphological analyser and generator using bidirectional finite state technology (FST). FST has been used successfully in developing morphologies for many languages, including Semitic languages [27].
Fig. 2. LMF Arabic root management (adapted from [34])
There are a number of advantages of this technology that make it especially attractive for dealing with human language morphologies; among these are the ability to handle concatenative and non-concatenative morphotactics, and the high speed and efficiency in handling large automata of lexicons with their derivations and inflections that can run into millions of paths.
3 Results to Date

In this section, we present the results we have obtained so far in building and extending the lexical database. We describe a web application, AraComLex, which we built for maintaining and curating our lexical resource. We also outline our test case, namely an open-source FST morphological analyser based on our lexical database.

3.1 Building Lexical Resources

There are three key components in the Arabic morphological system: root, pattern, and lemma. In order to accommodate these components, we create four lexical databases: one for nominal lemmas (including nouns and adjectives), one for verb lemmas, one for word patterns, and one for root-lemma lookup. From a manually created MSA lexicon [2] we construct a seed database of 5,925 nominal lemmas and 1,529 verb lemmas. At the moment, we focus on open word classes and exclude proper nouns, function words, and multiword expressions, which are relatively stable and fixed from an inflectional point of view.

We build a database of 380 Arabic patterns (346 for nominals and 34 for verbs), which can be used as indicators of the morphological inflectional and derivational behaviour of Arabic words. Patterns are also powerful in the abstraction and coarse-grained categorisation of word forms. In our lexicon, we account for 93.2% of all nominals using a set of 94 pre-defined patterns implemented as regular expressions.

We create a lemma-root lookup database containing all lemmas and their roots. On the surface, nominals and verbs have different, unrelated inflection paradigms. But in fact, they are closely interconnected through the common root. For example, if a root is capable of producing a transitive verb, it can also produce a passive participle.
Fig. 3. Entity-relationship diagram of AraComLex
fact, they are closely interconnected through the common root. For example, if a root is capable of producing a transitive verb, it can also produce a passive participle.

3.2 AraComLex Lexical Management Application

In order to manage our lexical database, we have developed the AraComLex lexicon authoring system, which provides a graphical user interface for human lexicographers to curate the automatically derived lexical and morphological information. We use AraComLex for storing the lexical resources mentioned in Section 3.1 as well as for generating data for machine learning, storing extensions to the lexicon, and generating data for the morphological transducer, as explained in the following subsections. Figure 3 shows the entity-relationship diagram [35] of the database used in the AraComLex application. In this diagram, entities are drawn as rectangles and relationships as diamonds. Relationships connect pairs of entities with given cardinality constraints (represented as numbers surrounding the relationship). Three types of cardinality constraints are used in the diagram: 0 (entries in the entity are not required to take part in the relationship), 1 (each entry takes part in exactly one relationship) and n (entries can take part in an arbitrary number of relationships). Entities correspond to tables in the database, while relationships model the relations between the tables. In AraComLex, we provide the key lexical information for bootstrapping and extending a morphological processing engine. AraComLex covers LMF's morphology extension by listing the relevant morphological and morpho-syntactic features for each lemma. We use finite sets of values implemented as drop-down menus to allow lexicographers to edit entries while ensuring consistency, as shown in figure 4. Two of the innovative features added are the "±human" feature and the 13 continuation classes, which stand for the inflection grid, or all possible inflection paths, for nominals.
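To make the structure concrete, a lemma entry curated in AraComLex can be pictured as a record over these feature sets; the following minimal Python sketch illustrates the idea (the field names are ours, chosen for illustration, not AraComLex's own schema):

    from dataclasses import dataclass

    @dataclass
    class NominalEntry:
        lemma: str
        part_of_speech: str      # 'noun', 'noun_prop', 'adj', ...
        lemma_morph: str         # 'masc', 'fem' or 'unspec'
        human: str               # 'yes', 'no' or 'unspec'
        continuation_class: str  # one of the 13 inflection paths (table 2)

    entry = NominalEntry('muallim', 'noun', 'masc', 'yes',
                         'Fem-Mascdu-Femdu-Mascpl-Fempl')

Implementing the feature values as a closed set of strings mirrors the drop-down menus used in the authoring interface.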
Fig. 4. AraComLex Lexicon Authoring System for nominals with support statistics
Figure 4 shows the features specified for nominal lemmas in AraComLex. The feature "partOfSpeech" can be one of 'noun', 'noun_prop', 'noun_quant', 'noun_num', 'adj', 'adj_comp', or 'adj_num'. The "lemma_morph" feature can be either 'masc' or 'fem' for nouns, and can also be 'unspec' (unspecified) for adjectives. The "human" feature can be 'yes', 'no', or 'unspec'. There are 13 continuation classes for Arabic nominals, shown in table 2, which represent the inflection grid (or all possible inflection paths) for nominals. For verb lemmas we provide information on whether the verb is transitive or intransitive and whether it allows passive and imperative inflection. For patterns we specify whether a pattern is nominal or verbal and whether it indicates a broken plural or singular form. Overall, we have 12 types of patterns: verbs, singular nouns, broken plural nouns, masdar (verbal noun), names of instruments, active participles, passive participles, marrah (instance), mubalaghah (exaggeration), comparative adjectives, mushabbahah (semi-adjectives), and names of places.

3.3 Extending the Lexical Database

In extending our lexicon, we rely on Attia's manually constructed finite state morphology [2] and the lexical database in SAMA 3.1 [6]. Creating a lexicon is usually a labour-intensive task. For instance, Attia took three years to develop his morphology, while SAMA and its predecessor, Buckwalter's morphology, were developed over more than a decade, with at least seven people involved in updating and maintaining them. In this project we want to automatically extend Attia's finite state morphology using SAMA's database, but we need to solve two problems. First, SAMA suffers from a legacy of obsolete entries, and we need to filter out these outdated words, as we want to enrich our lexicon only with lexical items that are still in current use. Second, our lexical database requires features (such as humanness for nouns and transitivity for verbs) that are not provided by SAMA, and we want to automatically induce these features.

3.3.1 Lexical Enrichment. To address the first problem we use a data-driven filtering method that combines open web search engines and our pre-annotated corpus. Using statistics collected in January 2011 from three web search engines (Al-Jazeera, Arabic Wikipedia, and
Table 2. Arabic Inflection Grid and Continuation Classes

[The table pairs each of the 13 nominal continuation classes, namely Fem-Mascdu-Femdu-Mascpl-Fempl, Fem-Mascdu-Femdu-Fempl, Fem-Mascdu-Femdu, Femdu-Fempl, Fempl, Femdu, Fem, Mascdu-Femdu, Mascdu-Mascpl, Mascdu, Mascpl, NoNum and Irreg_pl, with an example lemma and its attested feminine, dual and plural forms; the example lemmas include muallim 'teacher' (muallimat, muallimān, muallima-tān, muallimuwn, muallimāt), taliymiyy 'educational', .tālib 'student', baqarat 'cow', tanāzul 'concession', imtih.ān 'exam', mah.d. 'mere', d.ah.iyyat 'victim', .tayyār 'pilot', kitāb 'book', diymuqrāt.iyy 'democrat', huruwǧ 'exiting' and mabāh.it 'investigators'.]
the Arabic BBC website), we find that 7,095 lemmas in SAMA have zero hits, leaving only 33,553 as valid forms in MSA. Corpus statistics from our corpus, described in Section 2.1, show that 3,604 lemmas are not used in the corpus at all, leaving 37,044 lemmas that have at least one instance in the corpus. Of those, 4,471 lemmas occur fewer than 10 times, leaving 32,573 as more stable lemmas. Combining web statistics and corpus statistics, we find that there are 30,739 lemmas that returned at least one hit in the web queries and occurred at least once in the corpus. Only 29,627 lemmas are left if we consider lemmas that have at least one hit in the web data and occurred at least 10 times in the corpus. The threshold of 10 occurrences is discretionary, but the aim is to separate the stable core of the language from instances where the use of a word is perhaps accidental or idiosyncratic. We consider the refined list as representative of the lexicon of MSA as attested by our statistics.
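The filtering logic just described fits in a few lines of Python; the sketch below assumes the web hit counts and corpus frequencies have already been gathered into dictionaries keyed by lemma:

    def filter_msa_lemmas(lemmas, web_hits, corpus_freq, min_corpus_freq=10):
        """Keep only lemmas attested both on the web and in the corpus."""
        return [lemma for lemma in lemmas
                if web_hits.get(lemma, 0) > 0
                and corpus_freq.get(lemma, 0) >= min_corpus_freq]

    # With the statistics reported above, this yields 29,627 lemmas;
    # lowering min_corpus_freq to 1 yields 30,739.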
Table 3. Results of the classification experiments

                                              P     R     F
Nominals (features: number, gender, case, clitics)
  Continuation Classes (13 classes)           0.62  0.65  0.63
  Human (yes, no, unspec)                     0.86  0.87  0.86
  POS (Noun, Adjective)                       0.85  0.86  0.85
Verbs (features: number, gender, person, aspect, mood, voice, clitics)
  Transitivity (Transitive, Intransitive)     0.85  0.85  0.84
  Allow Passive (yes, no)                     0.72  0.72  0.72
  Allow Imperative (yes, no)                  0.63  0.65  0.64
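The experiments behind table 3 (described in section 3.3.2 below) can be replicated with any off-the-shelf Multilayer Perceptron implementation; a minimal sketch using scikit-learn in place of Weka, with random stand-in data (the real feature vectors are corpus statistics of the morphological features listed above):

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import train_test_split

    # One row per lemma; 20 feature columns and the humanness labels are
    # random placeholders for illustration only.
    rng = np.random.default_rng(0)
    X = rng.random((4816, 20))
    y = rng.choice(['yes', 'no', 'unspec'], size=4816)

    # 66%/34% split, as in the experiments reported here.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.66, test_size=0.34, random_state=0)
    clf = MLPClassifier(max_iter=500).fit(X_train, y_train)
    print(clf.score(X_test, y_test))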
We note that web statistics and corpus statistics are substantially different. Web searching does not allow diacritisation and offers no form disambiguation, while the corpus is automatically diacritised and disambiguated using MADA. Therefore, we believe that the two statistics are complementary: web queries show how likely it is that a word form is in current use, and corpus statistics indicate how likely it is that a certain interpretation is valid. It is problematic, if at all necessary, to manually create a gold standard that indicates whether a word belongs to MSA or CA. A human decision in this regard can be highly biased, subjective and dependent on each annotator's perception and educational background. Therefore, we assume that the safest criterion for selecting MSA entries is to investigate whether or not a certain entry is attested in a large and representative modern corpus.

3.3.2 Feature Enrichment. To address the second problem, we use a machine learning classification algorithm, the Multilayer Perceptron [36,37], to build a model for predicting the required features for each new lemma. We use two manually annotated datasets of 4,816 nominals and 1,448 verbs. We feed these datasets with statistics from our pre-annotated corpus and build these statistics into a vector grid. The features that we use for nominals are number, gender, case and clitics; for verbs, they are number, gender, person, aspect, mood, voice and clitics. For the implementation of the machine learning algorithm we use the open-source application Weka, version 3.6.4. We split each dataset into 66% for training and 34% for testing. We conduct six experiments to classify for the six features that we need to include in our lexical database. For nominals we predict which continuation class (or inflection path) each nominal is likely to take, we predict the grammatical feature of humanness, and we also classify nominals into nouns
and adjectives. As for verbs, we classify them according to transitivity, and according to whether or not they allow inflection for the passive voice and the imperative mood. Table 3 gives the results of the experiments in terms of precision, recall and f-measure.

The results show that the highest f-measure scores were achieved for 'Human', 'POS' and 'Transitivity'. Typically one would assume that these features are hard to predict with any reasonable accuracy without taking context into account, so it was surprising to obtain such good predictions based only on statistics of morphological features. One could assume that the 'clitics' feature provides some clues about the context; however, removing the 'clitics' feature results only in a small drop, from 0.86 to 0.83, in the f-measure for 'Human'. This means that Arabic morphological features are powerful predictors of a lemma's unknown features. We also note that the f-measure for 'Continuation Classes' is comparatively low, but considering that here we are classifying into 13 classes, we consider the results acceptable.

3.4 An Open-Source FST Arabic Morphological Analyser

The Xerox XFST system [27] is a well-known finite state compiler, but its disadvantage is that it requires a license for full functionality, which limits its use in the larger research community. Fortunately, there is an attractive alternative to the Xerox compiler, namely Foma [38], an open-source finite-state toolkit that implements the Xerox lexc and xfst utilities. Foma is largely compatible with the Xerox/PARC finite-state tools; it also fully embraces Unicode and supports a number of operating systems. We have developed an open-source morphological analyser for Arabic using the Foma compiler, allowing us to easily share and distribute our morphology to third parties. The database, which is being edited and validated using the AraComLex tool, is used to automatically extend and update the morphological analyser, allowing for greater coverage and better capabilities. In this section we explain our system design and report on evaluation results.

3.4.1 System Design and Description. There are three main strategies for the development of Arabic morphological analysers, depending on the initial level of analysis: root, stem or lemma. In a root-based morphology, such as the Xerox Arabic Morphological Analyser [4], the analysis of Arabic words is based on a list of roots and a list of patterns interacting together in a process called interdigitation, as explained earlier. In a stem-based morphology, such as SAMA [3,6], the stem is considered the base form of the word. A stem is a form between the lemma and the surface form, and one lemma can have several stem variations when interacting with prefixes and suffixes. Such a system does not use alteration rules and relies instead on listing all stems (or form variations) in the database. For example, in SAMA's database, the verb šakara 'to thank' has two entries: šakara for the perfective and škur for the imperfective. In a lemma-based morphology, words are analysed at the lemma level. A lemma is the least marked form of a word, that is, the uninflected word without suffixes, prefixes, proclitics or enclitics. In Arabic, this is usually the perfective, 3rd person, singular verb, and in the case of
nouns and adjectives, the singular indefinite form. In a lemma-based morphology there is only one entry for the verb šakara, namely the perfective form; the imperfective, along with the other inflected forms, is generated from the lemma through alteration rules. In our implementation we use the lemma as the base form. We believe that a lemma-based morphology is more economical than a stem-based morphology, as it does not list all form variations and relies on generalised rules. It is also less complex than the root-based approach and less likely to overgenerate [1,2]. This leads to better maintainability and scalability of our morphology.

In a standard finite state system, lexical entries along with all possible affixes and clitics are encoded in the lexc language, which is a right-recursive phrase structure grammar [4,27]. A lexc file contains a number of lexicons connected through what is known as "continuation classes", which determine the path of concatenation. In example (1), the lexicon 'Proclitic' has a form 'wa' which has a continuation class 'Prefix'. This means that the forms in 'Prefix' will be appended to the right of 'wa'. The lexicon 'Proclitic' also has an empty string, which means that 'Proclitic' is optional and that the path can proceed without it. The bulk of lexical entries are listed under 'Root' in the example.
(1)  LEXICON Proclitic
     wa   Prefix;
          Prefix;

With inflections and concatenations, words usually become subject to changes or alterations in their forms. Alterations are the discrepancies between underlying strings and their surface realisations [26], and alteration rules are the rules that relate the surface forms to the underlying forms. In Arabic, long vowels, glides and the glottal stop are subject to a great deal of phonological (and consequently orthographical) alterations, such as assimilation and deletion. Many of the challenges an Arabic morphological analyser faces are related to handling these issues. In our system there are about 130 replace rules to handle alterations that affect verbs, nouns, adjectives and function words when they undergo inflections or are attached to affixes and clitics. Alteration rules are expressed in finite state systems using XFST replace rules of the general form shown in (2).

(2)  A -> B || L _ R

The rule states that the string A is replaced with the string B when A occurs between the left context L and the right context R.
In our system, nouns are added by choosing from a template of continuation classes which determine which path of inflection each noun selects, as shown in example (3) (glosses are included in square brackets for illustration only).
(3)  LEXICON Nominals
     ! forms transliterated; continuation classes as in table 2
     muallim   Fem-Mascdu-Femdu-Mascpl-Fempl;  ! ['teacher']
     Taalib    Fem-Mascdu-Femdu;               ! ['student']
     kitaab    Mascdu;                         ! ['book']
     daftar    Mascdu-Mascpl;                  ! ['notebook']
These continuation class templates are based on the facts in table 2 above, which shows the inflection choices available to Arabic nouns according to gender (masculine or feminine) and number (singular, dual or plural). As for verbs in our lexc file, the start and end of stems are marked to provide the information needed for conducting alteration operations, as shown in example (4). The tags provide the following information:

– Multi-character symbols (rendered here as ^SB and ^SE) stand for stem start and stem end.
– Flag diacritics (rendered here as @D.PASS@ and @D.IMP@) mean "disallow the passive voice" and "disallow the imperative mood", respectively.
– The remaining symbols (rendered here as V-Suffix1 and V-Suffix2) are the continuation classes for verbs.
(4)  LEXICON Verbs
     ! symbol and class names are rendered schematically
     ^SB šakar ^SE             V-Suffix1;  ! ['thank']
     ^SB saʕid ^SE @D.PASS@    V-Suffix2;  ! ['be-happy']
     ^SB ʔamar ^SE @D.IMP@     V-Suffix1;  ! ['order']
     ^SB qaal ^SE              V-Suffix2;  ! ['say']

3.4.2 Morphology Evaluation. In this section we test the coverage and the rate of analyses per word of our morphological analyser, compared to an earlier version (the baseline) and to SAMA. We build a test corpus of 800,000 words, divided into 400,000 words of what we term Semi-Literary text and 400,000 words of General News text. The Semi-Literary texts consist of articles collected from columns, commentaries, opinions and analytical essays written by professional writers, who tend to use figurative and metaphorical language not commonly found in ordinary news. This type of text exhibits the characteristics of literary text, especially a high ratio of word types to word tokens: out of the 400,000 tokens there are 60,564 types. The General News text contrasts with the literary text in that it has a lower ratio of word types to word tokens: out of the 400,000 tokens there are 42,887 types. This observation is similar to the finding of Jacob Landau in his book A Word Count of Modern Arabic Prose [20], where he conducted a word count on 136,000 running words from Arabic prose and an equal portion from the daily press; the former yielded over 11,000 unique words and the latter close to 6,000.
Table 4. Coverage and Rate per word test results

Morphology   No. of Lemmas   General News              Semi-Literary
                             Coverage  Rate per word   Coverage  Rate per word
Baseline     10,799          79.68%    1.67            69.37%    1.62
AraComLex    28,807          86.89%    2.10            85.14%    2.09
SAMA         40,648          88.13%    5.32            86.95%    5.30
Table 4 compares the coverage and rate-per-word results for AraComLex against the baseline (the morphology developed in [2]) and LDC's SAMA, version 3.0. The results show that for the Semi-Literary texts AraComLex achieves a considerable improvement in coverage over the baseline, rising from 69.37% to 85.14%, a 15.77% absolute improvement. For the General News texts the improvement is smaller: from 79.68% to 86.89% coverage, a 7.21% absolute improvement. Compared to SAMA, AraComLex has 1.24% (absolute) less coverage on General News and 1.81% (absolute) less coverage on the Semi-Literary texts. At the same time, the average rate of analyses per word (the ambiguity rate) is significantly lower in AraComLex (2.1) than in SAMA (5.3). Testing thus shows that we achieve coverage comparable to SAMA's morphology, while our ambiguity level (rate of analyses per word) is about 60% lower than SAMA's. We assume that the lower ambiguity rate in AraComLex is mainly due to the fact that we excluded obsolete words and morphological analyses from our lexical database.
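Both figures reported in table 4 are straightforward to compute from an analyser's output; a minimal sketch, assuming analyse(w) returns the list of analyses for the word w:

    def coverage_and_rate(tokens, analyse):
        """Coverage and average number of analyses per analysed word."""
        analyses = [analyse(t) for t in tokens]
        analysed = [a for a in analyses if a]           # words with >= 1 analysis
        coverage = len(analysed) / len(tokens)          # e.g., 0.8689 (AraComLex)
        rate = sum(len(a) for a in analysed) / len(analysed)
        return coverage, rate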
4 Future Work

In extending our lexical database, we have relied mainly on SAMA. We filtered the SAMA lexicon through open web search engines and corpus data pre-annotated with MADA. In our corpus there are more than 700,000 types (unique words) that are not recognised by SAMA. We now need to devise a methodology to validate and include stable lemmas not present in SAMA's database. This will entail using a morphological guesser and applying heuristics. We will also include a large list of named entities [39] and multiword expressions [40] in a separate database that can be automatically embedded in our morphological analyser.
5 Conclusion

We build a lexicon for MSA that provides the information necessary for constructing a morphological analyser, independently of design, approach, implementation strategy and programming language. We focus on the problem that existing lexical resources tend to include obsolete analyses and lexical entries that are no longer attested in MSA. We start off with a manually constructed lexicon of 10,799 MSA lemmas and automatically extend it using lexical entries from SAMA's lexical database, carefully
excluding obsolete entries and analyses. We use machine learning on statistics derived from a large pre-annotated corpus to automatically extend and complement the SAMA-based lexical information, resulting in a lexicon of 28,807 MSA lemmas. We follow the LMF standard for lexical representation, which aims at facilitating interoperability and the exchange of data between the lexicon and NLP applications. We have developed a lexicon authoring system, AraComLex, to aid the manual revision of the lexical database by lexicographers. We use the database to automatically update and extend an open-source finite state morphological transducer. Evaluation results show that our transducer has coverage similar to SAMA's, but at a significantly reduced average rate of analyses per word, due to the avoidance of outdated entries and analyses.

Acknowledgments. This research is funded by Enterprise Ireland (PC/09/037), the Irish Research Council for Science, Engineering and Technology (IRCSET), and the EU projects PANACEA (7FP-ITC-248064) and META-NET (FP7-ICT-249119).
References

1. Dichy, J., Ali, F.: Roots & Patterns vs. Stems plus Grammar-Lexis Specifications: on what basis should a multilingual lexical database centred on Arabic be built? In: The MT-Summit IX Workshop on Machine Translation for Semitic Languages, New Orleans (2003)
2. Attia, M.: An Ambiguity-Controlled Morphological Analyzer for Modern Standard Arabic Modelling Finite State Networks. In: Challenges of Arabic for NLP/MT Conference. The British Computer Society, London (2006)
3. Buckwalter, T.: Buckwalter Arabic Morphological Analyzer (BAMA) Version 2.0. Linguistic Data Consortium (LDC) catalogue number LDC2004L02, ISBN 1-58563-324-0 (2004)
4. Beesley, K.R.: Finite-State Morphological Analysis and Generation of Arabic at Xerox Research: Status and Plans in 2001. In: The ACL 2001 Workshop on Arabic Language Processing: Status and Prospects, Toulouse, France (2001)
5. Sinclair, J.M. (ed.): Looking Up: An Account of the COBUILD Project in Lexical Computing. Collins, London (1987)
6. Maamouri, M., Graff, D., Bouziri, B., Krouna, S., Kulick, S.: LDC Standard Arabic Morphological Analyzer (SAMA) v. 3.0. LDC Catalog No. LDC2010L01, ISBN 1-58563-555-3 (2010)
7. Bin-Muqbil, M.: Phonetic and Phonological Aspects of Arabic Emphatics and Gutturals. Ph.D. thesis, University of Wisconsin, Madison (2006)
8. Watson, J.: The Phonology and Morphology of Arabic. Oxford University Press, New York (2002)
9. Elgibali, A., Badawi, E.M.: Understanding Arabic: Essays in Contemporary Arabic Linguistics in Honor of El-Said M. Badawi. American University in Cairo Press, Egypt (1996)
10. Fischer, W.: Classical Arabic. In: The Semitic Languages. Routledge, London (1997)
11. Van Mol, M.: Variation in Modern Standard Arabic in Radio News Broadcasts: A Synchronic Descriptive Investigation in the Use of Complementary Particles. Leuven, OLA 117 (2003)
12. Stetkevych, J.: The Modern Arabic Literary Language: Lexical and Stylistic Developments. Publications of the Center for Middle Eastern Studies, vol. 6. University of Chicago Press, Chicago (1970)
13. Owens, J.: The Arabic Grammatical Tradition. In: The Semitic Languages. Routledge, London (1997)
14. Ghazali, S., Braham, A.: Dictionary Definitions and Corpus-Based Evidence in Modern Standard Arabic. In: Arabic NLP Workshop at ACL/EACL, Toulouse, France (2001)
15. Lane, E.W.: Preface. In: Arabic–English Lexicon. Williams and Norgate, London (1863)
16. Arberry, A.J.: Oriental Essays: Portraits of Seven Scholars. George Allen and Unwin, London (1960)
17. Wehr, H., Cowan, J.M.: Dictionary of Modern Written Arabic, pp. VII–XV. Spoken Language Services, Ithaca (1976)
18. Brill, M.: The Basic Word List of the Arabic Daily Newspaper. The Hebrew University Press Association, Jerusalem (1940)
19. Kučera, H., Francis, W.N.: Computational Analysis of Present-Day American English. Brown University Press, Providence (1967)
20. Landau, J.M.: A Word Count of Modern Arabic Prose. American Council of Learned Societies, New York (1959)
21. Al-Sulaiti, L., Atwell, E.: The design of a corpus of contemporary Arabic. International Journal of Corpus Linguistics 11 (2006)
22. Hajič, J., Smrž, O., Buckwalter, T., Jin, H.: Feature-Based Tagger of Approximations of Functional Arabic Morphology. In: The 4th Workshop on Treebanks and Linguistic Theories (TLT 2005), Barcelona, Spain (2005)
23. Atkins, B.T.S., Rundell, M.: The Oxford Guide to Practical Lexicography. Oxford University Press, Oxford (2008)
24. Van Mol, M.: The development of a new learner's dictionary for Modern Standard Arabic: the linguistic corpus approach. In: Heid, U., Evert, S., Lehmann, E., Rohrer, C. (eds.) Proceedings of the Ninth EURALEX International Congress, Stuttgart, pp. 831–836 (2000)
25. Boudelaa, S., Marslen-Wilson, W.D.: Aralex: A lexical database for Modern Standard Arabic. Behavior Research Methods 42(2) (2010)
26. Beesley, K.R.: Arabic Morphological Analysis on the Internet. In: The 6th International Conference and Exhibition on Multilingual Computing, Cambridge, UK (1998)
27. Beesley, K.R., Karttunen, L.: Finite State Morphology. CSLI Studies in Computational Linguistics. CSLI, Stanford (2003)
28. Kiraz, G.A.: Computational Nonlinear Morphology: With Emphasis on Semitic Languages. Cambridge University Press, Cambridge (2001)
29. Parker, R., Graff, D., Chen, K., Kong, J., Maeda, K.: Arabic Gigaword Fourth Edition. LDC Catalog No. LDC2009T30, ISBN 1-58563-532-4 (2009)
30. Habash, N., Rambow, O., Roth, R.: MADA+TOKAN: A Toolkit for Arabic Tokenization, Diacritization, Morphological Disambiguation, POS Tagging, Stemming and Lemmatization. In: The 2nd International Conference on Arabic Language Resources and Tools (MEDAR 2009), Cairo, Egypt, pp. 102–109 (2009)
31. Habash, N., Rambow, O.: Arabic Tokenization, Morphological Analysis, and Part-of-Speech Tagging in One Fell Swoop. In: Proceedings of the Conference of the Association for Computational Linguistics (ACL 2005), The University of Michigan, Ann Arbor (2005)
32. Roth, R., Rambow, O., Habash, N., Diab, M., Rudin, C.: Arabic Morphological Tagging, Diacritization, and Lemmatization Using Lexeme Models and Feature Ranking. In: Proceedings of the Association for Computational Linguistics (ACL), Columbus, Ohio (2008)
33. Francopoulo, G., Bel, N., George, M., Calzolari, N., Monachini, M., Pet, M., Soria, C.: Multilingual resources for NLP in the Lexical Markup Framework (LMF). Language Resources and Evaluation (2008), ISSN 1574-020X
34. ISO 24613: Language Resource Management – Lexical Markup Framework (draft version). ISO, Switzerland (2007)
35. Chen, P.P.: The Entity-Relationship Model: Toward a Unified View of Data. ACM Transactions on Database Systems 1, 9–36 (1976)
36. Rosenblatt, F.: Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington DC (1961)
37. Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd edn. Prentice Hall, Englewood Cliffs (1998)
38. Hulden, M.: Foma: a finite-state compiler and library. In: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2009). Association for Computational Linguistics, Stroudsburg (2009)
39. Attia, M., Toral, A., Tounsi, L., Monachini, M., van Genabith, J.: An automatically built Named Entity lexicon for Arabic. In: LREC 2010, Valletta, Malta (2010)
40. Attia, M., Toral, A., Tounsi, L., Monachini, M., van Genabith, J.: Automatic Extraction of Arabic Multiword Expressions. In: COLING 2010 Workshop on Multiword Expressions: from Theory to Applications, Beijing, China (2010)
Indonesian Morphology Tool (MorphInd): Towards an Indonesian Corpus

Septina Dian Larasati, Vladislav Kuboň, and Daniel Zeman

Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
Abstract. This paper describes a robust finite state morphology tool for Indonesian (MorphInd), which handles both morphological analysis and lemmatization for a given surface word form so that it is suitable for further language processing. MorphInd has wider coverage of Indonesian derivational and inflectional morphology than an existing Indonesian morphological analyzer [1], along with a more detailed tagset. MorphInd outputs the analysis in the form of segmented morphemes along with the morphological tags. The implementation uses finite state technology, adopting the two-level morphology approach as implemented in Foma. It achieved 84.6% coverage on a preliminary-stage Indonesian corpus, where it mostly fails to capture proper nouns and foreign words, as initially expected.
1 Introduction

Indonesian, or Bahasa Indonesia as the locals call it, is the official language of Indonesia. The language is spoken by approximately 230 million people throughout the country, but has only 23 million native speakers. Typologically, the language can be classified as partially isolating and partially agglutinative. Language technology research on this language has been quite active in recent years, but without a well-developed, continuous long-term plan. There are many language tools, such as a parser, a semantic analyzer and a speech recognition tool. Our Indonesian morphology tool, MorphInd, is intended to lay proper groundwork before any further language processing is done. MorphInd is applied to enrich a raw Indonesian text with morphological information, a preprocessing stage in the development of an Indonesian corpus.

MorphInd was inspired by an existing Indonesian morphological analyzer tool [1] (hereinafter called IndMA), whose analyses we found inadequate; more on this matter in section 2. MorphInd introduces a more fine-grained tagset than IndMA and gives its output in the form of segmented morphemes as an added value. In addition, the lemmata are tagged independently for lemmatization purposes.

The goal of the work described in this paper is to have an Indonesian morphology tool which has broader morphological and lexical coverage, provides richer and less ambiguous linguistic information in its analyses, and is tested on common Indonesian
text. The work includes the design of a new tagset to cope with Indonesian morphological phenomena; the format of the analysis output, including the morphemic segmentation format and lemma marking; and a better organization of the lexical categories. The coverage of the tool is then evaluated on an Indonesian corpus consisting of texts from different domains.
2 Motivation

Work on Indonesian morphology has been carried out over a long period. There was previous work on developing Indonesian stemmers [2,3]. The limitation of these tools is that they only recover the root of an affixed surface form, without any of the additional linguistic information that can be encoded by the occurrence or combination of morphemes. Then an initial version of a morphological analyzer [4], developed in PC-KIMMO, was introduced. Unfortunately, reduplication, which is one of the crucial points of Indonesian morphology, was not yet covered by that tool. The latest work on morphological analysis is a finite state tool [1] implemented in XFST [5], a commercial finite state technology (FST) toolkit. The tool is able to handle most Indonesian morphosyntactic and morphophonemic phenomena. However, despite how robustly it models morpheme composition, the linguistic information it produces in its analyses is rather simple and ambiguous. Although it was developed in an FST environment, reduplication and affixed reduplication are also covered by the tool (more on this matter in section 3.4).

We decided that IndMA was a good starting point for developing MorphInd, which is basically a refinement of IndMA, although we made some major changes to the finite state architecture when we ported the rules. Those changes are intended to make it more organized for further development. Here we point out four issues that we found to be limitations of IndMA and that we refined in MorphInd.

Issue #1. Shallow Lexical Categorization. IndMA was designed with only a simple tagset consisting of four major lexical tags, namely 'Noun', 'Verb', 'Adjective', and 'Etc', plus several additional language feature tags. This shallow categorization is not adequate for passing on to another tool such as a parser, where categories such as 'Numeral', 'Adverb' and many others play an important role.

Issue #2. Underspecified Analysis. The output is in the form of a lemma followed by morphological tags. Since the output is in this simple form with a limited tagset, some analyses are underspecified: different word derivations receive the same analysis. Figure 1 shows examples of derivations of the verb v. kirim ('send/deliver'), where several derived words have the same lemma and fall in the same lexical category. On the other hand, when generating surface forms from an analysis, the finite state network outputs many varieties of morpheme combinations, many of which are invalid surface word forms (see figure 2).
Fig. 1. IndMA analysis examples (the inputs v. kirim 'send/deliver', n. kiriman 'packages', n. pengirim 'deliverer' and n. pengiriman 'delivery' all receive the same analysis)

Fig. 2. IndMA generation examples (generating from the analyses of n. kiriman 'packages', n. pengirim 'deliverer' and n. pengiriman 'delivery' also produces invalid surface forms such as *pemberkiriman, *perkiriman, *kerberkiriman and *kekiriman)
Issue #3. Morphosyntactic Rules. The morphosyntactic rules defined in IndMA cover almost all possible cases in Indonesian, disregarding the exceptional cases, which are not trivial to solve. However, there are further morphosyntactic cases that are trivial to solve yet not covered by IndMA, such as clitics.

Issue #4. Software License. IndMA was developed on XFST, a commercial finite-state automata and transducer toolkit. The tool uses a patent-encumbered function for the non-concatenative morphology operation required by reduplication; therefore the overall software cannot be used freely. The aim for MorphInd is to make it available to any individual who wants to use or refine the tool.
3 Tool Design

MorphInd was designed to address the four issues mentioned above. MorphInd produces analyses that cover only morphological phenomena; it does not handle syntax, but its output can be used as input to many other Natural Language Processing (NLP) tasks. MorphInd analyzes tokens as unigrams and does not take any neighbouring tokens into account. MorphInd does not mark syntactic functions in its analyses, although some functions are easily recognized from the word order or the clitics. For example, we do not mark the 'subject' of the sentence where it can be easily recognized from a common proclitic attached to a verb, but the fact that the surface word form has a pronoun proclitic is kept in the analysis. We decided to do this since such processing is the task of a parser. In a more complex system, MorphInd can be used as one of the modules that provide morphological tags before parsing.
3.1 Tagset Design and Lexical Category Organization

MorphInd organizes the lexical entries into 17 lexical categories. These categories are basically 'Noun', 'Verb' and 'Adjective' as in IndMA, and we broke 'Etc' down into several further categories such as 'Preposition' and 'Modal'; most of these are closed word classes whose entries are easy to enumerate manually. These categories correspond to a lemma lexical category tag, which tags the lemma for lemmatization purposes. MorphInd also has a fine-grained tagset, inspired by the Penn Treebank tagset and adapted accordingly to Indonesian morphology. The tagset also adopts the concept of positional tags from the Prague Dependency Treebank tagset to cope with language behaviours that occur simultaneously in a surface word. The tagset contains morphological tags in three positions plus a lemma tag. The first position reflects the actual lexical category of the surface word, while the second and third tag positions give more specific linguistic information. Table 1 gives the complete tagset.

3.2 Analysis Format

We decided to make the output in the form of segmented morphemes, which shows how the morphemes are combined. This makes the output more precise and less ambiguous for the generation step. The surface word form is segmented into its morphemes. The lemma is directly followed by a lemma tag, which corresponds to the first position of the word form tag and is distinguished by lowercase. The lemma tag can differ from the first position of the word form tag of the same token because of derivation (see figure 3). This format makes it easy to extract the lemma if needed. The sequence of segmented morphemes, including the lemma tag, is followed by morphological tags as described in the tagset. Clitics, as they stand as semantically independent words, are treated as single word forms with their own analyses, but they are glued into the surface word's overall analysis as one of its morphemes. In this way, the fact that the morpheme was a clitic is still kept in the output. Figure 3 shows several word derivation output examples for the lemma v. kirim (v. send/deliver), as well as a word phrase with clitics.

Fig. 3. MorphInd derivation analysis examples (inputs: v. kirim 'send/deliver', v. mengirim 'send/deliver', n. kiriman 'package', n. pengiriman 'delivery' and ph. kumengirimkannya 'I send/deliver him/her')
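For illustration, assuming an output convention in which morphemes are joined with '+', the lemma is marked as lemma<x> with its lowercase lemma tag, and the word form carries an uppercase positional tag after '_' (the concrete strings below are illustrative, not verbatim MorphInd output), lemmata can be extracted with a single regular expression:

    import re

    # The lemma is the segment directly followed by a lowercase tag in
    # angle brackets; clitics carry their own lemma tags.
    LEMMA = re.compile(r'([^+<>_]+)<([a-z])>')

    def extract_lemmas(analysis):
        """Return (lemma, lemma_tag) pairs from one segmented analysis."""
        return LEMMA.findall(analysis)

    print(extract_lemmas('meN+kirim<v>_VSA'))
    # [('kirim', 'v')]
    print(extract_lemmas('aku<p>_PS1+meN+kirim<v>+kan_VSA+dia<p>_PS3'))
    # [('aku', 'p'), ('kirim', 'v'), ('dia', 'p')]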
Table 1. MorphInd Tagset

1st position                   2nd position            3rd position
NN   Noun                      PL  Plural              F   Feminine
NNP  Proper noun               SG  Singular            M   Masculine
                                                       D   Non-specified
PRP  Personal pronoun          PL  Plural              1   First person
                               SG  Singular            2   Second person
                                                       3   Third person
VB   Verb                      PL  Plural              AV  Active voice
                               SG  Singular            PV  Passive voice
CD   Numeral                   C   Cardinal numeral
                               O   Ordinal numeral
                               D   Collective numeral
CC   Coordinating conjunction
SC   Subordinative conjunction
JJ   Adjective                 P   Positive
                               S   Superlative
FW   Foreign word
IN   Preposition
MD   Modal
DT   Determiner
RB   Adverb
RP   Particle
NEG  Negation
UH   Interjection
COP  Copula
WH   Question
UNK  Unknown
3.3 Morphosyntactic and Morphophonemic Operations

Indonesian is not an inflected language in the way Slavic languages are, although several morphemes carry language features, such as verb conjugation to mark active and passive voice, or noun declination to mark gender (the latter is no longer productive and does not interact with the grammar through, e.g., gender agreement). Indonesian is a mildly agglutinative language compared to Finnish or Turkish, where the morpheme-per-word ratio is higher. Several common subject or object pronouns of a sentence event can be represented as clitics (proclitics and enclitics). Most morphological phenomena are word derivations realized by concatenative affixation operations (prefix, suffix, circumfix, and infix); examples of these affixations can be seen in figure 3. MorphInd handles infixation differently, by putting the surface word as one of the entries in the dictionary, since infixation is no longer common in Indonesian. For example, the word n. gerigi (n. teeth), which is the word n. gigi (n. tooth) with the er infix in the arrangement g+er+igi, is defined in the dictionary and marked as plural. The word
Fig. 4. MorphInd plural form examples (inputs: n. gerigi 'teeth', n. gigi-gigi 'teeth', n. 2 buku '2 books' (lit. *2 book), n. dua buku 'two books' (lit. *two book), n. buku-buku 'books' and n. *2 buku-buku (lit. two books))

Fig. 5. MorphInd numeral alternation examples (inputs: num. 2, num. dua 'two', num. ke-2 'second' and num. kedua 'second')
n. gerigi is no longer common and has the word n. gigi-gigi as its equivalent; both analyses can be seen in figure 4. There are no feature agreements, except the numerical agreement requiring a noun to take the singular form when preceded by a plural numeral, e.g., dua buku (lit. *two book). In this case MorphInd only works at the level of single word tokens and does not capture the plurality of the whole phrase. Figure 5 also gives examples of numeral alternations.

Deriving nouns, adjectives, verbs, and numerals constitutes the most productive derivational morphosyntactic and morphophonemic operations. This also includes the non-concatenative morphology operation, i.e., the reduplication that marks plurality. We designed the finite state architecture in a more organized way, separating the alternations based on those categories and on their affixation segments. The schema (without reduplication) is provided in table 2. We reuse the morphophonemic rules from IndMA, since those rules cover most of the cases, and we ported and organized all the morphosyntactic rules. In addition, we added more rules, such as further affix concatenation rules, handling of clitics (proclitics and enclitics), additional particles (e.g., -lah, -kah, -tah, and -pun), and several additional compound word morphemes (e.g., antar- and anti-). The general MorphInd finite state schema can be found in figure 6.

3.4 Software License

IndMA uses the compile-replace function provided by XFST to handle reduplication: it copies the marked morpheme that is to be reduplicated during compilation. This function is patent-encumbered, which limits the usage of the tool. To loosen
Table 2. Nouns, adjectives, verbs, and numerals alternation schema

                        Preprefix          Prefix                       Lemma     Suffix
Noun Alternation        ε+ anti+ antar+    ε+ peN+ ke+ per+ ke+tidak+   [lemma]   +ε +an +wan +wati
Adjective Alternation   ε+ non+            ε+ ter+ ke+ se+              [lemma]   +ε +an +nya
Verb Alternation        ε+ per+            ε+ meN+ di+ ber+             [lemma]   +ε +kan +i
Numeral Alternation                        ε+ ke+ ber+                  [lemma]   +ε +nya +belas
Fig. 6. MorphInd general finite state architecture schema
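Setting aside reduplication and the morphophonemic alternations, the concatenative part of this schema is simply a product over the four slots; a minimal Python sketch for the noun row of table 2 (ε is the empty string, affix inventories as given there):

    from itertools import product

    PREPREFIXES = ['', 'anti', 'antar']
    PREFIXES = ['', 'peN', 'ke', 'per', 'ketidak']
    SUFFIXES = ['', 'an', 'wan', 'wati']

    def noun_paths(lemma):
        """Enumerate the concatenative paths, before alternation rules apply."""
        return [pp + p + lemma + s
                for pp, p, s in product(PREPREFIXES, PREFIXES, SUFFIXES)]

    print(len(noun_paths('kirim')))  # 3 * 5 * 4 = 60 candidate paths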
the license, we decided not to use that function and instead tweaked the reduplication process. We also use the Foma toolkit [6] instead of XFST to compile the tool, so that MorphInd can fall under an open source license. Foma, which is released under the GNU General Public License, works in a similar way to XFST and accepts XFST/lexc code; therefore several parts of the IndMA source code could easily be reused as needed. The tweak is done by pairing every marked morpheme with any string and discarding all the pairs that are not identical. This causes the compilation time and memory consumption of the finite state network to explode.
To handle this, we limit the size of the lexical entries by splitting them into several parts and compiling them as separate networks. All the resulting finite state networks are then wrapped together by a Perl script to build the tool.
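The wrapper logic can be sketched in a few lines; the following Python equivalent of such a script queries every compiled network with foma's flookup utility and merges the results (network file names are illustrative):

    import subprocess

    # Compiled sub-networks of the split lexicon.
    NETWORKS = ['morphind-1.fst', 'morphind-2.fst', 'morphind-3.fst']

    def analyse(word):
        """Query each network and merge the analyses."""
        analyses = set()
        for net in NETWORKS:
            out = subprocess.run(['flookup', net], input=word + '\n',
                                 capture_output=True, text=True).stdout
            for line in out.splitlines():
                surface, _, analysis = line.partition('\t')
                if analysis and analysis != '+?':  # '+?' marks an unknown word
                    analyses.add(analysis)
        return analyses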
4 Research on Indonesian

4.1 Linguistic Tools

Work on developing language resources for Indonesian has been far less lively than work on developing linguistic tools. There has been work on an Indonesian online dictionary [7], but its resources are not freely available; the entries are equipped with linguistic and anthropological information. There is also a still ongoing project on developing an Indonesian wordnet [8]. The development of Indonesian linguistic tools, on the other hand, is surprisingly popular and has been pursued with different approaches. Besides the work on morphological analyzers, there is work on Indonesian probabilistic part-of-speech taggers [9,10]. On the syntactic level, there is work on an Indonesian rule-based parser using PC-PATR [11], which relies on annotated lexical entries; this tool was later also used to train a probabilistic parser from the parse trees it produces [12]. Even though the groundwork for further processing is not yet properly established, this has not stopped researchers from building semantic tools, such as semantic analyzers [13,14].

4.2 Indonesian Corpus Plan

Since no Indonesian linguistic corpora are available, we took the initiative to collect Indonesian texts and prepare them for further linguistic processing. Although the corpus is not required to be parallel, we prefer Indonesian text that is aligned with English text. We mainly collected the texts from the output of the PAN Localization project [15] and from subtitles. Currently the Indonesian part consists of 45,011 sentences. Statistics on the corpus sources are given in figure 7. As an initial plan, the final corpus will be in XML format following the PML schema [16], with several different layers such as morphology, syntax, etc. MorphInd will fill the morphological layer of the corpus. As the plan continues, we hope to arrive at an Indonesian-English parallel treebank corpus.
5 Evaluation

Test Set. We ran MorphInd and IndMA on the Indonesian text we collected in order to measure their coverage. We made two types of test sets, with 5,000 sentences (T5K) and 10,000 sentences (T10K) each. There are nine sets of T5K and four sets of T10K. The sentences in a test set were chosen randomly, without replacement, from the collected text (see section 4.2).
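One way to draw such sets, sampling without replacement, is sketched below (the corpus list is a stand-in for the collected text):

    import random

    def make_test_sets(sentences, n_sets, size, seed=0):
        """Draw n_sets disjoint test sets of `size` sentences each."""
        pool = list(sentences)
        random.Random(seed).shuffle(pool)
        # consecutive, disjoint slices = sampling without replacement
        return [pool[i * size:(i + 1) * size] for i in range(n_sets)]

    corpus = ['sentence %d' % i for i in range(45011)]
    t5k = make_test_sets(corpus, 9, 5000)     # nine T5K sets
    t10k = make_test_sets(corpus, 4, 10000)   # four T10K sets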
Fig. 7. Indonesian Parallel Corpus Source Statistic
Table 3. MorphInd lexical entries

Noun                        2,222    Numeral          19    Particle         4
Verb                          924    Adjective       576    Negation         3
Personal pronoun               13    Foreign word      0    Interjection    11
Coordinating conjunction        3    Preposition      16    Copula           3
Subordinating conjunction      32    Modal             6    Question         6
Determiner                     15    Adverb           89    TOTAL        3,942
Metric. We used coverage as our metric, measured in two ways: overall and unique. Overall coverage is the ratio of the number of word tokens analyzed to the number of word tokens in the text. Unique coverage is the ratio of the number of different word forms analyzed to the number of different word forms in the text.

Experiments. MorphInd consists of 3,942 lexical entries divided into 17 lexical categories. We did not port all the entries available in IndMA; IndMA has more entries, but several of them overlap across the categories or occur in affixed forms. We also rebuilt IndMA with the same lexical entries as MorphInd to make a comparable experiment (hereinafter called IndMA-Comparable). Details of the tools' lexical entries can be found in tables 3 and 4. The resulting comparison of the three tools can be seen in table 5. MorphInd failed to outperform IndMA in unique coverage, since the numbers of lexical entries differ greatly and the MorphInd lexical entries do not include proper nouns and foreign words. But with a good selection of the lexical entries, choosing the
Table 4. IndMA and IndMA-Comparable lexical entries

            IndMA     IndMA-Comparable
Noun         5,863     2,222
Verb         3,417       924
Adjective   19,036       576
Etc          4,153       220
TOTAL       32,469     3,942
Table 5. Evaluation

                    Test Sets   # Sentences   Overall       Unique
MorphInd            T5K         5,000         84.69±0.28    50.77±0.70
                    T10K        10,000        84.61±0.10    47.19±0.35
IndMA               T5K         5,000         83.62±0.27    54.95±0.76
                    T10K        10,000        83.46±0.06    51.39±0.05
IndMA-Comparable    T5K         5,000         81.91±0.18    44.60±0.66
                    T10K        10,000        81.82±0.06    40.83±0.31
most frequent and productive lemmas, MorphInd's overall coverage became greater than IndMA's. This is because MorphInd covers clitics, numeral alternations, and additional particle morphemes that are not covered by IndMA. This can easily be seen in the results for MorphInd and IndMA-Comparable, where MorphInd achieved better coverage with the same lexical entries.
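The two coverage figures in table 5 differ only in whether tokens or types are counted; a minimal sketch, assuming is_analysed(w) tells whether the analyser covers w:

    def overall_and_unique_coverage(tokens, is_analysed):
        """Token-level (overall) and type-level (unique) coverage."""
        overall = sum(is_analysed(t) for t in tokens) / len(tokens)
        types = set(tokens)
        unique = sum(is_analysed(t) for t in types) / len(types)
        return overall, unique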
6 Conclusion and Future Work

MorphInd produces robust morphological information in its output format, i.e., morphemic segmentation, lemma morpheme position, lexical category, and morphological features. The new tagset, with its broader categorization, is also suitable for further language processing such as parsing. MorphInd gives better coverage than IndMA. The most current version of MorphInd, including documentation, binaries, and source code, can be found at the MorphInd homepage. For future improvements, we will investigate more morpheme behaviour to add to MorphInd, such as morphophonemic affixation exceptions for one-syllable words. As initially planned, this tool will enrich the morphological layer of the Indonesian corpus. We will also build an initial parser based on MorphInd's output.

Acknowledgement. This project was financially supported by the grant LC536 Centrum Komputační Lingvistiky of the Czech Ministry of Education.
References

1. Pisceldo, F., Mahendra, R., Manurung, R., Arka, I.W.: A Two-Level Morphological Analyser for Indonesian. In: Australasian Language Technology (ALTA) Workshop 2008, Tasmania (2008)
2. Siregar, N.: Pencarian Kata Berimbuhan pada Kamus Besar Bahasa Indonesia dengan menggunakan Algoritma Stemming. Undergraduate thesis, Faculty of Computer Science, University of Indonesia (1995)
3. Adriani, M., Jelita, A., Nazief, S.B., Tahaghoghi, M., Williams, H.: Stemming Indonesian: A Confix-Stripping Approach. ACM Transactions on Asian Language Information Processing 6(4) (2007)
4. Hartono, H.: Pengembangan Pengurai Morfologi untuk Bahasa Indonesia dengan Model Morfologi Dua Tingkat Berbasiskan PC-KIMMO. Undergraduate thesis, Faculty of Computer Science, University of Indonesia (2002)
5. Beesley, K.R., Karttunen, L.: Finite State Morphology. CSLI Publications, Palo Alto (2003)
6. Hulden, M.: Foma: a finite-state compiler and library. In: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics: Demonstrations Session, Athens, Greece, pp. 29–32 (2009)
7. Pusat Bahasa: Kamus Besar Bahasa Indonesia Daring (2008) (last access: February 14, 2011)
8. Darma Putra, D., Arfan, A., Manurung, R.: Building an Indonesian Wordnet. In: The Second International MALINDO Workshop (2008) (last access: February 14, 2011)
9. Pisceldo, F., Manurung, R., Adriani, M.: Probabilistic Part-of-Speech Tagging for Bahasa Indonesia. In: The Third International MALINDO Workshop, colocated with ACL-IJCNLP 2009, Singapore, August 1 (2009)
10. Farizki Wicaksono, A., Purwarianti, A.: HMM Based Part-of-Speech Tagger for Bahasa Indonesia. In: The Fourth International MALINDO Workshop, Jakarta, Indonesia (2010)
11. Joice: Pengembangan lanjut pengurai struktur kalimat bahasa Indonesia yang menggunakan constraint-based formalism. Undergraduate thesis, Faculty of Computer Science, University of Indonesia (2002)
12. Hari Gusmita, R., Manurung, R.: Some Initial Experiments with Indonesian Probabilistic Parsing. In: The Second International MALINDO Workshop (2008)
13. Dian Larasati, S., Manurung, R.: Towards a Semantic Analysis of Bahasa Indonesia for Question Answering. In: Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING 2007) (2007)
14. Mahendra, R., Dian Larasati, S., Manurung, R.: Extending an Indonesian Semantic Analysis-based Question Answering System with Linguistic and World Knowledge Axioms. In: Proceedings of the 22nd Pacific Asia Conference on Language, Information, and Computation (PACLIC 2008), pp. 262–271 (2008)
15. PAN Localization project (last access: February 14, 2011)
16. Prague Markup Language (PML) (last access: February 14, 2011)
Morphology Generation for Swiss German Dialects

Yves Scherrer

LATL, Université de Genève, Genève, Switzerland
Abstract. Most work in natural language processing is geared towards written, standardized language varieties. In this paper, we present a morphology generator that is able to handle continuous linguistic variation, as it is encountered in the dialect landscape of German-speaking Switzerland. The generator derives inflected dialect forms from Standard German input. Besides generation of inflectional affixes, this system also deals with the phonetic adaptation of cognate stems and with lexical substitution of non-cognate stems. Most of its rules are parametrized by probability maps extracted from a dialectological atlas, thereby providing a large dialectal coverage. Keywords: Swiss German, dialects, cognate words, morphological generation.
1 Introduction

Most work in natural language processing is geared towards written, standardized language varieties. This focus is generally justified on practical grounds of data availability and socio-economic relevance, but it does not always reflect the linguistic reality of substandard varieties. In this paper, we present a morphology generator that is able to handle continuous linguistic variation, as it is encountered in various dialect landscapes. The work presented here is applied to Swiss German dialects; these dialects are well documented by dialectological research and are among the most vital in Europe in terms of social acceptance and media exposure.

The task of Swiss German word generation can be formulated as follows: Given a Standard German root and a set of morphosyntactic features, generate all inflected forms that are valid in the different Swiss German dialects. Our approach can be qualified as cross-lingual and multi-dialectal. It is cross-lingual in the sense that the language variety of the input root (Standard German) is different from the language variety of the output forms (Swiss German). It is multi-dialectal because it aims to generate all forms that occur in the different dialects of German-speaking Switzerland, relying on existing dialectological resources. Hence, the proposed system is more than just a morphological generator: it is a word translation engine that relies on the numerous structural similarities between Standard German and Swiss German.

In the following section, we will briefly describe some linguistic characteristics of Swiss German dialects. In section 3, the general system architecture will be described
and illustrated with examples. Section 4 will present some problematic cases that arise from the specific multi-dialectal conception of our model. We show some coverage figures in section 5, and we conclude in section 6.
2 Swiss German Dialects

The German-speaking area of Switzerland encompasses the northeastern two thirds of the Swiss territory. Likewise, about two thirds of the Swiss population declare (any variety of) German as their first language. It is usually admitted that the sociolinguistic configuration of German-speaking Switzerland is a model case of diglossia, i.e., an environment in which two linguistic varieties are used complementarily in functionally different contexts. In German-speaking Switzerland, dialects are used in speech, while Standard German is used nearly exclusively in written contexts. Despite the preference for spoken dialect use, written dialect use has become popular in electronic media like blogs, SMS, e-mail and chatrooms. The Alemannic Wikipedia contains about 6000 articles, among which many are written in a Swiss German dialect.1 However, all this data is very heterogeneous in terms of the dialects used, spelling conventions and genres.

The classification of Swiss German dialects is commonly based on administrative and topographical criteria. Although these non-linguistic borders have influenced dialects to various degrees, the resulting classification does not always match the linguistic reality. Our approach does not presuppose any dialect classification. We conceive of the Swiss German dialect area as a continuum in which certain phenomena show more clear-cut borders than others. The nature of dialect borders is to be inferred from the data.2

Swiss German has been subject to dialectological research since the beginning of the 20th century. One of the major contributions is the Sprachatlas der deutschen Schweiz (SDS), a linguistic atlas that covers phonetic, morphological and lexical differences. Data collection and publication were carried out between 1939 and 1997 [8]. There also exist grammars and lexicons for specific dialects (e.g., [4,11,6,5]), as well as general presentations of Swiss German [10,12]. On all levels of linguistic analysis, there are differences between Standard German and Swiss German, as well as among the various Swiss German dialects. The examples given in the following sections will illustrate some of these differences.
3 General System Architecture
As sketched out above, the morphological generator for Swiss German takes as input a Standard German root and a set of features that determine the inflected form to be
generated. These features include part-of-speech tags, morphological tags like gender, number and person, as well as lexical information like inflection class.
Full dialect forms are generated in two steps. First, a dialect root is obtained by applying phonetic and lexical transformations to the Standard German root.³ Second, the inflected dialectal form is obtained by adding affixes to the dialectal root, according to the feature set given in the input. For example, the Standard German verbal root such- ‘to search’ will trigger the following root transformation rules for the Graubünden dialect:
– u → u (not ü)
– u → ue
– e → a (in diphthong)
The result of the first step is thus the dialect root suach-. In order to generate the 3rd person plural form, the feature set will trigger the affixation rule with the Graubünden dialect suffix -end.⁴ The result of the second step is thus the inflected form suachend.
In most common settings of morphology generation, the first step is not required. Morphology generators are usually conceived as monolingual tools, where the language variety of the input is the same as the language variety of the output. This contrasts with our setting, where the input root is not identical to the root of the output form: it can undergo phonetic and lexical transformations depending on the dialect. Hence, additional transformations have to be executed before affix generation even starts. This is a consequence of our cross-lingual approach.
Our aim of multi-dialectal coverage leads to further complications. In most cases, one Standard German root will yield several dialect roots, each of which is valid in a different region. Likewise, dialectally different affixes can be added to each dialect root. Therefore, one generation query will usually yield a long list of candidate forms. However, all candidates are associated with maps, extracted from the SDS atlas, that describe the geographic area in which they are valid (see figure 1). These maps allow us to prune the candidate lists according to a specific target dialect.
Moreover, the first and second steps cannot always be clearly separated. In the case of irregular inflection or suppletive forms, the most efficient approach is to combine the two steps; section 4.2 will show an example of this.
The system presented here is implemented in the form of a database that contains different types of transformation rules, which are applied in cascade with the help of Python scripts. While this approach allows for easy debugging of the rule base, it is not as efficient as an implementation with finite-state transducers. Moreover, transducers could be used in both directions, for analysis and generation. Such an implementation is planned for future work.
Our work does not currently use machine learning techniques. There are two main reasons for this methodological choice. First, the dialectological atlas used as our primary resource already contains linguistically interpreted data: the legend of each map specifies the rule and its conditions of applicability in a relatively explicit way.
³ Here, we simply define the root of a word as identical to its citation form, except for verbs, where specific infinitive affixes are stripped off.
⁴ We use the STTS tag set as defined for Standard German [15].
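To make the two-step process concrete, the following is a minimal Python sketch of the cascade for the such- example above. The rule encoding, its ordering, and all names are illustrative assumptions; the actual system keeps its rules in a database and applies them with separate scripts.

```python
import re

ROOT_RULES = [
    # (pattern, replacement); the identity rule "u stays u (not ü)" is
    # implicit in this encoding and therefore omitted
    (r"u", r"ue"),    # diphthongization: u -> ue
    (r"ue", r"ua"),   # e -> a inside the new diphthong
]

def generate(std_root: str, suffix: str) -> str:
    """Step 1: transform the Standard German root into a dialect root;
    step 2: attach the dialect affix."""
    root = std_root
    for pattern, repl in ROOT_RULES:
        root = re.sub(pattern, repl, root)
    return root + suffix

# such- with the Graubünden 3rd person plural suffix -end
print(generate("such", "end"))   # -> suachend
```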
Fig. 1. Maps defining the distribution of the weak nominative singular adjective suffixes. Black surfaces represent high probabilities, white surfaces represent low probabilities. The -i suffix appears in Western dialects (black surface in the left map), the null suffix appears in Eastern dialects (black surface in the right map). These maps have been digitized from the SDS map III/254.
By using hand-written rules, we can fully take advantage of these data. Second, machine learning approaches⁵ require a lot of training data, which is notoriously hard to find for small-scale language varieties like Swiss German dialects. The problem is exacerbated by our multi-dialect approach, where distinct training corpora would be required for each dialect.
In the following subsections, we present the architecture of the database that contains the transformation rules.
3.1 Variables and Variants
The structure of our database relies on the dialectological distinction between the concepts of variable and variant [2, p. 49ff]. A variable is any linguistic phenomenon whose realisation varies along the geographical axis. The different realisations are called variants.⁶ For example, the suffix of weak nominative singular adjectives is a variable in Swiss German dialects. Its variants are the null suffix in Eastern dialects (e.g., di schwarz Chatz ‘the black cat’) and the -i suffix in Western dialects (e.g., di schwarzi Chatz).
Each variant is associated with a probability map that shows its geographic extent. The maps are extracted from the SDS atlas [8]. The SDS maps contain discrete values at a limited number of inquiry points. These values are converted to a continuous surface by interpolation, such that the grey-scale value at each pixel of the surface represents the probability of a variant at that pixel. For any variable, the maps of all its variants are complementary, i.e., the probability values of all maps at each pixel sum to 1. More details about the interpolation method can be found in [13] and [14]. Figure 1 shows the probability maps for the example given above.
⁵ For a recent overview of unsupervised learning techniques for morphology, see [7].
⁶ The distinction between variables and variants is a consequence of the multi-dialect generation approach. In a (deterministic) single-dialect system, there is a one-to-one mapping between variables and variants, which makes the distinction irrelevant.
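The variable/variant data model and the complementarity constraint can be sketched as follows. This is a toy illustration under invented names and probabilities: real maps are interpolated rasters over many pixels, not two-entry dictionaries.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    form: str        # e.g. the -i suffix or the null suffix
    prob_map: dict   # pixel -> probability of this variant

@dataclass
class Variable:
    name: str
    variants: list

# The weak nominative singular adjective suffix variable from figure 1.
adj_suffix = Variable("adj-weak-nom-sg", [
    Variant("i", {"west": 0.9, "east": 0.1}),   # illustrative probabilities
    Variant("",  {"west": 0.1, "east": 0.9}),
])

# Complementarity: for each pixel, the variant probabilities sum to 1.
for pixel in ("west", "east"):
    total = sum(v.prob_map[pixel] for v in adj_suffix.variants)
    assert abs(total - 1.0) < 1e-9

def variants_at(variable: Variable, pixel: str, threshold: float = 0.1):
    """Prune candidates to those sufficiently probable at a location."""
    return [v.form for v in variable.variants
            if v.prob_map.get(pixel, 0.0) >= threshold]

print(variants_at(adj_suffix, "west"))  # ['i', ''] at a 10% threshold
```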
Table 1. Detail of the database tables for phonetic transformations. The example shows the rule 2-120-nd that transforms post-vocalic word-final nd into one of the four variants ng (Bern), nn (Fribourg), nt (Wallis), or nd (other dialects).
Variables table:
  name       regular expression   order   map path
  2-120-nd   [aeiouäöü](nd)$      101     2-120

Variants table:
  variable   replacement   map file
  2-120-nd   ng            dp_ng
  2-120-nd   nn            dp_nn
  2-120-nd   nt            dp_nt
  2-120-nd   nd            dp_nd
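Applying the rule shown in table 1 could look like the following sketch. The regular expression is adapted from the table so that the preceding vowel is captured and preserved; the test words and the data layout are illustrative.

```python
import re

# Rule 2-120-nd: post-vocalic word-final 'nd' becomes one of ng, nn, nt,
# or nd, each variant tied to its own probability map.
CONTEXT = r"([aeiouäöü])nd$"          # vowel captured so it can be kept
VARIANTS = ["ng", "nn", "nt", "nd"]

def apply_rule(word: str):
    """Return all dialectal candidates produced by the rule, or the word
    itself if the context does not match."""
    if not re.search(CONTEXT, word):
        return [word]
    return [re.sub(CONTEXT, r"\1" + v, word) for v in VARIANTS]

print(apply_rule("Hund"))   # ['Hung', 'Hunn', 'Hunt', 'Hund']
print(apply_rule("Berg"))   # ['Berg'] -- rule does not apply
```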
3.2 Phonetic Transformations
Most Swiss German words are cognates of Standard German words. Hence, regular phonetic⁷ transformations allow us to derive many Swiss German word roots from their Standard German counterparts. Phonetic transformation rules are stored in two database tables, one for variables and one for variants. Example entries are given in table 1.⁸ Each phonetic transformation variable is characterized by a name, a regular expression that identifies its contexts of application, an integer determining the rule order, and a file system path in which the corresponding maps are to be found. Each variable has one or more variants, which are linked to it by its name. Variants are defined by the string that replaces the group matched by the regular expression and by the file name of the corresponding map. Currently, 135 phonetic variables are implemented; they correspond to 314 variants.
3.3 Lexical Replacement
Cognate words can be derived from Standard German stems with the help of the phonetic rules described above. However, there are cases where dialects use stems with a different etymological origin. In other cases, the dialectal form bears only a vague phonetic resemblance to its Standard German counterpart, one that would be difficult to capture with a phonetic rule.
⁷ In the presence of multiple dialects, it is somewhat difficult to distinguish phonetic from phonological phenomena. A specific sound difference may have phonemic value in one dialect, but not in another. Hence, the same sound law would be classified under phonetics in one dialect, and under phonology in another. In this paper, we use phonetic transformation as a generic term for both types of transformations.
⁸ This table, as well as the following ones, is to be read as follows: the first line describes the structure of the table, and the entries below the double line show examples.
Table 2. Detail of the database tables for lexical substitution. The first example shows the dialectal variants of Standard German nichts ‘nothing’, whose idiosyncratic behavior is difficult to capture with phonetic rules. Its first two entries generate the forms nüüt and nüt, which are further transformed to niit and nit with the help of the regular phonetic rules üü-ii and ü-i in some dialects. The second example refers to the translation of Standard German immer ‘always’, where completely different lexemes are used throughout the Swiss German dialect area.
Variables table:
  name     lemma    POS   map path
  nichts   nichts   PIS   4-171-nichts
  immer    immer    ADV   6-026-immer

Variants table:
  variable   replacement   map file    phonetic rules
  nichts     nüüt          dp_nüüt     üü-ii
  nichts     nüt           dp_nüt      ü-i
  nichts     nünt          dp_nünt
  nichts     nütz          dp_nütz
  nichts     nix           dp_nix
  immer      immer         dp_immer
  immer      geng          dp_geng
  immer      all           dp_all
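A sketch of how a lexical variant can trigger a restricted subset of phonetic rules, following the nichts example in table 2 above. The rule names come from the table; the data structures and function are invented for illustration.

```python
import re

PHONETIC_RULES = {
    "üü-ii": (r"üü", "ii"),
    "ü-i":   (r"ü", "i"),
}

LEXICAL_VARIANTS = [
    # (replacement stem, phonetic rules applied after substitution)
    ("nüüt", ["üü-ii"]),
    ("nüt",  ["ü-i"]),
    ("nünt", []), ("nütz", []), ("nix", []),
]

def expand(variants):
    """Collect each base variant plus its regular phonetic derivatives."""
    forms = []
    for stem, rules in variants:
        forms.append(stem)
        for name in rules:
            pattern, repl = PHONETIC_RULES[name]
            forms.append(re.sub(pattern, repl, stem))
    return forms

print(expand(LEXICAL_VARIANTS))
# ['nüüt', 'niit', 'nüt', 'nit', 'nünt', 'nütz', 'nix']
```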
Therefore, we introduce lexical replacement rules, which are again defined in a variables table and a variants table. Examples are given in table 2.
The variables table specifies the lemma of a word and its part-of-speech tag, but it can also contain finer-grained morphological information; the latter is mainly used for irregular inflection patterns (see section 4.2). The variants table specifies a stem that completely replaces the Standard German stem. It also makes it possible to modify morphological features; this functionality is used, for example, to change the gender tag when a masculine noun lexeme is replaced by a feminine noun lexeme. A further field specifies a subset of phonetic rules to be applied after lexical substitution. This functionality is illustrated in the first example (table 2), where two additional variants niit and nit exist: the distinction between ü-based and i-based variants is completely regular and already accounted for by a phonetic rule, so this phonetic rule is added to the corresponding variants.
Currently, the database contains 260 lexical variables and 559 lexical variants. Most of them are high-frequency adverbs, pronouns and irregular verbs.
3.4 Affix Generation
Two further tables, again one for variables and one for variants, define the inflectional affixes for regular noun, verb, adjective and pronoun inflection. While most rules deal with simple suffixation, more complex affixation types are also supported. Table 3 shows examples of noun plural formation.
Table 3. Detail of the affix generation tables, showing selected rules for noun plurals. Rule n0-uml adds umlaut to the stem vowel independently of the dialect chosen. Rules n1-e and n1-er add suffixes depending on the phonetic environment specified by their regular expressions. Rule n1-ene illustrates the use of dialect-dependent suffixes, with map information given in the map path and map file fields. All rules depend on inflection class information as specified by the NCl tags. The symbol ∼ is substituted by n when the following word starts with a vowel, and dropped otherwise.
Variables table:
  name     POS   features                          map path
  n0-uml   NN    Pl,NCl_uml,NCl_uml_er,NCl_uml_e
  n1-e     NN    Pl,NCl_e,NCl_uml_e
  n1-er    NN    Pl,NCl_uml_er
  n1-ene   NN    Pl,NCl_ene                        3-187-ene

Variants table:
  variable   context     affix   function   map file
  n0-uml                         umlaut
  n1-e       ([^i])e?$   \1e∼
  n1-er      (er)?$      er
  n1-ene     (i)$        ine                dp_ine
  n1-ene     (i)$        ene                dp_ene
  n1-ene     (i)$        eni                dp_eni
There are three ways of adding an affix to a stem. The simplest one is to add a suffix: in this case, the suffix column contains the suffix, and the other columns remain empty. The second possibility removes some material before adding the suffix: the material to be removed is specified by a regular expression. If regular expressions are not practical or not powerful enough, there is a third possibility: one can specify a particular affixation function (written in Python). This functionality is used, for example, to add umlaut (first row in table 3).
In the example of table 3, the different suffixes are selected by the NCl feature, which has been assigned to the nouns on the basis of their Standard German inflection class. However, these features can be changed. For example, Swiss German dialects tend to use umlaut plural marking more often than Standard German (e.g., Hunde — Hünd ‘dogs’, Pullis — Pülli ‘sweaters’).
Currently, the database contains 82 affix variables and 165 variants. They cover the inflectional paradigms of adjectives, nouns, regular verbs, determiners, and preposition-determiner combinations.
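The three affixation mechanisms can be sketched as follows. The umlaut helper is a deliberate simplification of whatever the real Python function does, and all names are illustrative.

```python
import re

UMLAUT = str.maketrans("aou", "äöü")

def add_umlaut(stem: str) -> str:
    """Affixation function: umlaut the last back stem vowel (simplified)."""
    for i in range(len(stem) - 1, -1, -1):
        if stem[i] in "aou":
            return stem[:i] + stem[i].translate(UMLAUT) + stem[i + 1:]
    return stem

def apply_affix(stem, suffix=None, remove=None, function=None):
    if function is not None:           # third mechanism: arbitrary function
        return function(stem)
    if remove is not None:             # second mechanism: strip material first
        stem = re.sub(remove, "", stem)
    return stem + (suffix or "")       # first mechanism: plain suffixation

print(apply_affix("Hund", function=add_umlaut))   # Hünd (cf. Hunde — Hünd)
print(apply_affix("suach", suffix="end"))         # suachend
```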
4 Problematic Cases
In this section, we describe specific problems that arise, on the one hand, from the multi-dialect generation approach and, on the other hand, from linguistic particularities of the Swiss German dialects.
4.1 Lexical Restrictions of Phonetic Transformations
Dialect evolution sometimes yields unpredictable results. Some rules apply to certain words but not to others, without there being a clearly identifiable cause. For example, the stem vowel in the words Gras ‘grass’, sparen ‘to save’, Arbeit ‘work’ and Axt ‘axe’ is changed to ä in some Northeastern dialects. It is difficult to generalize a phonetic context that might trigger these transformations. Therefore, we chose to use a “whitelist” which enumerates all the lemmas that undergo this transformation.
In other cases, the cause of a specific evolution is known, but it is difficult to detect for practical reasons. For example, the two Middle High German vowels û and ou have fallen together in Modern Standard German au, but have remained distinct in Swiss German (uu and au/ou, respectively). In a model based on Standard German input, it is thus impossible to predict the correct Swiss German form: no phonetic cue tells us that Standard German Haus ‘house’ should become Huus, but that Standard German Baum ‘tree’ should remain Baum. Again, we use a whitelist to enumerate the lemmas in either class.
Currently, the whitelist contains a total of 13,000 lemmas associated with 39 rules. It has been compiled by a native dialect speaker on the basis of the Derewo lemma list [9]. For other rules, a blacklist (words that are excluded from the rule) was more practical; it contains 450 lemmas associated with 3 rules.
4.2 Short Verbs
Short verbs can be defined as verbs that have a monosyllabic infinitive form. While (written) Standard German has only two short verbs (sein and tun), Swiss German dialects have about a dozen of them. These verbs are characterized by short, irregular forms and rather obscure morpheme boundaries between stem and affix. As an additional difficulty, according to SDS data, there are Northwestern dialects in which short verbs are inflected like regular verbs.
In section 3.3, we presented the tables for lexical substitution. For cases of suppletive morphology, we use these tables as well: the field containing finer-grained morphological information allows us to restrict the rules to certain inflected forms. The example presented in table 4 shows the different plural forms of the verb gehen ‘to go’. The first variant generates a short stem gö and adds the feature KurzV, which triggers an affixation rule common to all short verbs, yielding göö, göi, gönd, gön, etc. The second and third variants generate long stems and add the feature RegV, which triggers the regular verb inflection rules, yielding gange or gönge.
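As a sketch, the whitelist mechanism for the û/ou merger might look like this. The lemma sets and the rule shape are illustrative assumptions, not the system's actual data.

```python
import re

# Lemmas whose 'au' goes back to MHG û and therefore becomes 'uu';
# lemmas not listed keep 'au' (MHG ou), e.g. Baum.
UU_WHITELIST = {"Haus", "Maus"}

def au_rule(lemma: str) -> str:
    """Apply the au -> uu change only to whitelisted lemmas."""
    if lemma in UU_WHITELIST:
        return re.sub(r"au", "uu", lemma)
    return lemma

print(au_rule("Haus"))   # Huus
print(au_rule("Baum"))   # Baum
```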
5 Experiments
In this section, we report the results of a simple experiment intended to determine the coverage of the transformation rules.
The most straightforward evaluation method would be to start from a list of annotated Standard German inflected words and to evaluate the Swiss German output generated for several dialects. However, it is hard to obtain reliable acceptability judgements from
Table 4. Extract of the lexical substitution tables, showing the relevant entries for generating plural forms for the short verb gehen
Variables table:
  name       lemma   POS     features      map path
  gehen-pl   gehen   VVFIN   Pl,Pres,Ind   3-058-gehen-pl

Variants table:
  variable   replacement   map file   features   phonetic rules
  gehen-pl   gö            dp_gö      KurzV      kurzV-pl-vokal
  gehen-pl   gang          dp_gang    RegV
  gehen-pl   göng          dp_göng    RegV       ö-e
dialect speakers accustomed to highly variable spellings and pronunciations. Instead, we measured how many words of an existing multi-dialectal Swiss German corpus are analyzed correctly by our system.
The multi-dialect corpus consists of 100 sentences in five dialects, extracted from the Swiss German Wikipedia: Basel (BA), Bern (BE), Eastern Switzerland (OS), Wallis (WS), and Zürich (ZH). The dialect classification was done directly by the Wikipedia writers. These texts were then translated back to Standard German.
As our system is not conceived as an analyzer, we simulate this capability: starting from a Standard German word list, we generate a full-form dialect lexicon with the help of our generator. Analyzing a word from the corpus then amounts to looking it up in the full-form dialect lexicon. The (morphosyntactically annotated) Standard German word list has been extracted from the leaf nodes of the TIGER treebank [1]. With this approach, we can only recognize dialect words whose Standard German counterparts occur in the TIGER lexicon. As a result, many compound nouns and proper nouns are not recognized, even if the transformation rules would permit it. Because of this restriction, the maximum accuracy of our system lies at about 70% of word types and about 80% of word tokens.
We first obtained coverage figures without geographical filtering. In this scenario, when a Basel dialect word is analyzed, the system may also return derivations that are only valid in the region of Bern. The results are presented in the first row of table 5. Except for the notoriously difficult Wallis dialect, the figures are fairly consistent across dialects: about 40% of word types and about 60% of word tokens are analyzed correctly.
The second scenario involved geographical filtering, retaining only analyses that obtained a minimal probability⁹ of 10% in the most representative city of the respective dialect area.¹⁰ Results are given in the second row of table 5. With respect to the first scenario, there is only a slight performance drop (about 5% for types as well as tokens)
⁹ Recall that each rule comes with a probability map. When several rules are applied to a word, the respective probability maps are combined by pointwise multiplication.
¹⁰ The city of Basel for BA, the city of Bern for BE, St. Gallen for OS, Brig for WS, and Zürich for ZH.
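The evaluation setup, including the geographical filter from footnotes 9 and 10, can be sketched as follows. The lexicon entry format, the city code, and the probabilities are invented for illustration.

```python
from math import prod

def analyze(word, lexicon, city=None, threshold=0.1):
    """'Analysis' is lookup in a generated full-form lexicon; with a city,
    keep an analysis only if the product of its rules' map probabilities
    at that location reaches the threshold."""
    analyses = []
    for lemma, tags, maps in lexicon.get(word, []):
        if city is not None:
            p = prod(m.get(city, 0.0) for m in maps)  # pointwise multiplication
            if p < threshold:
                continue
        analyses.append((lemma, tags))
    return analyses

# Toy lexicon: dialect form -> [(Standard German lemma, tags, rule maps)]
lexicon = {"suachend": [("suchen", "VVFIN.3.Pl.Pres.Ind",
                         [{"GR": 0.95}, {"GR": 0.9}])]}
print(analyze("suachend", lexicon, city="GR"))
```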
Table 5. Percentages of correctly analyzed dialect words

                                        Types                       Tokens
                                 BA   BE   OS   WS   ZH      BA   BE   OS   WS   ZH
Without geographical filtering   42%  40%  41%  25%  45%     62%  57%  60%  44%  65%
With geographical filtering      37%  29%  27%  17%  40%     57%  47%  44%  30%  58%
for the Basel and Zürich dialects, while the three other dialects show performance drops ranging from 8% to 16%. The latter regions are larger and show more internal dialect variation than the former. In addition, the Wikipedia authors of the three latter regions probably use a dialect that diverges from the reference city dialect chosen for our evaluation.
Several types of errors are encountered. First, some errors are due to different spelling choices; indeed, the lack of binding spelling rules for Swiss German dialects makes the task difficult. For example, our system generated bestaat ‘consists’ for Zürich dialect, while the Wikipedia corpus contained bestaht. Both variants are pronounced identically; the former conforms to the Dieth spelling guidelines [3], while the latter is closer to Standard German spelling rules. We found that the Wikipedia authors prefer a spelling closer to Standard German especially for long, complex words.
Other errors are due to missing rules. For instance, Standard German Kirche ‘church’ is phonetically transformed to Chirche, while a lexical transformation should be used to obtain Chile in Zürich dialect. Likewise, some specific inflectional affixes for Wallis dialect have not been implemented, which partially explains the lower scores.
Another type of error is due to diachronic change. Standard German zeigt ‘shows’ yields zäägt, zaagt, zeigt in Eastern Swiss German. While all of these forms were widely used in that region in the 1950s (at the time of the SDS inquiries), they have become marginal today. The most frequently used form today is zaigt, which is indeed what we find in the Wikipedia texts.
6 Conclusion
We have presented a cross-lingual, multi-dialectal approach to word generation for Swiss German dialects. Cross-lingual word generation only makes sense if a large proportion of lexical pairs are cognates, and if the inventory of morphological and lexical features is fairly parallel across both language varieties. Because of the close etymological relationship between Modern Standard German and Swiss German, these conditions are met. Cross-lingual generation allows us to rely on existing resources for the source language, which are much more numerous than for the target dialects. Multi-dialectal coverage is achieved by using existing dialectological resources in the form of probability maps; to our knowledge, this is a novel line of research. Given these particularities and the largely manual creation of the rule base, we obtain respectable coverage figures of about 50% of tokens across several dialects.
The proposed set of transformation rules could be used as part of a machine translation system between Standard German and the Swiss German dialects. Other potential
applications include the morphosyntactic analysis of dialect texts in order to enhance information retrieval, and integration into speech recognition and synthesis systems. The latter point is especially interesting given the mainly spoken use of Swiss German dialects. In future work, we plan to improve the rules on the basis of a detailed error analysis. Furthermore, a reimplementation with a finite-state toolkit would provide numerous benefits, such as higher speed and bidirectionality.
References
1. Brants, S., Dipper, S., Hansen, S., Lezius, W., Smith, G.: The TIGER Treebank. In: Proceedings of the Workshop on Treebanks and Linguistic Theories, Sozopol (2002)
2. Chambers, J.K., Trudgill, P.: Dialectology, 2nd edn. Cambridge University Press, Cambridge (1998)
3. Dieth, E.: Schwyzertütschi Dialäktschrift, 2nd edn. Sauerländer, Aarau (1986)
4. Fischer, L.: Luzerndeutsche Grammatik. Ein Wegweiser zur guten Mundart, 2nd edn. Comenius, Hitzkirch (1989)
5. Gallmann, H.: Zürichdeutsches Wörterbuch, 2nd edn. Verlag Neue Zürcher Zeitung, Zürich (2009)
6. Häcki Buhofer, A., Gasser, M., Hofer, L.: Das Neue Baseldeutsche Wörterbuch. Christoph Merian Verlag, Basel (2010)
7. Hammarström, H., Borin, L.: Unsupervised learning of morphology. Computational Linguistics 37(2) (to appear, 2011)
8. Hotzenköcherle, R., Schläpfer, R., Trüb, R., Zinsli, P. (eds.): Sprachatlas der deutschen Schweiz. Francke, Bern (1962–1997)
9. Institut für Deutsche Sprache, Programmbereich Korpuslinguistik: Korpusbasierte Wortformenliste (bzw. Grundformenliste) DEREWO, v-30000g-2007-12-31-0.1, mit Benutzerdokumentation. Mannheim (2007)
10. Lötscher, A.: Schweizerdeutsch. Geschichte, Dialekte, Gebrauch. Huber, Frauenfeld (1983)
11. Marti, W.: Berndeutsch-Grammatik für die heutige Mundart zwischen Thun und Jura. Francke, Bern (1985)
12. Rash, F.: The German Language in Switzerland. Multilingualism, Diglossia, and Variation. Peter Lang, Bern (1998)
13. Rumpf, J., Pickl, S., Elspaß, S., König, W., Schmidt, V.: Structural analysis of dialect maps using methods from spatial statistics. Zeitschrift für Dialektologie und Linguistik 76(3) (2009)
14. Scherrer, Y., Rambow, O.: Natural language processing for the Swiss German dialect area. In: Proceedings of KONVENS 2010, Saarbrücken (2010)
15. Thielen, C., Schiller, A., Teufel, S., Stöckert, C.: Guidelines für das Tagging deutscher Textkorpora mit STTS. Tech. rep., University of Stuttgart and University of Tübingen (1999)
Author Index

Attia, Mohammed 98
Axelson, Erik 67
Faaß, Gertrud 46
Gdaniec, Claudia 86
Goba, Kārlis 14
Hardwick, Sam 67
Karttunen, Lauri 1
Kuboň, Vladislav 119
Larasati, Septina Dian 119
Lindén, Krister 67
Manandise, Esmé 86
Pecina, Pavel 98
Pinnis, Mārcis 14
Pirinen, Tommi A. 67
Sagot, Benoît 23
Scherrer, Yves 130
Silfverberg, Miikka 67
Toral, Antonio 98
Tounsi, Lamia 98
van Genabith, Josef 98
Walther, Géraldine 23
Zeman, Daniel 119