… <mão> na massa, in which there is a kind of "trial to recover" its compositionality (in a sort of "unfrozenness"). All the other occurrences in these concordances are frozen ones. The problem here is the identification of the frozen strings and the procedure for distinguishing them from the compositional strings. At this research stage, we think it becomes necessary to identify these strings by examining each occurrence: the intervention of a linguist becomes necessary to separate the compositional from the frozen occurrences. In this example, among the 67 strings found in the corpus, there were five compositional occurrences. Even in the cases where there was a "trial to recover" the compositional sense, the presence of a frozen expression was evident.
[Tables of grapheme-to-phone rules for <o>, <p> and <q> (Table 10); only fragments of the Phone and Example columns survive extraction, e.g. <qu> → [kS] in quito, quente and [k] in quando.]
Within each grapheme block, individual rules are (a) disjunctively ordered, so that once a rule has been applied all the others are skipped; and (b) layered in the order they are checked, so that the last rule for every grapheme is applied whenever none of the other rules applies, i.e., it is the default rule. For a given text to be transcribed, the algorithm for mapping a given grapheme into its respective acoustic unit follows the order of appearance of each grapheme. For every grapheme of the sentence, the corresponding algorithm is called, concatenating the generated acoustic unit into the transcribed sentence. It is worth noting that a rule can skip the next grapheme to be analyzed, such as in the fourth rule of the grapheme algorithm, where both graphemes …
[Table 11. Table of rules for graphemes <s, t>; rule details lost in extraction.]
In Table 6, no phonetic representation is used, as the corresponding grapheme is not pronounced in Portuguese. It is also important to highlight that a given grapheme can be mapped into more than one acoustic unit, as can be seen in Table 13. In the fifth rule of this table, the grapheme <x> is mapped into two acoustic units, [kS][s], as well as in the sixth rule, if the word that contains <x> belongs to the exception list of Table 14.
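A minimal sketch (not the authors' implementation) of the disjunctively ordered rule lookup described above; the <q> entries are hypothetical, the last entry of each list plays the role of the default rule, and a rule may emit more than one phone or consume the following grapheme:

RULES = {
    # (right context, phones emitted, graphemes consumed); None matches anything
    "q": [
        ("ue", ["k", "S"], 2),   # hypothetical: <qu> before e -> [kS], consuming <u>
        ("u",  ["k"],      2),   # hypothetical: <qu> elsewhere -> [k], consuming <u>
        (None, ["k"],      1),   # default rule, applied when nothing else matched
    ],
}

def transcribe(word):
    phones, i = [], 0
    while i < len(word):
        g = word[i]
        # graphemes without rules are simply echoed, for brevity
        for right, out, consumed in RULES.get(g, [(None, [g], 1)]):
            if right is None or word[i + 1:].startswith(right):
                phones.extend(out)   # a rule may emit several acoustic units
                i += consumed        # ...and may skip the next grapheme
                break
    return phones

# transcribe("quando") -> ['k', 'a', 'n', 'd', 'o']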
[Table 13. Table of rules for graphemes <x, y, z>: e.g. word-initial <e>, <ê> or <ine> followed by <x> → [z], as in execrar, inexistência; remaining rules lost in extraction.]
3 Experimental Results
The proposed rules were implemented and tested using part of the text from the CETEN-Folha database [6]. The phones generated by the algorithm were checked, and 98.43% of them were correctly transcribed. A summary of the errors can be seen in Table 15.
Table 15. Table of errors on mapping the graphemes
Error type                      Occurrences   Occurrence (%)
[e] or [E] misplaced            22            0.28%
[o] or [O] misplaced            18            0.23%
Incorrect foreign word phones   31            0.40%
Diphthongs                      35            0.44%
Incorrect acronym phones        17            0.22%
As can be seen from Table 15, the errors come from diphthongs, foreign words, acronyms, and confusion between [O] and [o], and between [E] and [e]. Rules to handle these cases are the subject of ongoing research.
4 Conclusions
This paper presents rules for generating a sequence of phones from a grapheme sequence, to be applied in a BP TTS system. The proposed rules were tested using part of the CETEN-Folha text database, yielding 98.43% of correctly transcribed phones. Present research concentrates on proposing rules to deal with foreign words, names, diphthongs, and the decision between the phones [O] and [o], and [E] and [e]. Rules to discriminate the different levels of nasality and stress for some acoustic units, such as [6] and [6~], are also the subject of future work.
References
1. Cunha, Celso: Língua portuguesa e realidade nacional. 2a ed. atualiz. Rio de Janeiro: Tempo Brasileiro, 1970.
2. Anais do Primeiro Congresso Brasileiro de Língua Falada no Teatro, 1958.
3. Anais do Primeiro Congresso da Língua Nacional Cantada, 1938.
4. Ramos, Jânia M.: Avaliação de dialetos brasileiros: o sotaque. In: Revista de Estudos da Linguagem. Belo Horizonte: UFMG. 6(5):103–125, jan.–jun. 1997.
5. Almeida, M.J.A.: Étude sur les attitudes linguistiques au Brésil. Université de Montréal, 1979.
6. Corpus de Extractos de Textos Electrônicos NILC/Folha de São Paulo (CETEN-Folha). http://acdc.linguateca.pt/cetenfolha/
7. Speech Assessment Methods Phonetic Alphabet (SAMPA). http://www.phon.ucl.ac.uk/home/sampa
8. Alcaim, Solewicz, and Moraes: Frequência de ocorrência dos fones e listas de frases foneticamente balanceadas no Português falado no Rio de Janeiro. Revista da Sociedade Brasileira de Telecomunicações, 7(1):23–41.
9. Pinto, G.O., Barbosa, F., and Resende Jr., F.G.: Brazilian Portuguese TTS based on HMMs. In: Proceedings of the International Telecommunications Symposium, Natal, Rio Grande do Norte, 2002.
Improving the Accuracy of the Speech Synthesis Based Phonetic Alignment Using Multiple Acoustic Features

Sérgio Paulo and Luís C. Oliveira
L2F Spoken Language Systems Lab, INESC-ID/IST
Rua Alves Redol 9, 1000-029 Lisbon, Portugal
{spaulo,lco}@l2f.inesc-id.pt
http://www.l2f.inesc-id.pt
Abstract. The phonetic alignment of spoken utterances for speech research is commonly performed by HMM-based speech recognizers in forced alignment mode, but the training of the phonetic segment models requires considerable amounts of annotated data. When no such material is available, a possible solution is to synthesize the same phonetic sequence and align the resulting speech signal with the spoken utterances. However, without a careful choice of the acoustic features used in this procedure, it can perform poorly when applied to continuous speech utterances. In this paper we propose a new method to select the best features to use in the alignment procedure for each pair of phonetic segment classes. The results show that this selection considerably reduces the segment boundary location errors.
1 Introduction
Phonetic alignment plays an important role in speech research. It is needed in a wide range of applications, from the creation of prosodically labelled databases for research into natural prosody generation, to the creation of training data for speech recognizers. Furthermore, the development of many corpus-based speech synthesizers [1,2] requires large amounts of annotated data. Manual phonetic alignment of speech signals is an arduous and very time-consuming task. Thus, the size of the speech databases that can be labelled this way is obviously very constrained, and the creation of large speech inventories requires some sort of automatic method to perform the phonetic alignment. While building a system to automatically align a set of utterances, two different problems arise. First, we have to know the sequence of phonetic segments observed in those utterances. Then, we have to locate the segment boundaries. The sequence of segments can be obtained by using a pronunciation dictionary or by applying a set of pronunciation rules to the orthographic transcription of the utterances. However, it is usually not possible to predict the exact sequence uttered by the speaker, and we must take into account possible disfluencies,
elisions, allophonic variations, etc. In this work, we will assume that we already have the right sequence of segments and we will focus on the task of locating the segment boundaries. Several approaches have been taken to try to solve this problem. The most widely explored technique is the use of HMM-based speech recognizers (sometimes hybrid systems, based on HMMs and Artificial Neural Networks) in forced alignment mode. This approach relies on the use of phone models built under the HMM framework. These models are trained using large amounts of labelled data, recorded from several speakers, to take into account the phones' acoustic properties in very different contexts. For single-speaker databases, the performance of the system can be improved by adapting the speaker-independent models to the speaker's voice. The difficulty of this approach is that it requires the availability of segmented data for the speaker. This material must be annotated following strict segmentation rules so that the resulting system can locate segment boundaries with the necessary precision. When no such system is available, a Dynamic Time Warping (DTW, [3]) based approach can be taken. This technique was used in the early days of speech recognition to compare and align a spoken utterance with pre-recorded models, taking into account possible variations in the speaker's rhythm. The recognized utterance corresponded to the model with the minimum accumulated distance after the alignment. The same methodology can be used for the phonetic alignment problem, as described in [6] and [7]. This procedure, also known as speech synthesis based phonetic alignment, starts by producing a synthetic speech signal with the desired phonetic sequence, which allows us to know the exact location of the phonetic segment boundaries. This can easily be achieved using a modified speech synthesizer. The next step is to compute, every few milliseconds, vectors of acoustic features for both the synthetic and natural speech signals. Using some type of distance measure, the acoustic feature vectors can be aligned with each other using the DTW algorithm. The result of the algorithm is a time alignment path between the synthetic and natural signal time scales that allows us to map the segment boundaries from the synthetic signal onto the natural utterance. This approach does not require any previously segmented speech from the same speaker, but the results depend, to some extent, on the similarity between the synthesizer's and the speaker's voices, which should, at least, share the same gender. The performance of this method is strongly dependent on the selection of the acoustic features used in the alignment procedure and on the distance used to compare them. This work is part of an effort to automate the process of multi-level annotation of speech signals. A complete overview of this problem can be found in [4]. In this paper, we describe our work on the use of different features to improve the performance of a DTW-based phonetic alignment algorithm. The results of this study led us to a new method to perform the alignment that uses multiple acoustic features depending on the class of segments to be aligned. The paper is divided into five sections. The next section describes the process for producing the synthetic reference signal with segmentation marks. The
following section describes an automatic procedure for the selection of the most relevant acoustic features. These results are then applied in the next section, where the alignment procedure is described. The final section compares the results of the new method with a traditional approach.
2 Waveform Generator
An important issue in DTW-based phonetic alignment is the generation of the reference speech signal. This can be achieved by using some sort of speech synthesizer, modified to produce the desired phonetic sequence together with the segment boundaries. The problem with this solution is that the signal processing required to impose the rhythm and intonation determined by the prosody module also introduces distortions in the synthetic signal. For our purposes, these prosodic modifications are not necessary, and a simple waveform concatenation system was used. Since our goal was to locate the segment boundaries, we used diphones as concatenation units. This way, the concatenation distortion is located in the middle of the phone and does not affect the signal at the phone boundary. In order to have a general-purpose system, it must be able to produce any phonetic sequence, and the inventory must contain all the possible diphones in the language. We followed the common approach of generating a set of nonsense words (logathomes), containing all the required diphones in a context that minimizes the co-articulation with the surrounding phones. A speaker was asked to read the logathomes in a sound-proof room and was recorded using a head-mounted microphone, in order to keep the recording conditions reasonably constant among sessions. We also asked the speaker to keep a constant intonation and rhythm. The recorded material was then manually annotated. We used the unit selection module of the Festival Speech Synthesis System [8] to perform the concatenation. A local search is made around the diphone boundaries to find the best concatenation point. We used the Euclidean distance between the Line Spectral Frequencies (LSF) for costing the spectral discontinuities of the speech units.
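The local search for the best concatenation point can be pictured with a short sketch; the frame layout and the LSF arrays below are assumptions about data organization, not the Festival module's actual interface:

import numpy as np

def best_concatenation_point(lsf_a, lsf_b):
    # lsf_a: (n, order) LSF frames from the mid-phone end of diphone a—b
    # lsf_b: (m, order) LSF frames from the mid-phone start of diphone b—c
    costs = np.linalg.norm(lsf_a[:, None, :] - lsf_b[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmin(costs), costs.shape)
    return i, j, costs[i, j]   # join frame i of the first unit to frame j of the second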
3 Acoustic Features
We considered some of the most relevant acoustic features used in speech processing: the mel frequency cepstrum coefficients (MFCC) and their differences (deltas), the four lowest resonances of the vocal tract (formants), the line spectral frequencies (LSF), the energy and its delta, and the zero crossing rate of the speech signal. Both the energy and the MFCC coefficients, as well as their deltas, were computed using software from the Edinburgh Speech Tools Library [9]. The formants were computed using the formant program of the Entropic Speech Tools [10], and the remaining features were computed using our own programs.
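For readers who want to reproduce a comparable feature set, a rough modern analogue using librosa is sketched below (an assumption on our part: the original work used the Edinburgh and Entropic tools; LSFs and formants would need a separate LPC-based routine and are omitted):

import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)    # hypothetical file name
hop = int(0.005 * sr)                              # 5 ms frame shift, as used below
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, hop_length=hop)
dmfcc = librosa.feature.delta(mfcc)                # MFCC deltas
energy = librosa.feature.rms(y=y, hop_length=hop)  # frame energy
denergy = librosa.feature.delta(energy)            # energy delta
zcr = librosa.feature.zero_crossing_rate(y, hop_length=hop)
features = np.vstack([mfcc, dmfcc, energy, denergy, zcr])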
Our first experiments showed that each of these features, used separately, produced uneven results. Depending on the class of phones to be aligned, some features proved better than others. For instance, in a vowel-plosive transition the energy feature was the best performer, but for vowel-vowel transitions the best results were achieved using formants as features. This immediately suggested the use of multiple features to distinguish the different phone transition classes.
3.1 Feature Normalization
The combination of multiple features requires a prior normalization step to equalize their influence on the overall alignment cost. It was decided to normalize the values to the range [0, 1]. The first stage was to determine which of the features had values that followed a Gaussian distribution. Observing the histograms of each coefficient, the MFCCs and their deltas were the only ones that matched that distribution. The mean and standard deviation were computed for each one of them, and the normalization was then performed using the equation:

x_i = (X_i − µ_i) / (2σ_i) + 1/2    (1)
where x_i, X_i, µ_i and σ_i are the normalized value, the non-normalized value, the mean value, and the standard deviation of the i-th MFCC, respectively. The LSF values were divided by π. Since the zero crossing rate was computed as the ratio between the number of times the speech signal crosses zero magnitude and the number of signal samples in a fixed-size window (a few milliseconds), its values already have the right magnitude (between 0 and 1). For the energy, its delta and the formants, maximum and minimum values were found for each utterance, and their mean values were computed (Y_i^max and Y_i^min). The normalized values were calculated using the following equation:

y_i = (Y_i − Y_i^min) / (Y_i^max − Y_i^min)    (2)
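In code terms, both schemes are one-liners; the sketch below assumes the per-coefficient statistics have already been estimated from the training material:

import numpy as np

# Eq. (1): Gaussian-like features (MFCCs and their deltas)
def normalize_gaussian(X, mu, sigma):
    return (X - mu) / (2.0 * sigma) + 0.5

# Eq. (2): min/max-bounded features (energy, its delta, formants)
def normalize_minmax(Y, y_min, y_max):
    return (Y - y_min) / (y_max - y_min)

# LSFs are simply divided by pi; the zero crossing rate is already in [0, 1].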
3.2 Feature Selection Procedure
Having all the features normalized, the next goal was to find which were more relevant in a given phonetic context, that is, which feature allowed us to locate the boundary with greater precision. For this purpose we had a set of 300 manually aligned utterances that we used to evaluate the relevance of each feature. These utterances were spoken by a different speaker than the one used to record the diphone inventory. The waveform generator previously described was used to produce reference synthetic signals for the phonetic sequences of these utterances, and sets of feature vectors were computed every 5 milliseconds for both the reference and spoken signals. Using the Euclidean distance, a matrix was computed with the distances between all the feature vectors of the two series. Figure 1 shows a rough representation of this matrix.

[Fig. 1. Graphical representation of the distance matrix regions used for choosing the best feature / pair of features to align the different pairs of phonetic segments]

We then evaluated each distance on its capacity to
discriminate between two consecutive phones. This was achieved by computing the average distance between feature vectors of the same phone (dist_s) and of different phones (dist_d). Using the example in Fig. 1, if we want to choose an acoustic feature to distinguish the silence (#) and the vowel u, dist_s is the average of the values in regions 1 and 6 of the matrix, while dist_d is the average of the values in regions 2 and 5. This procedure was performed for every pair of phones and for every utterance in the training set, and the resulting values were saved at the end of each iteration. Finally, we computed an average value of the ratio between dist_s and dist_d for each pair of phonetic segments and for each acoustic feature. The chosen feature is the one that gives a minimal value for this ratio:

F_k = argmin_x Σ_{i=1..N_k} dist_s(k, x, i) / dist_d(k, x, i)    (3)
where x is one of the tested features, k represents the pair of phones being analyzed, N_k is the number of instances of this pair in our set of utterances, F_k is the best feature for this type of transition, and dist_s(k, x, i) and dist_d(k, x, i) are the mean distances for instance i using acoustic feature x. The smaller the ratio, the greater the probability of having well-aligned frames, at least locally. With this approach, we are trying to use the features that assign the greatest penalty to alignment paths that fall outside the darkest regions of Fig. 1 (regions 1, 6, 11 and 16). Given the reduced amount of training data, we soon realized that it would be impossible to have a large enough number of instances for each pair of segments to produce confident results. Thus the different phonetic segments were grouped into phonetic classes: vowels, fricatives, plosives, nasals, liquids and silence.
Table 1. Best feature pairs for the multiple phonetic segment class transitions

            Nasals     Fricatives  Liquids    Plosives   #          Vowels
Nasals      frm+lsf    mfcc+zcrs   frm+en     lsf+en     frm+en     mfcc+mfcc
Fricatives  lsf+lsf    mfcc+en     en+zcrs    lsf+en     zcrs+en    lsf+lsf
Liquids     lsf+en     lsf+en      lsf+lsf    mfcc+en    mfcc+en    frm+mfcc
Plosives    lsf+en     lsf+lsf     lsf+en     mfcc+mfcc  lsf+zcrs   mfcc+en
#           lsf+en     lsf+en      lsf+en     lsf+en     x          lsf+en
Vowels      mfcc+en    zcrs+lsf    mfcc+en    lsf+en     mfcc+en    frm+mfcc
The semi-vowels were grouped into the class of the vowels. The described procedure for differentiating the phones was then repeated using phone class transitions (vowel-vowel, fricative-vowel, etc.). The analysis of the results showed that, in general, for each phone class transition, at least two of the selected features showed good discriminative capacity. This could suggest some equivalence between the two features, but it could also mean that the two features were complementary. We therefore performed a combined optimization to select the pair of features for each phone class pair. The process could be extended to a combination of even more features, but the results showed no significant improvement in using more than a pair of features. Table 1 shows the results of this procedure, where mfcc, lsf, frm, en and zcrs denote the MFCC coefficients and their deltas, the LSFs, the formants, the energy and its delta, and the zero crossing rate, respectively. The x symbol means that the class transition does not exist in the training set. The best feature pair for a transition x-y is located in the row of x and the column of y.
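The selection criterion of Eq. (3) can be sketched as follows; the boundary indices and the ratio bookkeeping are assumptions about data layout rather than the authors' code:

import numpy as np

def instance_ratio(D, i, j):
    # D: distance matrix for one feature over a phone pair; (i, j) are the
    # frame indices of the phone boundary in the synthetic/natural signals.
    # dist_s averages the same-phone blocks (regions 1 and 6 of Fig. 1),
    # dist_d the cross-phone blocks (regions 2 and 5).
    dist_s = 0.5 * (D[:i, :j].mean() + D[i:, j:].mean())
    dist_d = 0.5 * (D[:i, j:].mean() + D[i:, :j].mean())
    return dist_s / dist_d

def best_feature(ratios):
    # ratios[x] holds the instance_ratio values of feature x over the training set
    return min(ratios, key=lambda x: float(np.sum(ratios[x])))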
4 Frame Alignment
Before applying the DTW algorithm, the distance matrix between the reference and the spoken signal must be built. Since we know the boundary locations of the synthetic segments, the distance matrix can be built iteratively, phone-pair by phone-pair. Taking the example shown in Fig. 1, we start by computing the matrix values for all the rows that correspond to the phone-pair #-u, using the best pair of features according to the former results. However, the phone u also belongs to the next phone-pair (u-i), so the computed distance is multiplied by a decreasing triangular weighting window. The distance for the next phone-pair (u-i) is then computed using the best pair of features for the vowel-vowel transition, and its value is added to the rows corresponding to segment u, weighted by an increasing triangular window. Figure 2 shows these triangular weighting windows, where the dotted lines are the weighting factors for the previous phone-pair distances and the dashed lines are the weights of the distances for the next phone-pair. After computing all the values of the distance matrix, the DTW algorithm is applied to find the path that links the top left corner of the matrix to the lower right corner with a minimum accumulated distance.
[Fig. 2. Graphical representation of the necessary operations for building the distance matrix]
This path will be the alignment function between the time scales of the synthetic reference signal and the spoken utterance.
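The whole section can be condensed into a sketch; the triangular blending helper and the plain O(nm) DTW pass below are minimal stand-ins for the actual implementation:

import numpy as np

def blend_shared_rows(prev_block, next_block):
    # rows of a shared phone: the previous phone-pair's distances fade out
    # (decreasing window) while the next pair's fade in (increasing window)
    w = np.linspace(1.0, 0.0, prev_block.shape[0])[:, None]
    return w * prev_block + (1.0 - w) * next_block

def dtw(D):
    # D: full distance matrix; returns the minimum-cost monotone path from
    # the top left corner to the lower right corner
    n, m = D.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = D[i - 1, j - 1] + min(acc[i - 1, j],
                                              acc[i, j - 1],
                                              acc[i - 1, j - 1])
    path, i, j = [], n, m
    while (i, j) != (1, 1):
        path.append((i - 1, j - 1))
        i, j = min(((i - 1, j), (i, j - 1), (i - 1, j - 1)),
                   key=lambda p: acc[p])
    path.append((0, 0))
    return path[::-1]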
5 Results
The procedure described in the previous section was applied to the reference corpus of 300 manually annotated sentences. The results are depicted in Fig. 3, where the lower solid line is the annotation accuracy when the entire set is aligned always using a feature vector of 12 Mel-frequency cepstrum coefficients and their differences. Only 46% of the phonetic segments were aligned with an error of less than 20 ms. Using only the best feature for computing the distance for each phone class pair increases the 20 ms accuracy to 59% of the segments (dashed line). This result can be improved to 70% by combining two features for computing the distance measure. The relatively low percentage of agreement for tolerances lower than 20 ms can be partially explained by the fact that the segmentation criteria used in the annotation of the reference corpus were not exactly the same as those used in the segmentation of the logathomes used to produce the synthetic reference. Another difficulty was that the speech material in the reference corpus was uttered by a professional speaker with a very rich prosody and large variations in energy, where several consecutive voiced speech segments become unvoiced. This is, in our opinion, the main reason for about 4% of disagreement even at high tolerances (about 100 milliseconds). We hope to detect these alignment problems with confidence measures based on the alignment cost per segment and on phone duration statistics. As soon as we have more annotated material, we also plan to evaluate the annotation accuracy on a corpus for which we have not optimized the feature selection, in order to test the generality of the selected features.
[Fig. 3. Accuracy of some versions of the proposed algorithm and a classic DTW-based phonetic alignment algorithm: automatic/manual agreement (%) vs. maximum allowed error (ms), for MFCC+deltas, best feature, and best feature pair]
6 Conclusions
In this work we have presented a method for selecting the most relevant pair of features for aligning two speech signals with the same phone sequence but with different durations. These features were then used in a DTW-based method for performing the phonetic alignment of a spoken utterance. The results clearly show the advantage of selecting the most appropriate features for each class of segments in the alignment of two utterances: the most commonly used feature, MFCCs, performed well below the proposed method.

Acknowledgements. The authors would like to thank M. Céu Viana and H. Moniz for providing the manually aligned reference corpus. This work is part of Sérgio Paulo's PhD thesis, sponsored by a Portuguese Foundation for Science and Technology (FCT) scholarship. INESC-ID Lisboa had support from the POSI Program.
References
1. M. Beutnagel, A. Conkie, J. Schroeter, Y. Stylianou and A. Syrdal: The AT&T Next-Gen TTS System. 137th Acoustical Society of America Meeting, Berlin, Germany, 1999.
2. A. Black: CHATR, Version 0.8, a generic speech synthesizer. System documentation, ATR-Interpreting Telecommunications Laboratories, Kyoto, Japan, 1996.
3. H. Sakoe and S. Chiba: Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. on ASSP, 26(1):43–49, 1978.
4. S. Paulo and L. Oliveira: Multilevel Annotation of Speech Signals Using Weighted Finite State Transducers. In: Proceedings of the IEEE 2002 Workshop on Speech Synthesis, Santa Monica, California, 2002.
5. D. Caseiro, H. Meinedo, A. Serralheiro, I. Trancoso and J. Neto: Spoken Book alignment using WFSTs. HLT 2002 Human Language Technology Conference, San Diego, California, 2002.
6. F. Malfrère and T. Dutoit: High-Quality Speech Synthesis for Phonetic Speech Segmentation. In: Proceedings of Eurospeech'97, Rhodes, Greece, 1997.
7. N. Campbell: Autolabelling Japanese ToBI. In: Proceedings of ICSLP'96, Philadelphia, USA, 1996.
8. A. Black, P. Taylor and R. Caley: The Festival Speech Synthesis System. System documentation, Edition 1.4, for Festival Version 1.4.0, 17 June 1999.
9. P. Taylor, R. Caley, A. Black and S. King: Edinburgh Speech Tools Library. System Documentation, Edition 1.2, 15 June 1999.
10. ESPS Programs Version 5.3. Entropic Research Laboratories Inc., 1998.
Evaluation of a Segmental Durations Model for TTS

João Paulo Teixeira and Diamantino Freitas
Polytechnic Institute of Bragança and Faculty of Engineering of the University of Porto, Portugal
[email protected], [email protected]
Abstract: In this paper we present a condensed description of a European Portuguese segmental durations model for TTS purposes and concentrate on its evaluation. The model is based on artificial neural networks. The quality of the model was evaluated by comparison with read speech. The standard deviation reached on the test set is 19.5 ms and the linear correlation coefficient is 0.84. The model is perceptually rated at 4.12, against 4.30 for natural human read speech, on a scale of 5.
1 Introduction
The segmental durations model presented here is part of a global prosodic model for a European Portuguese TTS system, which is under development in the authors' institutions. It is based on artificial neural networks that process as input linguistic information relative to the context of each phoneme, and output the predicted duration for each of its segments. A number of duration models can be found in the literature for other languages, mostly for American and British English. The most prominent ones are mentioned in the following. Campbell introduced the concept of Z-score [1] to distribute the duration estimated, by a neural network, for a syllable, considering that it would be the more stable unit for the prediction of duration. He measured an inter-speaker linear correlation coefficient (r) of r=0.92 taking the syllable as unit, and only r=0.76 for segments. He reported r=0.93 for the syllables in his model. This concept is not, however, generally accepted. Other authors, like Van Santen [2], use the phoneme as the segmental unit in order to predict durations in a Sum-of-Products model. The author reported r=0.93 on his database. Barbosa and Bailly [3] employed the concept of Inter-Perceptual Center Groups (IPCG) as the stable unit, and applied a neural network to predict its duration and, subsequently, the Z-score to determine the duration of each phoneme inside the IPCG. This model can deal with speech rate. They reported a standard deviation for French of σ=43 ms and, later, Barbosa reported σ=36 ms for Brazilian Portuguese [4]. Other relevant models are the Klatt model [5], based on a Sum-of-Products; the rule-based algorithm for French presented by Zellner [6] for two different speech rates, obtaining r=0.85 and arguing that this value corresponds to the typical inter-speaker correlation; the look-up-table-based model for Galician [7], with an rmse (root-mean-squared error) value of 19.6 ms on the training data; the neural
network-based models for Spanish [8] and for Arabic [9], achieving r=0.87; and the CART-based model for Korean [10], with r=0.78. The final model we consider in this introduction was developed by Mixdorff as an integrated model for durations and F0 for German [11], achieving r=0.80 for the durations. Existing duration models can be classified as statistical, mathematical or rule-based models. Besides the present one, examples of other statistical models include [1,3,4,8–11], although [1] and [3] use the Z-score concept. These types of models became interesting with the availability of large databases. Examples of mathematical models are [2] and [5]. Rule-based models are [6] and [7]. The basic idea behind our approach comes from the fact that the duration of a segment depends, in a complex manner, not only on a set of contextual features derived from both the speech signal and the underlying linguistic structure, but also on random causes. We therefore try to take into consideration most of the known relevant features of different kinds that are candidates to influence the duration value, and try to determine the complex dependency function in a robust, efficient, statistical manner that fits the selected database. The database is known in advance not to contain all possible combinations of features, and the considered set of features is not exhaustive. Inter-speaker and intra-speaker variability is well known and should be considered in the analysis of the results. What can be expected from such a model, then, is an acceptable timing for the sequence of phonemes, and not exactly the same timing imposed by the speaker. This can only be evaluated perceptually. The data used for the training and testing of the model was extracted from the database described in [12]. This database consists of tagged speech tracks of a set of texts extracted from newspapers, read by a professional male radio broadcast speaker at an average speech rate of 12.2 phonemes/second. The part of the data used in the present work covers a total of 101 paragraphs containing a few hundred phrases, essentially of declarative and interrogative types, ranging in length from one word to more than one hundred, for a total of 18,700 segments and 21 minutes of speech. Phonemes were selected as the basic segment, allowing the smallest granularity of modeling. Section 2 describes the model and Sect. 3 describes the evaluation.
2 Description of the Model

2.1 Duration Features
A large number of features were considered as candidates at the beginning of the work. One by one, they were studied and taken out in order to evaluate their relative importance. In selected cases, a set of a few features was considered and taken out jointly to check for consistency. The conclusion is, in general, that the result is different from considering the isolated features, because these features interact non-linearly in a significant manner. After several experiments, considering different sets of features and their correlation with segment duration, one set was finally established as giving the best optimization of the performance of the neural network approximation. The coding of the features' values is also an important issue, so some features were coded in varying ways, in order to find the best trend and solution. The final set of features of the model, and their codifications, is listed in order of decreasing importance:

a. Identity of the segment – one of the 44 phoneme segments considered in the inventory of the database (Table 3);
b. Position relative to the tonic syllable in the so-called accent group – coded in 5 levels according to its correlation with durations;
c. Contextual segment identities – previous (-1) and next three (+1, +2, +3) segments – signaling some significant specific phones in the referred position, and silences (20 phones in position -1; 12 phones in position +1; 4 phones in position +2; 2 phones in position +3);
d. Type of vowel length in the syllable – coded in 5 levels according to its correlation with durations;
e. Length of the accent group – number of syllables and phonemes;
f. Relative position of the accent group in the sentence – first; other; last;
g. Suppression or non-suppression of the last vowel;
h. Type of syllable – coded in 9 levels according to the correlation with durations;
i. Type of previous syllable – coded as in h;
j. Type of vowel in the previous syllable – coded as in d;
k. Type of vowel in the next syllable – coded as in d.
Features b, e and f are linked with the so-called accent groups, which we consider as groups of words with more than two syllables, aggregating neighboring particles. These groups work like prosodic words, having only one tonic syllable. They are not, however, exactly prosodic words if one considers the multiple definitions in the literature. In feature d we consider 5 types of vowels according to their average duration: long {a, E, e, O, o}; medium {6, i}; short {u, @}; diphthong; and nasal. Feature g codes the eventual suppression of the last vowel in the word, as described in [12], because this event usually lengthens the remaining consonant, as in the word 'sete' (read {sEt} – SAMPA code). The types of syllable mentioned in features h, i and j are: V, C, CC (the latter two resulting from a suppressed vowel), VC, CV, VCC, CVC, CCV, CCVC. During the above-described process of selecting the features, a qualitative measurement of their relative importance emerged. Three groups of features can be distinguished according to relevance. The first is feature a, clearly the most important one. The second group in relevance is composed of features b, c, d, e, f and g. The third group, with features that alone are not very important but together assume some relevance, is formed by features h, i, j and k.
2.2 Neural Network
The model consists of a fully connected feed-forward neural network. The output is one neuron that codes the desired duration as a value between 0 and 1. This codification is linear, corresponding to the range 0 to 250 ms. The input neurons receive the set of coded features. Similar levels of performance (r=0.833 to 0.839) are achieved with different network architectures (2-4-1, 4-2-1, 6-1, 10-1), activation functions (logarithmic sigmoid, hyperbolic tangent and linear) and training algorithms (Levenberg-Marquardt [13] and Resilient Back-propagation [14]). If the number of weights of the net is not fewer than the number of training cases and the training is excessive, over-fitting may occur. In order to avoid this problem, two sets of data were used: one set for training, with 14,900 segments, and another set for testing, with 3,000 segments. The test vectors were used to stop training early if further training on the training set would hurt generalization to the test set. The cost function used for training was the mean squared error between output and target values.
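A hedged sketch of this setup, with scikit-learn as a stand-in (the authors used Levenberg-Marquardt and resilient backpropagation, which MLPRegressor does not offer) and random data in place of the coded feature vectors:

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((17900, 30))              # stand-in for the coded features
y_ms = rng.uniform(10, 250, 17900)       # stand-in segment durations in ms

model = MLPRegressor(hidden_layer_sizes=(10,),   # one of the tested layouts
                     activation="tanh",
                     early_stopping=True,        # held-out set stops training
                     validation_fraction=3000 / 17900,  # roughly the 3,000 test vectors
                     max_iter=500, random_state=0)
model.fit(X, y_ms / 250.0)               # durations coded linearly into [0, 1]
pred_ms = model.predict(X[:5]) * 250.0   # decode back to milliseconds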
3 Model Evaluation

Two indicators were used to evaluate the performance of the model: the standard deviation (σ) of the error (e) (Eq. (1)) and the linear correlation coefficient (r) (Eq. (3)). Considering the error as the difference between the target and predicted duration values of the segments, the standard deviation of the error is given by:

σ = √( Σ_i x_i² / N ),  x_i = e_i − ē,  e_i = d_i_original − d_i_predicted    (1)

where x_i is the difference between the error of each segment and the mean error, and the error e_i is the difference between the original and predicted duration of each segment. When the mean error is null, as happens in this case, σ is equal to the rmse, given by Eq. (2):

rmse = √( Σ_i e_i² / N )    (2)

The linear correlation coefficient (r) was the second indicator selected, and is given by Eq. (3):

r_{A,B} = V_{A,B} / (σ_A σ_B),  V_{A,B} = Σ_i (a_i − ā)(b_i − b̄) / N    (3)

where V_{A,B} is the covariance between the vectors A = [a_1 a_2 … a_N] and B = [b_1 b_2 … b_N], the predicted and target duration vectors.
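Equations (1)-(3) translate directly into a few lines of numpy; d_orig and d_pred below are the hypothetical target and predicted duration vectors in milliseconds:

import numpy as np

def evaluate(d_orig, d_pred):
    e = d_orig - d_pred                                 # per-segment error
    sigma = np.sqrt(np.mean((e - e.mean()) ** 2))       # Eq. (1)
    rmse = np.sqrt(np.mean(e ** 2))                     # Eq. (2); equals sigma when the mean error is null
    cov = np.mean((d_pred - d_pred.mean()) * (d_orig - d_orig.mean()))
    r = cov / (d_pred.std() * d_orig.std())             # Eq. (3)
    return sigma, rmse, r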
The performance on the test and training sets, considering all types of phonemes, is given in Table 1.

Table 1. General best performance
Set        r      σ (ms)
Test       0.839  19.46
Training   0.834  19.85
Table 2. Performance of the model (r and σ), and average duration for each type of segment

Vowels:
Phone  r     σ (ms)  Aver. (ms)
a      0.63  26.8    110
6      0.65  21.1    68
E      0.62  23.1    97
e      0.71  28.2    95
@      0.63  29.5    53
i      0.58  23.0    69
O      0.61  25.8    106
o      0.63  26.4    97
u      0.56  24.1    57
j      0.59  21.2    49
w      0.68  20.0    44
j~     0.28  18.6    64
w~     0.53  25.2    53
6~     0.69  24.9    75
e~     0.65  23.6    107
i~     0.74  27.9    109
o~     0.69  25.9    98
u~     0.79  26.9    86
Aver.  0.63  23.8

Consonants:
Phone  r     σ (ms)  Aver. (ms)
p      0.25  9.2     20
!p     0.39  17.8    64
t      0.76  12.8    29
!t     0.59  16.5    48
k      0.41  14.4    37
!k     0.27  16.6    59
b      0.79  11.2    17
!b     0.23  15.1    43
d      0.76  10.9    20
!d     0.20  17.2    41
g      0.73  8.9     20
!g     0.26  12.8    44
m      0.31  18.6    62
n      0.33  17.9    54
J      0.30  16.7    68
l      0.23  19.4    52
l*     0.73  20.9    68
L      0.68  15.3    56
r      0.63  12.4    32
R      0.38  19.0    73
v      0.45  19.7    65
f      0.56  22.7    93
z      0.37  16.6    70
s      0.59  24.7    103
S      0.68  24.5    89
Z      0.54  21.4    78
Aver.  0.50  16.3

Phonemes are presented in SAMPA code. l* is a velar l. ! represents the occlusive part of stop consonants.
In the upper part of Table 2, the vowels have, on a weighted average, r=0.63 and σ=24 ms. In the lower part of the table, r=0.50 and σ=16 ms are the weighted average values for the consonants. The average value for each phone is very well fitted by the neural network.
Figure 1 plots the original versus the predicted durations in the test set for one simulation with r=0.839. There are no major errors: errors are quite low for short segments and naturally higher for longer ones.

[Fig. 1. Best linear fit for original (T) and predicted (A) durations for one simulation on the test set: A = 0.68 T + 16.9, r = 0.839]
3.1 Perceptual Evaluation

One last evaluation of the model presented in this paper is the perceptual test. Five paragraphs from the test set were used for this purpose. Three realizations of each paragraph were presented to 19 subjects (8 experts and 11 non-experts) for evaluation on a scale from 0 to 5. One realization was natural speech (original); another was time-warped natural speech with durations predicted by the model (model); and the last realization, also time-warped speech, used the average duration value for each phone (average). Time-warp modifications were done with a TD-PSOLA algorithm.

Table 3. Scores of the model for the paragraphs presented to the listeners
Paragraph  N. of seg.  σ (ms)  r
1          36          19.0    0.97
2          164         18.9    0.89
3          177         22.6    0.94
4          209         19.0    0.91
5          204         19.8    0.94
The subjects did not know which stimulus corresponded to each realization, and they could listen as many times as they wanted. Table 3 presents, for each paragraph, its number of segments, plus the σ and r values for the predicted durations. In all cases the scores of expert and non-expert subjects were very similar, so they were merged.
[Fig. 2. Average score of the perceptual test by subject (left side) and by paragraph (right side), for the original, model and average realizations]
Figure 2 (left side) shows the average evaluation by subject for the original, model-modified and fixed-average-duration realizations. For most of the listeners the model is very close to the original, and in four cases the model is even preferred. Figure 2 (right side) presents the average evaluation by paragraph. Again, the model is very close to the original, and in paragraph 3 it is even preferred. Finally, Fig. 3 characterizes the subjects' opinions, representing, for each of the three sets of realizations, the minimum, the lower quartile, median and upper quartile in the notched box, the maximum, the mean (thick lines) and the outliers. The original utterances achieved a mean score of 4.30, the ones with durations imposed by the model achieved 4.12, and the ones with the average duration value imposed for each phoneme achieved 3.53. A one-way ANOVA gives p < 1e-12 for F = 31.4, meaning a significance higher than 99.9%. The 0.18-point distance to the original utterances means that the sentences produced with predicted durations are quite close to natural.
4 Conclusion
A statistical model for segmental durations in European Portuguese was presented. This model is based on artificial neural networks that receive linguistically oriented contextual information for the segment to be processed and predict its duration. Results were presented and discussed, showing a good objective performance of the model. This evaluation was done by comparing the model's output durations with the target durations.
[Fig. 3. Opinion score for: (1) original; (2) model; (3) average]
The objective evaluation, considering all types of segments, shows that the model can be considered to be at a good quality level when compared with most of the published work for other languages. The subjective evaluation, done by means of a perceptual test, shows that the model is quite close to the original (a distance of 0.18 on a scale of 5) and relatively far from the fixed realizations with averaged durations (0.59 on a scale of 5). Finally, from the observation of several examples, the model predicts quite consistently the durations of the final segments of words, where other authors report some trouble.
References
1. Campbell, W.N.: Predicting Segmental Durations for Accommodation within a Syllable-Level Timing Framework. In: Proceedings of Eurospeech'93, vol. 2, pp. 1081–1084.
2. Van Santen, J.P.H.: Assignment of segmental duration in text-to-speech synthesis. Computer Speech and Language, 8, 95–128, 1994.
3. Barbosa, P., Bailly, G.: Generation of pauses within the z-score model. In: Van Santen, J.P.H. et al. (eds.): Progress in Speech Synthesis. Springer-Verlag, 1997.
4. Barbosa, P.: A Model of Segment (and Pause) Duration Generation for Brazilian Portuguese Text-to-Speech Synthesis. In: Eurospeech'97, Rhodes.
5. Klatt, D.H.: Linguistic uses of segmental duration in English: Acoustic and perceptual evidence. JASA, 59, 1209–1221, 1976.
6. Zellner, B.: Caractérisation et prédiction du débit de parole en français – Une étude de cas. PhD thesis, U. de Lausanne, 1998.
7. Salgado, X.F., Banga, E.R.: Segmental Duration Modelling in a Text-to-Speech System for the Galician Language. In: Eurospeech'99, Budapest.
8. Córdoba, Vallejo, Montero, Gutierrez, López, Pardo: Automatic Modelling of Duration in a Spanish Text-to-Speech System Using Neural Networks. In: Eurospeech'99.
9. Hifny, Y., Rashwan, M.: Duration Modeling for Arabic Text to Speech Synthesis. In: Proceedings of ICSLP'2002.
10. Chung, H.: Segment Duration in Spoken Korean. In: Proceedings of ICSLP'2002.
11. Mixdorff, H.: An Integrated Approach to Modeling German Prosody. Dr.-Ing. habil. thesis, Technical University of Dresden, 2002.
12. Teixeira, J.P., Freitas, D., Braga, D., Barros, M.J., Latsch, V.: Phonetic Events from the Labeling of the European Portuguese Database for Speech Synthesis, FEUP/IPB-DB. In: Eurospeech'01, Aalborg.
13. Hagan, M.T., Menhaj, M.: Training feedforward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks, vol. 5, no. 6, 1994.
14. Riedmiller, M., Braun, H.: A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In: Proceedings of the IEEE International Conference on Neural Networks, 1993.
From Portuguese to Mirandese: Fast Porting of a Letter-to-Sound Module Using FSTs

Isabel Trancoso (1), Céu Viana (2), Manuela Barros (2), Diamantino Caseiro (1), and Sérgio Paulo (1)

(1) L2F - Spoken Language Systems Lab, INESC-ID/IST, Rua Alves Redol 9, 1000-029 Lisboa, Portugal
{Isabel.Trancoso,dcaseiro,spaulo}@l2f.inesc-id.pt
http://www.l2f-inesc-id.pt/
(2) CLUL, Av. Prof. Gama Pinto 2, Lisbon, Portugal
{mcv,manuela.barros}@clul.ul.pt
http://www.clul.ul.pt/
Abstract. This paper describes our efforts in porting our letter-to-sound module from European Portuguese to Mirandese, the second official language in Portugal. We describe the rule formalism and the composition of the various transducers involved in the letter-to-sound conversion. We propose a set of extra SAMPA symbols to be used in the phonetic transcription of Mirandese, and we briefly cover the set of rules and the results obtained for the two languages. Although still at a very preliminary stage, we also describe our efforts at building a waveform generation module based on finite state transducers. The use of finite state transducers allowed a very flexible and modular framework for deriving and testing new rule sets. Our experience led us to believe that letter-to-sound modules could be helpful tools for researchers involved in the establishment of orthographic conventions for lesser-spoken languages.
1 Introduction
Mirandese is the smallest language spoken in the Iberian Peninsula. It is spoken by a population that does not exceed 12,000 and covers a region of only 500 square kilometres on the northeastern border of the country. It is a Romance language, related to Asturian-Leonese, and for several centuries it was preserved only as an orally transmitted language. Its recognition as an official language is fairly recent (1999), and so are the efforts to create an orthographic convention [1] in order to establish unifying criteria for writing in this language (see http://mirandes.no.sapo.pt). The motivation for deriving letter-to-sound rules for Mirandese (Mirandés) was to build a tool that may help native speakers learn how to read and write, as well as students interested in the language.
As a starting point, we used the rules that we had derived for European Portuguese (EP). The first letter-to-sound module that we developed for EP was in the context of a rule-based system (DIXI). In fact, none of the data-driven tools that we had developed since then (either based on neural networks [2] or on CARTs - Classification and Regression Trees [3]) were suited to Mirandese, given the small amount of training material. Our most recent efforts in terms of letter-to-sound conversion were based on Finite State Transducers (FSTs) [4], motivated by their flexibility in integrating multiple sources of information and by other interesting properties such as inversion. The knowledge-based approach using FSTs is flexible enough to allow easy porting to similar languages or other varieties of Portuguese. This paper describes our efforts in porting our FST-based letter-to-sound module from European Portuguese (EP) to Mirandese. We start with the description of the rule formalism (Sect. 2) and of the composition of the various transducers involved in the letter-to-sound conversion (Sect. 3). We proceed with the proposal of a set of SAMPA symbols to be used in the phonetic transcription of Mirandese (Sect. 4). The next two sections present the main results for EP and Mirandese. Finally, we describe our preliminary efforts at building other modules of a concatenative synthesizer using FSTs (Sect. 7) and present some concluding remarks.
2 Rule Formalism
In our first rule-based system for EP (DIXI [5]), the rules were written in the usual form φ → ψ / λ _ ρ, where φ, ψ, λ and ρ can be regular expressions that refer to one or multiple levels. The meaning of the rules was the following: when φ was found with λ on the left and ρ on the right, ψ would be applied, either replacing φ or filling a different level. Most of the grapheme-to-phone rules were written such that φ, λ and ρ only referred to the grapheme level (with stress marks already placed on it) and ψ only to the phone level. There were no intermediate stages of representation, and no rule created or destroyed the necessary context for the application of another rule. In order to prevent some common errors, a small set of 6 rules was nevertheless added which referred to grapheme-phone correspondences in either context λ or ρ. Although some similarities may be found between DIXI's approach and a Two-Level Phonology approach ([6], [7]), DIXI's rules were not two-level rules: contexts were not fully specified as strings of two-level correspondences, and within the set of rules for each grapheme a specific order of application was required. Default rules needed to be the last and, in some cases in which the contexts of different rules partially overlapped, the most specific rule needed to be applied first. Our first step in the design of the FST-based rule system was to convert DIXI's rules to a set of FSTs. In order to preserve the semantics of these rules we opted to use rewriting rules, but in the following way: first, the grapheme sequence g1, g2, ..., gn is transduced into g1, ε, g2, ε, ..., gn, ε, where ε is an empty symbol, used as a placeholder for
phones. Each rule will replace ε with the phone corresponding to the previous grapheme, keeping the grapheme itself. The context of the rules can now freely refer to the graphemes. The few DIXI rules whose context referred to phones can also be straightforwardly implemented. In this way, we avoid rule dependencies that would be necessary if we had just replaced graphemes by phones: the first rule would only have graphemes in its context, while the last ones would have mainly phones. The very last rule removes all graphemes, leaving a sequence of phones. The input and output language of the rule transducers is thus a subset of (grapheme phone)*. The set of graphemes and the set of phones do not overlap. The rules are specified using a rule specification language whose syntax resembles BNF (Backus-Naur Form) notation, allowing the definition of non-terminal symbols (e.g. $Vowel). Regular expressions are also allowed in the definition of non-terminals. Transductions can be specified by using the simple transduction operator a → b, where a and b are terminal symbols. This work motivated us to extend the language with two commands. The first command is:

OB_RULE n, φ → ψ / λ _ ρ

where n is the rule name and φ, ψ, λ, ρ are regular expressions. OB_RULE specifies a context-dependent obligatory rule, and is compiled using Mohri and Sproat's algorithm [8]. The second one is:

CD_TRANS n, τ ⇒ λ _ ρ

where τ is a transducer (an expression that might include the → operator). CD_TRANS (Context-Dependent Transduction) is a generalization where the replacing expression depends on what was matched. It is compiled using a variation of Mohri and Sproat's algorithm that uses π1(τ) instead of φ, and τ instead of the cross product φ × ψ. Its main advantage is that it can succinctly represent a set of rules that apply to the same context. We use it mainly in the stress-marking phase.
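A toy sketch of the placeholder mechanism in plain Python (an illustration of the semantics only, not the authors' FST compiler; just a right-context test is shown):

EMPTY = "_"

def introduce_phones(word):
    seq = []
    for g in word:                 # g1, EMPTY, g2, EMPTY, ...
        seq += [g, EMPTY]
    return seq

def apply_rule(seq, grapheme, phone, right=None):
    # fill the placeholder after `grapheme` when the grapheme context matches
    graphemes = seq[0::2]
    for k, g in enumerate(graphemes):
        if g == grapheme and seq[2 * k + 1] == EMPTY and \
           (right is None or "".join(graphemes[k + 1:]).startswith(right)):
            seq[2 * k + 1] = phone
    return seq

def remove_graphemes(seq):
    return [s for s in seq[1::2] if s != EMPTY]

seq = introduce_phones("gelo")
apply_rule(seq, "g", "Z", right="e")   # analogue of rule 0200 in the next section
apply_rule(seq, "g", "g")              # analogue of the default rule 0201
# only the <g> rules are sketched, so remove_graphemes(seq) yields ['Z'] here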
3 Transducer Composition
The rules of the letter-to-sound module are organized in various phases, each represented by transducers that can be composed to build the full module. Figure 1 shows how the various phases are composed. Each phase has the following function:
– introduce-phones is the simple rule that inserts the empty phone placeholder after each grapheme (($Letter (NULL → EMPTY)) ⇒).
– the exception-lexicon contains the pronunciation of frequent words not covered by the rules.
– the stress phase consists of rules that mark the stressed vowel of the word.
– prefix-lexicon consists of pronunciation rules for compound words, namely with roots of Greek or Latin origin such as "tele" or "aero".
[Fig. 1. Phases of the knowledge-based system: introduce-phones ∘ exception-lexicon ∘ stress ∘ prefix-lexicon ∘ gr2ph ∘ sandhi ∘ remove-graphemes]
– gr2ph is the bulk of the system, and consists of a set of rules that convert the graphemes (differentiating between diacritics) to phones.
– sandhi implements word co-articulation rules across word boundaries. (This rule set was not tested here, given the fact that the test set consists of isolated words.)
– remove-graphemes removes the graphemes in order to produce a sequence of phones ($Letter → NULL / _ ).

The following example (in EP) illustrates the specification of two gr2ph rules for deriving the pronunciation of the grapheme g: either as /Z/ (e.g. agenda, gisela) when followed by e or i, or as /g/ otherwise (SAMPA symbols used):

OB_RULE 0200, g EMPTY -> g _Z \
    / NULL ___ ($AllE | $AllI)
OB_RULE 0201, g EMPTY -> g _g \
    / NULL ___ NULL

The compilation of the rules may result in a very large number of FSTs that may be composed in order to build a single grapheme-to-phone transducer. Alternatively, to avoid the excessive size of this single transducer, one can selectively compose the FSTs in order to obtain a smaller set that can later be composed with the grapheme FST at runtime to obtain the phone FST.
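For illustration, the first of the two rules above could be compiled with pynini, an open-source implementation of the Mohri and Sproat rewrite-rule construction (an assumption on our part: pynini is not the authors' toolkit, and the toy alphabet below stands in for the real grapheme and phone symbol sets):

import pynini

sigma_star = pynini.union(*"abcdefghijlmnopqrstuvxzZ").closure().optimize()

# OB_RULE 0200: <g> -> [Z] when followed by e or i
rule_0200 = pynini.cdrewrite(pynini.cross("g", "Z"),
                             "", pynini.union("e", "i"), sigma_star)
# the default phone [g] shares the grapheme's symbol in SAMPA, so no
# second rewrite is needed in this toy setting

print(pynini.shortestpath("gelo" @ rule_0200).string())  # -> "Zelo"
print(pynini.shortestpath("gato" @ rule_0200).string())  # -> "gato"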
4 The SAMPA Phonetic Alphabet for Both Languages
The SAMPA phonetic alphabet for EP (http://www.l2f.inesc-id.pt/~imt/sampa.html) was defined in the framework of the SAM-A European project and includes 38 phonetic symbols. Table 1 lists the additional symbols that had to be defined for Mirandese, together with some examples. They cover two nasal vowels, 3 non-strident fricatives corresponding to b, d, g in intervocalic position or after r, and 2 retroflex fricatives.
Table 1. Additional SAMPA symbols for Mirandese
SAMPA  Orthography  Transcription
@~     centelha     s@~t"ejL6
E~     benga        b"E~g6
B      chuba        tS"uB6
D      roda         R"OD6
G      pega         p"EG6
s`     sol          s`"Ol~
z`     rosa         R"Oz`6
5 Transducer Approach for European Portuguese
The transducer approach for EP involved a large number of rules: 27 for the stress transducer, 92 for the prefix-lexicon transducer, and 340 for the gr2ph transducer. The most problematic one was the latter. We started by composing each of the other phases into a single FST. gr2ph was first converted into one FST per grapheme. Some graphemes, such as e, lead to large transducers, while others lead to very small ones. Due to the way we specified the rules, the order of composition of these FSTs was irrelevant. Thus we had much flexibility in grouping them, and managed to obtain 8 transducers with an average size of 410k. Finally, introduce-phones and remove-graphemes were composed with the other FSTs and we obtained the final set of 10 FSTs. At runtime, we can either compose the grapheme FST in sequence with each FST, removing dead-end paths at each step, or we can perform a lazy simultaneous composition of all FSTs. This last method is slightly faster than the DIXI system. In order to assess the performance of the FST-based approach, we used a pronunciation lexicon built on the PF ("Português Fundamental") corpus. The lexicon contains around 26,000 forms. 25% of the corpus was randomly selected for evaluation. The remaining portion of the corpus was used for training or debugging. As a reference, we ran the same evaluation set through the DIXI system, obtaining an error rate of 3.25% at the word level and 0.50% at the segmental level. The first test of the FST-based approach was done without the exception lexicon. The FST achieved almost the error rate of the DIXI system it emulates, both at the word level (3.56%) and at the segmental level (0.54%). When we integrate the exception lexicon used in DIXI, the performance is exactly the same as DIXI's. We plan to replace some rules that apply to just a few words with lexicon entries, thus hopefully achieving a better balance between the size of the lexicon and the number of rules.
6 Transducer Approach for Mirandese
The porting of the FST-based approach from EP to Mirandese involved changing the stress and gr2ph transducers. The stress rules showed only small differences
compared to the ones for EP (e.g. stress of the words ending in ç, n, and ie). The gr2ph transducer was significantly smaller than the one developed for EP (around 100 rules), reflecting the much closer grapheme-phone relationship. The hardest step in the porting effort involved the definition of a development corpus for Mirandese. Whereas for EP the choice of the reference pronunciation (the one spoken in the Lisbon area and most often observed in the media) was fairly easy, for Mirandese it was a very hard task, given the differences between the pronunciations observed in the different villages of the region. This called for a thorough review of the lexicon and checking with native speakers. For development, we used a small lexicon of about 300 words extracted from the examples in [1]. For testing, we used a manually transcribed lexicon of around 1,100 words, built from a corpus of oral interviews conducted by CLUL in the framework of the ALEPG project (Atlas Linguístico-Etnográfico de Portugal e da Galiza). As a starting point, we selected the interviews collected in the village of Duas Igrejas, which was also the object of the pioneering studies of Mirandese by José Leite de Vasconcelos [9]. Our first tests were done without an exceptions lexicon. In our very small development set, we obtained 11 errors (a 3.68% error rate at the word level), all of which are exceptions (foreign words, function words, etc.). For the test set, a similar error rate was obtained (3.09%). Roughly half of the errors will have to be treated as exceptions, and half correspond to stress errors. For more details concerning differences between the two rule sets, and a discussion of the types of error, see [10].
7 FST-Based Concatenative Synthesis
This section describes on-going work toward the development of other modules of a text-to-speech (TTS) system using FSTs. In particular, it covers the waveform generation module, which is based on the concatenation of diphones. A diphone is a recorded speech segment that starts at the steady phase of a first phone (generally close to the mid part of the phone) and ends at the steady phase of the second one. By concatenating diphones, one can capture all the events that occur in the phone transitions, which are otherwise difficult to model. Our FST-based system is in fact based on the concatenation of triphones, which builds on this widely used diphone concatenation principle. A triphone is a phone that occurs in a particular left and right context. For example, the triphone a-b-c is the version of b that has a on the left and c on the right. In order to synthesize a-b-c, we concatenate the diphones a—b and b—c and then remove the portions corresponding to phones a and c. Our first step in the development of this type of system for EP was the construction of a diphone database. A common approach is to generate a set of nonsense words (logatomes), containing a center diphone as well as surrounding carrier phones. After generating the list of prompts, they were recorded in a soundproof room, with a head-mounted microphone to keep the recording
conditions reasonably constant among sessions. We also tried to avoid variations in the speaker's rhythm and intonation, in order to reduce concatenation problems. The following step was the phonetic alignment of the recorded prompts, which was done manually. Rather than marking the phone boundaries, we need to select phone mid parts. For each triphone a-b-c, we tried to minimize discontinuities on both diphones a—b and b—c, by performing a local search for the best concatenation point in the mid parts of the two samples of b. We used the Euclidean distance between the Line Spectral Frequencies (LSF), because of their relationship to formant frequencies and their bandwidths. By avoiding discontinuities on the formants, we solve some of the concatenation problems, but not all of them. Since the signal energy may differ at the chosen concatenation points, the last step is to scale the speech signals at the diphone boundaries. The scale factor is the ratio between the energy of the last pitch period of the first diphone and the energy of the first pitch period of the second diphone. This scale factor approaches one as we approach the phone boundary, to avoid changing the energy of other phones. We were not very concerned with discontinuities of the signal fundamental frequency because, during the recording procedure, the speaker kept it fairly constant. Using the triphone database, speech synthesis can be performed by first converting graphemes into phones, then phones into triphones, and finally concatenating the sound waves corresponding to the triphones. This process can be represented as the transducer composition cascade W ◦ G2P ◦ Tr ◦ DB, where W is the sentence, G2P is the grapheme-to-phone transducer, Tr is the phone-to-triphone transducer and, finally, DB is a transducer that maps triphones into sequences of samples. The phone-to-triphone transducer Tr is constructed as the composition of two bigram transducers Tr = Bdi ◦ Bph. The bigram transducers map their input symbols into pairs of symbols; for example, given a sequence a, b they produce (#, a), (a, b), (b, #). A bigram transducer can be built by creating a state for each possible input symbol and creating, for each symbol pair (a, b), an edge linking state a with state b with input b and output (a, b). This prototype system, which for the time being is completely devoid of prosody modules, was only built for EP. However, the system can be used with the Mirandese letter-to-sound transducer composed with a phone mapping transducer in order to produce an approximation of the acoustic realization of an utterance in Mirandese as spoken by an EP speaker. We expect to have funding in the near future to record a native Mirandese speaker and process the corresponding database.
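A functional equivalent of the bigram mapping just described takes only a few lines. The sketch below is an illustration in Python, not part of the authors' FST implementation; the padding symbol # follows the notation used above.

def bigrams(symbols, pad="#"):
    """Map a symbol sequence to its padded bigram sequence, mirroring
    the bigram transducer: [a, b] -> [(#, a), (a, b), (b, #)]."""
    padded = [pad] + list(symbols) + [pad]
    return list(zip(padded, padded[1:]))

# Applied to a phone sequence, the resulting pairs are exactly the
# diphones whose recorded waveforms get concatenated:
print(bigrams(["a", "b", "c"]))  # [('#', 'a'), ('a', 'b'), ('b', 'c'), ('c', '#')]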
8 Concluding Remarks
This paper described an FST-based approach to letter-to-sound conversion that was first developed for European Portuguese and later ported to the other official language in Portugal, Mirandese. The hardest part of this task turned out to
be the establishment of a reference pronunciation lexicon that could be used as the development corpus, given the observed differences in pronunciation between the inhabitants of the small villages in that region. The use of finite state transducers allows a very flexible and modular framework for deriving new rule sets and testing the consistency of orthographic conventions. Based on this experience, we think that letter-to-sound systems could be useful tools for researchers involved in the establishment of orthographic conventions for lesser spoken languages. Moreover, such tools could be helpful in the design of such conventions for other partner languages in the CPLP community.

Acknowledgments. We gratefully acknowledge the help of António Alves, Matilde Miguel, and Domingos Raposo.
References
1. M. Barros-Ferreira and D. Raposo, editors. Convenção Ortográfica da Língua Mirandesa. Câmara Municipal de Miranda do Douro – Centro de Linguística da Universidade de Lisboa, 1999.
2. I. Trancoso, M. Viana, F. Silva, G. Marques, and L. Oliveira. Rule-based vs. neural network based approaches to letter-to-phone conversion for portuguese common and proper names. In Proc. ICSLP '94, Yokohama, Japan, September 1994.
3. L. Oliveira, M.C. Viana, A.I. Mata, and I. Trancoso. Progress report of project dixi+: A portuguese text-to-speech synthesizer for alternative and augmentative communication. Technical report, FCT, January 2001.
4. D. Caseiro, I. Trancoso, L. Oliveira, and C. Viana. Grapheme-to-phone using finite state transducers. In Proc. 2002 IEEE Workshop on Speech Synthesis, Santa Monica, CA, USA, September 2002.
5. L. Oliveira, M. Viana, and I. Trancoso. A rule-based text-to-speech system for portuguese. In Proc. ICASSP '92, San Francisco, USA, March 1992.
6. K. Koskenniemi. Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production. PhD thesis, University of Helsinki, 1983.
7. E.L. Antworth. PC-KIMMO: A two-level processor for morphological analysis. Technical report, Occasional Publications in Academic Computing No. 16. Dallas, TX: Summer Institute of Linguistics, 1990.
8. M. Mohri and R. Sproat. An efficient compiler for weighted rewrite rules. In 34th Annual Meeting of the Association for Computational Linguistics, Santa Cruz, USA, 1996.
9. J. Vasconcellos. Estudos de Philologia Mirandesa. Imprensa Nacional, Lisboa, 1900.
10. D. Caseiro, I. Trancoso, C. Viana, and M. Barros. A comparative description of GtoP modules for portuguese and mirandese using finite state transducers. In Proc. ICPhS 2003, Barcelona, Spain, August 2003.
A Methodology to Analyze Homographs for a Brazilian Portuguese TTS System

Filipe Barbosa(1), Lilian Ferrari(2), and Fernando Gil Resende(1)

(1) Escola Politécnica, Universidade Federal do Rio de Janeiro, Brazil. {filipe,gil}@lps.ufrj.br
(2) Faculdade de Letras, Universidade Federal do Rio de Janeiro, Brazil. [email protected]
Abstract. In this work, a methodology to analyze words that are homographs and heterophones is proposed, to be applied in a Brazilian Portuguese text-to-speech system. The reasoning is based on grammatical construction. An algorithm structured on the presented methodology was implemented to solve the reading decision problem for the word sede, achieving a 95.0% accuracy rate when tested on the CETEN-Folha text database.
1 Introduction
Homographs are words which have the same spelling, but different meanings. For the development of a text-to-speech (TTS) system, cases of homographs which are heterophones are especially problematic, because whenever they occur, the algorithm that transcribes graphemes into phonemes has to decide between two possible readings. This paper provides a detailed analysis of the word sede, as a case study for the problem of homographs which are heterophones. The phonetic forms [sedi] or [sEdi] can be realized, depending on the context. The proposed methodology relies on the notion of grammatical construction, which is being developed by researchers in cognitive linguistics. Based on the presented analysis, an algorithm to decide how the Brazilian Portuguese (BP) TTS system should read the word sede was implemented and tested on the CETEN-Folha text database [1], which contains 24 million words, with 2278 occurrences of sede. An accuracy rate of 95.0% was achieved. This article is organized as follows. In Sect. 2, some fundamental concepts of cognitive grammar are presented. Sections 3 and 4 deal with the hypotheses and the corresponding analysis, respectively. In Sect. 5, experimental results are shown and discussed. Section 6 presents our conclusions. For the sake of clarity, in this article, Portuguese words and phrases will be printed in italic fonts, immediately followed by the corresponding English translation, in parentheses. The SAMPA phonetic alphabet [2] is used in this work.
2 Fundamental Concepts
The analysis developed here relies on the framework usually referred to as cognitive grammar [3,4,5]. The central notion of this framework is the idea that grammatical structure is inherently symbolic. It can be characterized as a structured inventory of conventional linguistic units, which provides the means for expressing ideas in linguistic form. For our purposes, a particularly interesting kind of unit is the constructional schema. Constructional schemas are symbolic units which are complex and schematic, that is to say, more abstract and specified in less detail. There are low-level schemas, such as "ANIMATE want THING", which can be instantiated by "I want chocolate"; and higher-order schemas, such as "ANIMATE PROCESS THING", which represents a structure where the subject precedes the verb, which comes before the direct object.
3 Hypothesis
The following hypotheses guided the analysis:
I. For the distinction between the nouns [sedi] and [sEdi], the relevant constructions are the noun phrases, prepositional phrases and verb phrases in which these nouns occur.
II. Given their difference in meaning, [sEdi] and [sedi] will also differ in regard to the low-level schemas that they instantiate.
III. Although each slot in these schemas can be filled by any element of the word class associated with it, only a limited number of words will productively occur.
4 Analysis
Since the nouns [sedi] and [sEdi] can take part in noun phrases, prepositional phrases or verb phrases, the analysis focused on the different types of constructional schemas that are relevant for the distinction between them. The examples given below are samples of the occurrences for each construction.

4.1 Nominal Constructions
The following analysis presents three kinds of nominal constructions, shown in Table 1, which have to be taken into account for the choice between [sedi] and [sEdi]. The adjective slot for the Right Adjective Modified Nominal Construction can be filled by words like principal ("main"), for [sEdi], and insaciável ("uncontrolled"), for [sedi]. For the Left Adjective Modified Nominal Construction, the occurrence of the adjective preceding the noun is more productive with [sEdi], which tends to co-occur with words such as futura ("future") and nova ("new"). As for [sedi], the adjectives muita ("much"), pouca ("little") and bastante ("much") often appear. Finally, for the Noun Prepositional Phrase (Noun-PP) Construction, examples of the two forms, [sedi] and [sEdi], are sede de justiça ("thirst of justice") and sede da organização ("seat of the organization"), respectively.

Table 1. Schemas for nominal constructions

Right Adjective Modified Nominal Construction
  Type: Noun Phrase | Noun: [sedi] or [sEdi] | Adjective: adjective
Left Adjective Modified Nominal Construction
  Type: Noun Phrase | Adjective: adjective | Noun: [sedi] or [sEdi]
Noun-PP construction for [sedi]
  Type: Noun Phrase | Noun1: [sedi] | Preposition: de | Noun2: noun
Noun-PP construction for [sEdi]
  Type: Noun Phrase | Noun1: [sEdi] | Preposition + Article: de, em + a, o = da, do | Noun2: noun

4.2 Prepositional Constructions
Since the noun [sEdi] is semantically a locative, prepositional constructions headed by locative prepositions are important for the prediction of this form. Two types of prepositional constructions are shown in Table 2. Regarding the Locative Prepositional Construction schema, it is worth noting that the prepositions a and em are normally contracted, giving the forms na and à. Examples of the Complex Locative Prepositional Construction are those which have the Noun1 slot filled by entrada ("entrance") or abertura ("opening"), for [sEdi], and hora ("time") or momento ("moment"), for [sedi].

Table 2. Schemas for prepositional and verbal constructions

Locative Prepositional Construction
  Type: Prepositional Phrase | Preposition: em, a, para | Determiner: a, uma, aquela | Noun: [sEdi]
Complex Locative Prepositional Construction
  Type: Prepositional Phrase | Noun1: noun | Preposition + Article: de + a = da | Noun2: [sEdi] or [sedi]
Transitive Verbal Constructions
  Type: Prepositional Phrase | Adverb: adverb | Preposition + Article: de + "a" = da | Noun2: [sEdi]
Intransitive Verbal Constructions
  Type: Verb Phrase | Verb: verb | Preposition: com, de | Noun: [sedi]

4.3 Verbal Constructions
As for verbal constructions, described in Table 2, two main types can be found. For the Transitive Verbal Construction, the verbs inaugurar ("to inaugurate") and matar ("to kill") occur with [sEdi] and [sedi], respectively. The intransitive verbal constructions are especially productive for [sedi]. The verbs morrer ("to die") and estar ("to be") are the most frequent.
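To make the decision procedure concrete, the sketch below encodes a few of the constructional cues above as simple context tests. This is an illustrative simplification, since the paper does not spell out the implemented algorithm, and the word lists are only the examples quoted in this section.

# Hypothetical cue lists taken from the examples in Sects. 4.1-4.3.
LOCATIVE_PREPS = {"na", "à"}                              # em/a + determiner
ADJS_SEDI_E = {"principal", "futura", "nova"}             # favour [sEdi]
ADJS_SEDI = {"insaciável", "muita", "pouca", "bastante"}  # favour [sedi]

def read_sede(prev_word, next_word):
    """Pick a reading for 'sede' from its immediate context."""
    if prev_word in LOCATIVE_PREPS:
        return "[sEdi]"     # Locative Prepositional Construction
    if prev_word in ADJS_SEDI_E:
        return "[sEdi]"     # Left Adjective Modified Nominal Construction
    if prev_word in ADJS_SEDI:
        return "[sedi]"
    if next_word == "de":
        return "[sedi]"     # sede de justiça ("thirst of justice")
    if next_word in ("da", "do"):
        return "[sEdi]"     # sede da organização ("seat of the organization")
    return "[sEdi]"         # majority-class fallback

print(read_sede("muita", "de"))   # [sedi]
print(read_sede("na", "da"))      # [sEdi]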
5 Experimental Results
The algorithm developed to deal with the word sede was tested using the CETEN-Folha text database [1]. This database was extracted from Folha de São Paulo, a Brazilian newspaper, and contains around 24 million words, with 2278 occurrences of sede, divided into 298 occurrences of [sedi] and 1891 of [sEdi]. Accuracy rate results for the forms [sedi] and [sEdi], as well as the overall statistics, are given in Table 3. With the proposed method, the total accuracy rate reaches 95.0%, while the individual accuracy rates for [sedi] and [sEdi] are 90.6% and 95.6%, respectively.

Table 3. Results for the word sede

Phonetic form     Occurrences   Accuracy
[sEdi]            1891          95.6%
[sedi]            297           90.6%
[sEdi] + [sedi]   2278          95.0%
6 Conclusions
In this paper, a methodology to deal with homographs in BP TTS systems is proposed. The basic idea relies on cognitive grammar. The word sede was used as a case study and the corresponding analysis was presented. The implemented algorithm was applied to the CETEN-Folha text database, achieving an accuracy rate of 95.0%. Using a similar framework to solve the problem for other heterophone and homograph words is the subject of ongoing research.
References
1. Corpus de Extractos de Textos Electrônicos NILC/Folha de São Paulo (CETEN-Folha). http://acdc.linguateca.pt/cetenfolha/
2. Speech Assessment Methods Phonetic Alphabet (SAMPA). http://www.phon.ucl.ac.uk/home/sampa
3. Goldberg, A.: Constructions: A Construction Grammar Approach to Argument Structure. University of Chicago Press (1995)
4. Langacker, R.: Foundations of Cognitive Grammar, vol. 1: Theoretical Prerequisites. Stanford, California: Stanford University Press (1987)
5. Langacker, R.: Foundations of Cognitive Grammar, vol. 2: Descriptive Application. Stanford, California: Stanford University Press (1991)
Automatic Discovery of Brazilian Portuguese Letter to Phoneme Conversion Rules through Genetic Programming

Evandro Franzen(1) and Dante Augusto Couto Barone(2)

(1) UNISC – Universidade de Santa Cruz do Sul, Av. Independência, 2293 – Bairro Universitário, CEP 96815-900, Santa Cruz do Sul – RS. [email protected]
(2) UFRGS – Universidade Federal do Rio Grande do Sul, Instituto de Informática, Av. Bento Gonçalves, 9500 – Campus do Vale – Bloco IV, Bairro Agronomia – Porto Alegre – RS – Brasil, CEP 91501-970, Caixa Postal 15064. [email protected]
Abstract. Letter to phoneme conversion is a basic step in speech synthesis processes. Traditionally, the activity involves the implementation of rules that define the mapping of letters into sounds. This paper presents results of the application of an evolutionary computation technique (Genetic Programming) to Brazilian Portuguese synthesis, aiming to automatically discover programs that implement specific synthesis rules.
1 Introduction
Spoken language is the most used form of communication between humans, being simultaneously powerful and simple. The interaction between men and machines continues to be a hard problem to solve, and the application of Artificial Intelligence techniques to perform these tasks is not straightforward [5]. Automatic speech processing performed by computers mainly involves two different kinds of problems: speech recognition and speech synthesis [5]. The first aims to convert an acoustic signal, captured by a microphone or by a telephone, into a set of intelligible words or phrases [8]. The second one consists of the automatic generation of voice waveforms, commonly generated from a written or a stored text [8]. One of the most common approaches for performing speech synthesis is the Text To Speech (TTS) technique. In this approach, a text is converted into a set of phonemes which are gathered to produce synthetic "voice" signals [4]. In most of the world's spoken languages, a written text does not correspond directly to its pronunciation; thus, to describe the correct pronunciation, a set of symbolic representations becomes necessary. Each language possesses some intrinsic characteristics, such as a different phonetic alphabet and a set of possible phonemes and their combinations. Each language possesses a specific set of phonemes, which can be defined as "elementary" sounds, used as "bricks" to construct any sound found in speech productions of that language [8]. In many languages there isn't an exact consistency
between phonemes and the corresponding letters (graphemes) that can produce them [11]. The present work is part of the Spoltech Project [2], which aims to create, develop and provide speech processing technologies, speech recognition and synthesis, for Brazilian Portuguese. As the synthesis procedure we use concatenation of diphones. Letter to phoneme conversion is done through rules described in the LTS (letter-to-sound) module. One of the major goals of the present work is to provide the tool used by the Spoltech synthesizer (the Festival environment [12]) with additional advanced technologies, based on Genetic Programming, to accomplish the processes that compose speech synthesis.
2 Modeling the Problem Using Genetic Programming
The rules to convert letters into phonemes can be represented through computer programs. If they were implemented by human programmers, they would probably have this form: IF (current letter is x) THEN the phoneme is y ELSE... In accordance with [6], there are five major steps in preparing the utilization of genetic programming to solve a specific problem: i) determining the set of terminals; ii) determining the set of functions; iii) determining the fitness measure and the fitness cases; iv) determining the parameters and variables for controlling the run; and v) determining the method of designating a result and the criterion for stopping a run. The activity of converting letters into phonemes can be summarized as the application of rules to words or letters to discover the corresponding phonemes. Thus, the initial step for determining one or more strategies to find solutions through Genetic Programming is to define whether the fitness cases will consist of a set of letters or words. In our case, we are using words to have the fitness measured. The set of fitness cases is defined as a word list, each word composed of a list of letters ((p a t o) (t i p o)). The correct answers for each case are specified as lists of phonemes in the same way ((p a t o) (ts i p o)). The definition of cases and respective answers in list form was chosen because of the ease with which the LISP language deals with list processing; however, other data structures can be used without compromising the solution search process. To evaluate the answer produced by each individual, the Genetic Programming system [3] calculates the raw fitness using the following rules:
• if the produced phoneme is equal to the expected one in a given position, three points are credited to the solution;
• if the produced phoneme differs in a given position, but represents an expected phoneme in any other position of the word, one point is credited to the solution.
Standardized fitness must always indicate better solutions with lower values, tending to zero. In this problem we start with the biggest expected value for fitness, diminishing it as solutions evolve, as sketched below.
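A minimal sketch of this scoring scheme follows. The paper's system is written in LISP; the Python function names and the representation of an individual as a callable are assumptions made only for the illustration.

def raw_fitness(produced, expected):
    """Credit 3 points for the right phoneme in the right position and
    1 point for a phoneme that is expected elsewhere in the word."""
    score = 0
    for i, phone in enumerate(produced):
        if i < len(expected) and phone == expected[i]:
            score += 3
        elif phone in expected:
            score += 1
    return score

def standardized_fitness(cases, answers, individual):
    """Lower is better; a 'perfect' solution scores 0."""
    maximum = sum(3 * len(answer) for answer in answers)
    raw = sum(raw_fitness(individual(case), answer)
              for case, answer in zip(cases, answers))
    return maximum - raw

# For the fitness cases quoted above, a trivial candidate that copies
# its input scores 21 of the 24 possible points ('t' before 'i' should
# have become 'ts'), giving a standardized fitness of 3:
cases = [list("pato"), list("tipo")]
answers = [["p", "a", "t", "o"], ["ts", "i", "p", "o"]]
identity = lambda letters: letters
print(standardized_fitness(cases, answers, identity))  # 3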
3 Experimental Results
The set of rules to convert a letter into a phoneme in Portuguese is extremely wide. Silva [11] describes a set of more than eighty rules that cover the diverse contexts in which letters are used. The main rules considered in our Genetic Programming system are the following (rules ii and iii are sketched in code after the list):
i) a direct relation between letters and phones, watching for the special occurrence of the letters "t" and "d" before "i";
ii) the letter "c", when used before the letters "e" or "i", is represented by the phoneme [s], and in the other cases by the phoneme [k];
iii) the letter "s" represents the phoneme [z] in two situations: when it occurs between two vowels or when it is used before voiced consonants ("d", "b", "m", "g", "n");
iv) the use of "ss" results in the production of only one phoneme [s];
v) the letter "z" corresponds to the phoneme [s] at the end of a word; in the cases where it is followed by a vowel, the phoneme corresponds to the letter itself.
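Rules ii and iii are simple enough to state directly. The hand-written sketch below is only a reference point for what the evolved programs have to rediscover; it is not taken from the paper.

def c_phone(word, i):
    """Rule ii: 'c' before 'e' or 'i' -> [s]; otherwise [k]."""
    nxt = word[i + 1] if i + 1 < len(word) else ""
    return "s" if nxt in ("e", "i") else "k"

def s_phone(word, i):
    """Rule iii: 's' between vowels or before a voiced consonant -> [z]."""
    vowels, voiced = set("aeiou"), set("dbmgn")
    prev = word[i - 1] if i > 0 else ""
    nxt = word[i + 1] if i + 1 < len(word) else ""
    in_between = prev in vowels and nxt in vowels
    return "z" if in_between or nxt in voiced else "s"

print(c_phone("cedo", 0), c_phone("cama", 0))   # s k
print(s_phone("casa", 2), s_phone("salto", 0))  # z s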
Fig. 1. Fitness evolution across generations: population fitness average, best generation fitness, best execution fitness, and worst generation fitness.
The standardized fitness of the best individual was found at generation 60, corresponding to the value 30. Optimized solutions tend to lower values: a "perfect" solution corresponds to 0, since in our modelling we have used as the fitness measure the difference between the expected score for the correct pronunciation of a string of words (correct by definition) and the score obtained through the application of the rules described above. In Fig. 1, we show the fitness evolution for one of the cases tested in this work.
4 Final Conclusions
This article has presented a technique for the automatic discovery of programs and has shown the results of its application to the automatic derivation of letter-to-phoneme conversion rules for Brazilian Portuguese. The experiments carried out
have demonstrated that it is possible to construct systems able to discover rules through a supervised learning technique using Genetic Programming. The research was developed in the context of the SPOLTECH project (an international cooperation between Brazil (UFRGS) and the USA (University of Colorado)), adding a specific tool for Portuguese phonetic rules based on a Genetic Programming approach. The ease of representing solutions with common programming-language instructions makes the technique flexible and easy to apply to different problems in speech synthesis. One of the major difficulties found in the analysis of the results consists in properly describing the activity implemented by each solution. This comes directly from the increasing complexity of the solutions and also from the number of instructions that constitute the solution individuals. Another important issue is the definition of a set of fitness cases which properly represents the rules to be discovered, without compromising other contexts in which the letters can occur.
References
1. Banzhaf, W.; Nordin, P.; Keller, R.; Francone, F.D.: Genetic Programming - An Introduction. On the Automatic Evolution of Computer Programs and Its Applications. San Francisco: Morgan Kaufmann, 1998. 470 p.
2. Spoltech. Advancing Human Language Technology in Brazil and the United States Through Collaborative Research on Portuguese Spoken Language Systems. In: PROTEM-CC, 4., 2001, Rio de Janeiro. Projects Evaluation Workshop: international cooperation: proceedings. Brasília: CNPq, 2001. p. 118–142.
3. Soucek, B.; Iris Group: Dynamic, Genetic and Chaotic Programming. New York: John Wiley & Sons, 1992.
4. Dutoit, T.: An Introduction to Text-To-Speech Synthesis. Dordrecht: Kluwer Academic, 1996. 280 p.
5. Hausser, R.: Foundations of Computational Linguistics. Man-Machine Communication in Natural Language. Berlin: Springer-Verlag, 1999. 534 p.
6. Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge: The MIT Press, 1992. 819 p.
7. Koza, J.R.: Genetic Programming II: Automatic Discovery of Reusable Programs. Cambridge: The MIT Press, 1998. 746 p.
8. Lemmetty, S.: Review of Speech Synthesis Technology. Available at:
Experimental Phonetics Contributions to the Portuguese Articulatory Synthesizer Development

António Teixeira, Lurdes Castro Moutinho, and Rosa Lídia Coimbra

Universidade de Aveiro, 3810-193 Aveiro, Portugal. [email protected]
Abstract. In this paper we present current work and results in two Experimental Phonetics projects motivated by our ongoing development of an articulatory synthesizer for Portuguese. Examples of analyses and results regarding glottal source parameters and from EMMA and acoustic analyses related to the tongue position are presented. In our studies contextual and regional variation is considered.
1 Introduction
Our articulatory synthesizer for Portuguese currently consists of an application that runs in the Windows environment and allows the synthesis of, among others, nasal sounds, vowels and nasal consonants with acceptable quality [1]. It is, however, our intention to integrate other knowledge, obtained through an integrated and multidisciplinary contribution of researchers from different areas. The aim of this communication is to report on ongoing projects that will contribute to the improvement of the synthesizer.
2 Phonetics Applied to Speech Processing
The research accomplished prior to this project showed that the variation of the velum, and even of some other articulators, influences the production and perception of nasality. However, detailed production and acoustic studies do not exist. Information concerning the behavior of the glottal source during the production of these vowels is also necessary for the continuation of this work, as well as information concerning regional variation.

EMMA Corpus. In a previous project, information concerning the position of the tongue, lips and velum during the production of words and phrases containing nasal sounds had already been collected, using a system of ElectroMagnetic Midsagittal Articulography (EMMA). This technique, however, is not viable for a large number of speakers, nor does it supply information concerning the phonation process.
Funded by FCT projects POSI/36427/PLP/2000 and PRAXIS P/PLP/11222/1998.
Fig. 1. Plot of the dorsum tongue sensor at 3 different points (10%, 50% and 90% of duration) of [6∼], [a] and [6] pronunciations (top) and [u∼] and [u] (bottom).
Analysis of this corpus has already contemplated the study of velum movement between stops and after nasal consonants [2] and is now addressing new questions. When trying to perform articulatory synthesis of Portuguese nasal vowels, we were faced with the inexistence of accurate information regarding the oral articulators and velum positions used in their production. The situation is worse for the vowels [6∼], [e∼] and [o∼], each "corresponding" to two oral vowels. Figure 1 presents information regarding tongue position for the set [6∼], [a] and [6] and for [u∼]/[u]. For [6∼], tongue configurations cover both oral [a] and [6], which is more noticeable in configurations near the beginning. Observing the ellipses, representing points at one standard deviation from the mean, [6∼] mostly assumes an [6] configuration. At the end, the tongue seems to also use configurations somewhat different from [a] and [6], possibly due to coarticulation with the following segment. Configurations for [u∼] are very similar to the ones used for [u], especially at the beginning of the nasal vowel. Our data and analysis method allow similar studies with the other nasal vowels, factoring in context and accent.

New Acoustic Corpus. With the objective of continuing this study, a new corpus was defined and organized so as to contemplate the different phonetic contexts [3]. The recordings already include several regional variants: Minho, Douro Litoral, Beiras (Litoral and Interior), Alentejo and Algarve. Speakers aged 35 and over were chosen. They were born and living in the selected areas, and did not have more than compulsory schooling. The recordings have always been done locally. The signal was registered directly onto the hard disk. During recordings, visual stimuli were used whenever possible.
Fig. 2. Open quotient for EP male and female oral and nasal vowels, as a function of vowel and nasality (one panel per gender). Gray is used for nasals.
Pictures led the informant to produce the intended words, thus avoiding reading. Two repetitions of the corpus were requested from each speaker. One area where information for developing the articulatory synthesizer is scarce is source-related parameters. Having recorded the EGG signal simultaneously with speech, our corpus is well suited to extract such parameters. A detailed analysis was already performed using 6 male speakers from 3 regions. Results have been presented at a conference and are submitted [6]. In Fig. 2, the boxplot of the open quotient by vowel and nasality is presented. The figure shows no significant differences in either average values or dispersion for the vowels, oral or nasal. This parameter proved to be idiolectal. To complement the EMMA corpus information, we are now starting the extraction and analysis of tract-related parameters from our new acoustic corpus. As part of a study of EP nasal vowel height [4], we are analyzing the first two formants at the very beginning of nasal vowels after stop consonants. Using F1 as a measure of vowel height, for the average of all regions, speakers and contexts, [6∼] height is between [a] and [6], [o∼] height is between [O] and [o], and [e∼] height is between [E] and [e]. The other two, [u∼] and [i∼], have heights similar to the corresponding orals. Looking at the F1 results for different regions, in Fig. 3, the overall tendency is not observed in some situations: for the Beira Litoral informants [o∼] is more like [o], regarding height; for speakers native of Beira Interior [o∼] is as high as [O] and [e∼] as high as [E]; for Minho speakers the raising of [6∼] seems not to occur, [6∼] being more like [a].
3 Multimedia Prosodic Atlas for the Romance Languages
This project is part of a research effort supervised by the Centre de Dialectologie, Université Stendhal, involving several European universities, and its main goal is to study the prosodic configuration of the spoken linguistic varieties in the Romance dialectological space. The study focuses on vocalic segments, since it is considered that they are the ones that carry most of the relevant prosodic information, and also because this is the methodology used by all other European AMPER teams. The parameters analysed are the duration, pitch and intensity of vowels [5].
Fig. 3. First formant for EP oral and nasal vowels for 4 different regions (Alentejo, Beira Litoral, Beira Interior and Minho).
4 Conclusion
The results presented made it possible to fill some gaps in the information needed for the articulatory synthesis of EP vowels, especially the nasals. Information was obtained about the open quotient and F0 for the different nasal vowels and about the oral tract configurations employed in the production of nasal vowels. Synthesis experiments may now be done using the new data for European Portuguese. Another important result of these two projects is the creation of new corpora, including data available for the first time for EP, with great potential for further studies.
References
1. Teixeira, A., et al.: SAPWindows - Towards a versatile modular articulatory synthesizer. IEEE-SP Workshop on Speech Synthesis (2002)
2. Teixeira, A., Vaz, F.: European Portuguese nasal vowels: An EMMA study. Proc. Eurospeech (2001)
3. Moutinho, L. et al.: Contributos para o Estudo da Variação Contextual e Regional do Português Europeu. Encontro Comemorativo dos 25 Anos do CLUP (2002) 5–17
4. Teixeira, A. et al.: Production, Acoustic and Perceptual Studies on European Portuguese Nasal Vowels Height. ICPhS (2003) (accepted)
5. Contini, M. et al.: Un projet d'Atlas Multimédia Prosodique de l'Espace Roman. Speech Prosody (2002) 227–230
6. Teixeira, A.: Para a melhoria da síntese articulatória das vogais nasais do Português Europeu: Estudo da duração e de características relacionadas com a fonte glotal. 1o. Cong. Int. Fonética e Fonologia, Belo Horizonte (2002)
A Study on the Reliability of Two Discourse Segmentation Models

Eva Arim, Francisco Costa, and Tiago Freitas

ILTEC, Rua do Conde de Redondo 74, 5º, 1100-109 Lisboa, Portugal. {earim,fcosta,taf}@iltec.pt
Abstract. This paper describes an experiment we conducted in order to test the reliability of two discourse segmentation models which have been widely used in computational linguistics. The main purpose of the test is to pick one of them for our future research, which aims to assess the role of prosody in structuring discourse in European Portuguese. We compared the models of Grosz and Sidner (1986) and Passonneau and Litman (1997) using spontaneous speech. The latter displayed a higher level of consensus among coders. We also observed that listening to the original speech influenced the level of agreement among coders.
1 Introduction
The present study describes one of the initial tasks of a project currently being developed with the purpose of investigating how certain prosodic features are used to mark the information structure of spoken discourse and which cues are most relevant for the listeners to identify this structure. There have been no such studies concerning the Portuguese language so far. This project can thus contribute to a better understanding of the role of prosody in natural language, providing valuable information for computational linguistics. Additionally, it will enable us to compare Portuguese with other languages regarding macro-level prosody. Following the claims of several authors, we assume that there is a relationship between discourse structure and prosodic features. Crucially, our long term goal is to explain how exactly that relationship holds in European Portuguese. There have been some studies showing that prosody is constrained by discourse structure in several aspects, and this structure has been characterized in terms of paragraphs, discourse segments, topic units or theme shifts, all of which can be regarded as essentially the same type of discourse constituent. It has been stated that if we want to identify the role of prosody in the structuring of information, we must compare it with an independently obtained discourse structure, in order to minimize the risks of circularity [1–5]. Previous work on other languages has shown that there is no direct match between syntactic structure and prosodic constituency – see [6] and [7]. Instead, prosody seems to be constrained by semantic and pragmatic aspects. Therefore, we should not rely on syntax for that matter, which would otherwise be the most immediate choice.
In order to have some sort of information structure against which prosody can be confronted, some authors elicit instruction monologues, a method which yields speech with a discourse structure determined a priori [2–3; 5]. Others rely on discourse segmentations resulting from discourse analysis [8–17], whereas still others ask subjects to segment texts according to their idea of paragraph [4]. All these approaches thus assume that spoken discourse exhibits a structure somewhat similar to that of written texts, on what concerns the grouping of sentences into larger units like paragraphs, for instance. We opted for the second method, which has the advantage of making it possible to study different speech styles. This would be impossible if we were to follow the instruction monologues approach, since it generates a very specific kind of data. The problem with using the discourse analysis approach is that a priori we do not know whether it will yield more than an individual's intuition of discourse structure. If we are to depend on a discourse segmentation method, we must assure that we are employing one that is reproducible, because the more replicable a discourse segmentation model is, the stronger the evidence that discourse structure does exist. This paper reports an experiment we have conducted in order to compare two discourse segmentation models. We chose the models of Grosz and Sidner [9] and Passonneau and Litman [17]. These have been widely used and there is extensive research on them, which allows us to compare our results with those obtained in work done for other languages. Both models produce intention based segmentations. The difference is that while the former generates a hierarchical structure the latter generates only a linear kind of segmentation, and it actually comes very close to asking subjects to segment texts based on an intuitive notion of paragraph, which is actually a method that has been used by some authors (e.g. [4]) in order to elicit discourse structure. Grosz and Sidner's model, on the other hand, produces not only segmentation, but also a hierarchical organization among discourse units similar to that of chapters and subchapters (but the units involved are obviously much smaller). The experiment consisted in asking naïve coders to segment two texts following one of the two models and then measuring the level of consistency among annotators. In Examples 1 and 2 we present examples of segmentation produced with the two models, taken from two subjects' responses.

Example 1. Discourse segmentation produced with Grosz and Sidner's model (the I label precedes a segment's description; indentation denotes a segment's embeddedness)

I: Interview about the referendum
  I: The interviewer begins the interview
    I: The interviewer presents the interviewee
    Inês Serra Lopes, eh, directora do jornal independente,
    I: The interviewer asks the interviewee's opinion
    destes relatos que ouviste e agora das opiniões do Zé Manuel e do… e do Miguel Portas, há alguma coisa que… que te tenha chamado a atenção?
  I: The interviewee answers the question
    Não… O que me chama de facto mais a atenção, e eles já falaram sobre isso,
    I: The interviewee introduces the problem at hand
    é… o… problema da enorme distância entre a regionalização proposta e o envolvimento das pessoas nela.
Example 2. Examples of discourse segmentation produced with Passonneau and Litman's model (the I label precedes a segment's description)

I: Presentation of the newspaper's director
Inês Serra Lopes, eh, directora do jornal independente,
I: Contextualization and question
destes relatos que ouviste e agora das opiniões do Zé Manuel e do… e do Miguel Portas, há alguma coisa que… que te tenha chamado a atenção?
I: Answer
Não…
I: Tells what strikes her most
O que me chama de facto mais a atenção, e eles já falaram sobre isso, é… o… problema da enorme distância entre a regionalização proposta e o envolvimento das pessoas nela.
We will eventually choose one of these models for our future research on prosody. Such choice will be based on a test we carried out with the purpose of evaluating inter-coder agreement. Several works have correlated discourse structure with some prosodic variables. For instance, it appears that the relevant domain for F0 declination is a discourse segment [2]. In fact, some authors have discovered that segment initial utterances correlate with changes in pitch range [8] and display higher average and maximum F0, whereas segment final phrases present lower values for both maximum F0 and average F0 [11]; low F0 is associated with listeners' perception of both sentence and paragraph boundaries [26]. Low-ending contours seem to convey finality, and high-ending ones convey continuation [3, 4]. Pause has also been identified as a marker of discourse organization, coinciding with the boundaries of discourse segments [2, 3, 4, 8, 11, 26] or narrative boundaries [25], and the final word in the final utterance of these units also tends to be lengthened [5].
2 Method
The data used had previously been collected for REDIP [18], a project that aims at collecting and studying the language of the Portuguese media, dealing mostly with radio and TV broadcasts. One of the reasons we are using this corpus is that it contains a large amount of spontaneous speech. The importance of using spontaneous speech in this kind of work has to do with the fact that spontaneous discourse can be prosodically different from prepared or read speech. One of the applications of this type of work is to make speech technology more natural sounding and more efficient in recognizing natural speech. For this test, we have selected two excerpts from the corpus. Both were digitally recorded and have a total duration of 192 seconds, containing 504 words. They consist of interviews from the radio, involving both male and female speakers. Using dialogues in this kind of work is a novelty, and it will allow comparison to other speech styles. One of our concerns in choosing the dialogue samples was to make sure that they contained speech turns long enough for coders to identify more than one
discourse segment within each turn. That way we prevented our participants from placing discourse segment boundaries exclusively at turn boundaries, since our long term interest is in the prosodic means of signaling discourse structure and not in the prosodic strategies used to signal turn taking. We asked sixteen naïve coders to annotate these two transcripts using the previously mentioned models. The participants were split into two different groups according to the model they were asked to work with. They all received an orthographic transcription of the selected texts, but for each model only four of them listened to the original recordings. Since we hypothesize that there is a relation between discourse structure and prosody, we expected the listening and non-listening groups to display different behavior. Each participant received a set of instructions which were basically the explanatory texts of [15] and [17], translated with slight modifications. The most significant change we introduced was that people were not restricted to placing segment boundaries at previously determined prosodic phrase boundaries. They could place them between any two words in the text instead. We believe the results obtained this way are more independent of prosody.
3 Results
In order to measure inter-coder agreement, we employed the kappa coefficient (κ), which has recently been considered to be the most adequate for that purpose (see [19] and [20]). Kappa values under 0.6 indicate there is no statistical correlation among coders, whereas results over 0.7 point to replicable coder agreement. Although percent agreement might seem a valid statistic for this purpose, it actually overestimates inter-subject agreement, because it does not take into account the fact that from all the possible boundary sites only a few will be considered a discourse boundary (discourse segments identified by our subjects averaged 26 words in length, and a priori one expects a large number of possible boundary sites where no subject will place a segment boundary). Because of this, percent agreement will report a high consensus among annotators even if they all assign discourse segment boundaries to different places. The kappa coefficient, on the other hand, is not influenced by the fact that, to a large extent, subjects will agree due to the nature of the task, because it subtracts chance agreement from the observed agreement. We computed the pair-wise kappa coefficient between all the possible pairs of coders within the same group. This yielded a total of six pairs for each of the four groupings (four coders each), and twenty-eight pairs for each model (eight coders each). The coefficient is computed as follows (from [20]):
C = number of boundaries agreed upon by both subjects
D = number of boundaries assigned by subject A but not by subject B
I = number of boundaries assigned by subject B but not by subject A
N = number of non-boundaries agreed upon by both subjects

T = C + D + I + N
Po = (C + N) / T
Pc = ((C + D)(C + I) + (N + D)(N + I)) / T^2
κ = (Po − Pc) / (1 − Pc)
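As a worked illustration of these formulas (the counts in the example call are hypothetical, not taken from the study):

def kappa(C, D, I, N):
    """Pair-wise kappa from the boundary counts defined above."""
    T = C + D + I + N
    Po = (C + N) / T                                     # observed agreement
    Pc = ((C + D) * (C + I) + (N + D) * (N + I)) / T**2  # chance agreement
    return (Po - Pc) / (1 - Pc)

# Two coders agreeing on 10 boundaries and 485 non-boundaries, with
# 3 + 2 disagreements over 500 candidate sites:
print(round(kappa(10, 3, 2, 485), 2))  # 0.79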
(Percent agreement corresponds to Po in the above formulas, and Pc stands for chance agreement.) It should be noted that in order to compare these two models we had to discard the hierarchical information that Grosz and Sidner's framework supplies, since Litman and Passonneau's produces linear segmentation. Therefore, the results obtained pertain only to the location of discourse segment boundaries. As can be seen in the table below, our results show that Passonneau & Litman's discourse segmentation model produces higher inter-coder agreement values (average kappa = 0.73, min. = 0.58, max. = 0.92), outpacing those of Grosz and Sidner's (average kappa = 0.65, min. = 0.39, max. = 0.86) by almost ten points. This is a significant contrast in terms of reproducibility, with Grosz and Sidner's model below the 0.7 mark and Passonneau and Litman's above it. We also found that 67% of the pair-wise comparisons result in a kappa value of at least 0.7 (96% for the 0.6 threshold) for Passonneau and Litman's model, while this number is only 33% (respectively 79%) for Grosz and Sidner's. We also present percent agreement, for the sake of comparison with other studies.

Table 1. Observed coder agreement

                     Grosz & Sidner's Model   Passonneau & Litman's Model
                     avg.   min.   max.       avg.   min.   max.
kappa coefficient
  Listening          0.59   0.39   0.72       0.74   0.66   0.85
  Non-Listening      0.68   0.55   0.86       0.69   0.58   0.87
  Overall            0.65   0.39   0.86       0.73   0.58   0.92
percent agreement
  Listening          0.96   0.93   0.97       0.98   0.98   0.99
  Non-Listening      0.97   0.95   0.99       0.98   0.97   0.99
  Overall            0.96   0.93   0.99       0.98   0.97   0.996
We think that the poorer results of Grosz and Sidner's model might be ascribed to its inherent complexity. The fact that coders had to identify relations between segments caused higher variation among subjects. Listening to the speech recordings did influence the results, but not quite as we expected, considering that other studies report higher levels of agreement in the listening condition. Our findings show that coders using Grosz and Sidner's model agreed less when listening to the recordings. The different scores between the listening and the non-listening groups corroborate the hypothesis that discourse structure is reflected in prosody. In Litman and Passonneau's model the effect of hearing the speech shows up in a positive way, suggesting that prosody can make discourse structure more explicit. On the contrary, in Grosz and Sidner's model, access to prosodic information might have caused people to look for prosodic means of signaling hierarchy between segments, resulting in a more disparate segmentation. In fact, some authors comment that it has not been proven whether prosody can signal the embeddedness level of discourse segments [4]. It is important to remember that these results were obtained using spontaneous dialogues, whereas other authors have used monologues. The fact that Litman and Passonneau's model scored well demonstrates that it can be applied to dialogues and suggests that an identifiable discourse structure can be found not only in monologues but also in dialogues. We now compare our results with others reported in studies by other authors using Grosz and Sidner's model. [8] arrived at 95.1% consensus among subjects labeling from text alone and 87.7% agreement among subjects that segmented the texts while listening to the speech recordings; [11] presents several figures, according to whether the annotators listened to the sound files or not and whether the speech samples consisted of spontaneous or read speech (38% among subjects who did not listen to the recordings and worked with read speech; 75% among subjects who listened to the recordings and worked with read speech; 46% among subjects who did not listen to the recordings and worked with spontaneous speech; 78% among subjects who listened to the recordings and worked with spontaneous speech), but it should be noted that hierarchical relations among segments were included in the computation of the figures exhibited in [11] (which, at least in part, accounts for why our figures are much higher). The picture that seems to emerge from these data, and which is consistent with our findings, is that coders seem to agree less on boundary location when they have access to sound, but agree more on the hierarchical relations among segments when they label from both speech and text. Passonneau and Litman [17] report the following measures of inter-coder agreement: 0.74 recall, 0.55 precision, 0.09 fallout, 0.11 error (see [17] or [20] on how to compute these). They were calculated using the majority opinion as reference (i.e., assuming the majority is right). If a subject identifies all 'correct' boundaries, recall is 1; if a subject does not identify more boundaries than the ones that are 'correct', precision is 1; fallout measures how many non-boundaries were identified as boundaries; and error tells how deviant a subject's response was from the majority opinion (ideally, recall and precision should be 1, and fallout and error 0).
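The informal definitions above admit a straightforward set-based formalization. The sketch below is one plausible reading, not the exact formulas of [17] or [20]; the error measure in particular is an assumption.

def agreement_scores(subject, majority, n_sites):
    """Score one coder's boundary set against the majority opinion.
    `subject` and `majority` are sets of boundary positions out of
    `n_sites` candidate sites."""
    hits = len(subject & majority)
    recall = hits / len(majority)
    precision = hits / len(subject)
    false_alarms = len(subject - majority)
    fallout = false_alarms / (n_sites - len(majority))
    error = (false_alarms + len(majority - subject)) / n_sites
    return recall, precision, fallout, error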
We present our corresponding results (for Passonneau and Litman's model): 0.79 recall, 0.9 precision, 0 fallout, 0.01 error for the listening group; 0.82 recall, 0.84 precision, 0.01 fallout and error for the non-listening group; 0.81 recall, 0.87 precision, 0 fallout and 0.01 error overall (we used the segmentation resulting from the boundaries identified by at least 9
subjects out of 16 as reference). Our results show that listening to the speech files increased precision, but also caused a small decrease in recall (meaning that the listening group that used this method of segmentation divided our texts into fewer discourse units than the corresponding non-listening group). Our findings strongly suggest that dialogues do possess some sort of clear information structure. However, we must acknowledge that the consensus level obtained in this experiment (reported with percent agreement and recall, precision, etc.) is much higher than the one obtained in other experiments, to a large extent because we allowed subjects to segment texts between any two words, which greatly increased the number of possible boundary sites that were not classified as the boundary of a discourse unit by any subject. Nonetheless, we also present kappa values, which presumably are not affected by this.
4 Conclusions
The two models employed in this experiment use speaker intention as the criterion to segment discourse. When participants were instructed to segment discourse, they were also asked to provide a description of the intentions underlying each segment. We want to use that information in a future analysis to check whether different segmentations were caused by discourse ambiguity, which may lead to different results. The experiment described in this paper was only a preliminary study to enable us to choose a discourse segmentation method for our work on the relationship between discourse and prosody in European Portuguese. Testing the two models is not the end goal of our project, but simply a preliminary experiment; for that reason we did not work with a large corpus, though we are aware that a larger one could have produced a different picture. The results observed so far lead us to choose Passonneau and Litman's model for our future research. As was shown, this method displayed a fair level of inter-coder consensus, well above Grosz and Sidner's. If the level of agreement obtained proves not to be satisfactory for the purpose of our research, we may adapt the chosen model so that it produces results further above the 0.7 mark.
References

1. Swerts, M. and R. Collier: On the Controlled Elicitation of Spontaneous Speech. Speech Communication 11 (4–5) (1992) 463–468
2. Swerts, M. and R. Geluykens: The Prosody of Information Units in Spontaneous Monologue. Phonetica 50 (1993) 189–196
3. Swerts, M. and R. Geluykens: Prosody as a Marker of Information Flow in Spoken Discourse. Language and Speech 37 (1) (1994) 21–43
4. Swerts, M.: Prosodic Features at Discourse Boundaries of Different Strength. Journal of the Acoustical Society of America 101 (1) (1997) 514–521
5. Swerts, M., R. Collier and J. Terken: Prosodic Predictors of Discourse Finality in Spontaneous Monologues. Speech Communication 15 (1994) 79–90
6. Cutler, A., D. Dahan and W. Donselaar: Prosody in the Comprehension of Spoken Language: A Literature Review. Language and Speech 40 (2) (1997) 141–201
7. Pijper, J.R. and A.A. Sanderman: On the Perceptual Strength of Prosodic Boundaries and its Relation to Suprasegmental Cues. Journal of the Acoustical Society of America 96 (4) (1994) 2037–2047
8. Grosz, B. and J. Hirschberg: Some Intentional Characteristics of Discourse Structure. Proceedings of the International Conference on Spoken Language Processing (1992) 429–432
9. Grosz, B.J. and C.L. Sidner: Attention, Intention and the Structure of Discourse. Computational Linguistics 12 (3) (1986) 175–204
10. Hirschberg, J. and B. Grosz: Intonational Features of Local and Global Discourse Structure. Proceedings of the Workshop on Spoken Language Systems (1992) 441–446
11. Hirschberg, J., C.H. Nakatani and B.J. Grosz: Conveying Discourse Structure through Intonation Variation. Proceedings of the ESCA Workshop on Spoken Dialogue Systems: Theories and Applications, Vigsø, Denmark, ESCA (1995)
12. Litman, D.J. and R. Passonneau: Empirical Evidence for Intention-Based Discourse Segmentation. Proceedings of the ACL Workshop on Intentionality and Structure in Discourse Relations (1993)
13. Litman, D.J. and R. Passonneau: Combining Multiple Knowledge Sources for Discourse Segmentation. Proceedings of the 33rd ACL (1995) 108–115
14. Nakatani, C.H., B.J. Grosz and J. Hirschberg: Discourse Structure in Spoken Language: Studies on Speech Corpora. Proceedings of the AAAI Symposium Series: Empirical Methods in Discourse Interpretation and Generation (1995)
15. Nakatani, C.H., B.J. Grosz, D.D. Ahn and J. Hirschberg: Instructions for Annotating Discourses. Technical Report TR-21-95, Center for Research in Computing Technology, Harvard University, Cambridge, MA (1995)
16. Passonneau, R.J. and D.J. Litman: Intention-Based Segmentation: Human Reliability and Correlation with Linguistic Cues. Proceedings of the ACL (1993)
17. Passonneau, R.J. and D.J. Litman: Discourse Segmentation by Human and Automated Means. Computational Linguistics (1997)
18. Ramilo, M.C. and T. Freitas: A Linguística e a Linguagem dos Média em Portugal: descrição do Projecto REDIP. Paper presented at the XIII International Congress of ALFAL, San José, Costa Rica (2002)
19. Carletta, J.: Assessing Agreement on Classification Tasks: The Kappa Statistic. Computational Linguistics 22 (2) (1996) 249–254
20. Flammia, G.: Discourse Segmentation of Spoken Dialogue: An Empirical Approach. Ph.D. thesis, MIT (1998)
21. Beckman, M.E.: A Typology of Spontaneous Speech. In: Y. Sagisaka, N. Campbell and N. Higuchi (eds.): Computing Prosody: Computational Models for Processing Spontaneous Speech. Springer, New York (1997) 7–26
22. Collier, R.: On the Communicative Function of Prosody: Some Experiments. IPO Annual Progress Report 28 (1993) 67–75
23. Oliveira, M.: Pausing Strategies as Means of Information Processing in Spontaneous Narratives. In: B. Bel and I. Marlien (eds.): Proceedings of the 1st International Conference on Speech Prosody, Aix-en-Provence, France (2002) 539–542
24. Oliveira, M.: Prosodic Features in Spontaneous Narratives. Ph.D. thesis, Simon Fraser University (2000)
25. Oliveira, M.: The Role of Pause Occurrence and Pause Duration in the Signalling of Narrative Structure. In: E. Ranchhod and N. Mamede (eds.): Advances in Natural Language Processing. Third International Conference, PorTAL 2002, Faro, Portugal (2002) 43–51
26. Lehiste, I.: Some Phonetic Characteristics of Discourse. Studia Linguistica 36 (2) (1982)
Reusability of Dictionaries in the Compilation of NLP Lexicons∗

Bento C. Dias-da-Silva, Mirna F. de Oliveira, and Helio R. de Moraes

Faculdade de Ciências e Letras, Universidade Estadual Paulista
Rodovia Araraquara-Jau Km 1, 14800-901 Araraquara, São Paulo, Brazil
[email protected], [email protected], [email protected]
Abstract: This paper discusses particular linguistic challenges in the task of reusing published dictionaries, conceived as structured sources of lexical information, in the compilation process of a machine-tractable thesaurus-like lexical database for Brazilian Portuguese. After delimiting the scope of the polysemous term thesaurus, the paper focuses on the improvement of the resulting object by a small team, in a form compatible with and inspired by WordNet guidelines, comments on the dictionary entries, addresses selected problems found in the process of extracting the relevant lexical information from the selected dictionaries, and provides some strategies to overcome them.
1 Introduction
In their most ordinary use, published dictionaries are restricted to supplying the general public with the "correct" spelling and the "attested" senses of unknown words. For Human Language Technology researchers, however, published dictionaries are an important resource to mine for a considerable amount of lexical information of different sorts. It must be recognized that most of them offer much more information than just spelling and word sense records. They are "fruits of the cumulative wisdom of generations of lexicographers", and "the sheer breadth of coverage makes them indispensable" for natural language processing [11, page 365]. Dictionary entries, in fact, specify not only etymological, phonological, syntactic, definitional, collocational, variational, and register information about words, but sense relations such as synonymy and antonymy as well. It is also a fact that lexicographers are aware that compiling dictionary entries involves making very hard decisions about how to deal with polysemy and homonymy. In other words, they have to decide whether to lump or split word senses, or whether to create fresh new entries for the same word form. Such decisions, however, are arbitrary, for lexicographers draw on their own personal experience and expertise to make them; and probably that is the only way they manage to compile their unique store of words. Thus, reusing lexicographical information requires caution.∗
∗ This research is sponsored by CNPq and FAPESP-São Paulo, Brazil.
As a rule, if we want to use dictionary lexicographical information in natural language processing projects, it must be mined and filtered carefully.¹ Accordingly, the purpose of this paper is to discuss real decision problems we had to face during the task of extracting and inferring the explicit or implicit synonymy and antonymy relations from five Brazilian Portuguese published dictionaries, our reference corpus (henceforth RC), in the compilation process of a machine-tractable Thesaurus-like Lexical Database for Brazilian Portuguese, henceforth TeP (see [6] for a complete description of the database itself). The TeP was developed in a two-year span (2000–2001) by a small team of four linguists and a computer scientist. Resorting to Dias-da-Silva's [5] methodology for developing natural language processing projects, the team split the task into three complementary phases: Linguistic, Representational, and Computational. This paper focuses on the discussion of selected problems that emerged from a specific task that was part of the Linguistic domain: the extraction of lexical information from the RC. In the next sections and subsections, we delimit the scope of the term thesaurus, present the RC and comment on the key features of the published dictionary entries that make it up, describe the mining procedure, address selected problems we encountered in the process of extracting the relevant lexical information from the RC, and, when possible, provide strategies to overcome them. Kilgarriff's classification scheme of the word sense distinctions the lexicographer attempts to capture will serve us well in our discussion (see [11] for details).²
2 Preliminaries

2.1 The Thesaurus Denotations

Instead of searching for an answer to Kilgarriff's query "What's in a Thesaurus?" (see [13] for the relevant discussion), we list below the denotations the term thesaurus has in Brazilian Portuguese (henceforth BP), and single out the one we had in mind when we embarked on the compilation of the TeP: Object 6. Dias-da-Silva, Oliveira, Moraes [7, page 190] surveyed six different types of objects that are referred to by the term thesaurus:
1. An inventory of the vocabulary items in use in a particular language (Object 1);
2. A thematically based dictionary, that is, an onomasiologic dictionary (Object 2);
3. A dictionary containing a store of synonyms and antonyms (Object 3);
4. An index to information stored in a computer, consisting of a comprehensive list of subjects concerning which information may be retrieved by using the proper key terms (Object 4);
5. A file containing a store of synonyms that are displayed to the user during the automatic proofreading process (Object 5);
6. A dictionary of synonyms and antonyms stored in memory for use in word processing (Object 6).

¹ Acquiring such information is a hard problem and has usually been approached by reusing, merging, and tuning existing lexical material. This initiative has been frequently reported in the literature (see [11, 12] and the papers cited therein).
² The authors gratefully thank the anonymous reviewers for their comments and suggestions.
2.2 The Reference Corpus
The compilation of a dictionary is a time-consuming activity and requires a team of more than fifty lexicographers, each responsible for (i) selecting the headwords which will head the dictionary entries, (ii) defining the number of senses for each headword, and (iii) exemplifying the senses with sentences and expressions from their corpora. The advent of computers has allowed lexicographers to use machine-readable large-scale corpora in their work, establishing procedures as follows [12]: (a) gather concordances from the corpus; (b) cluster the concordances around nuclear sense clusters; (c) lump or split nuclear clusters; (d) encode the relevant lexical information by means of the highly constrained language of dictionary definitions. Given our small team, and the two-year time stipulated for the project, we bypassed those procedures and decided to reuse the five published dictionaries, which were chosen for the following reasons: (i) their being "fruits of the cumulative wisdom of generations of lexicographers", and their "sheer breadth of coverage"; (ii) the relevant sense relations one of the five dictionaries registers can be complemented by similar pieces of information found in the other four dictionaries; (iii) instead of using the Aristotelian analytical definition (i.e., genus and differentiae) to define word senses, they extensively use synonym and antonym word forms in their defining procedure, a feature that sped up the process of collecting lots of synonym and antonym word forms. Two of them [10, 15] are the most traditional and bulkier BP dictionaries; their electronic versions sped up the process of synonym and antonym mining. Barbosa [1] and Fernandes [9] are specific dictionaries of synonyms and antonyms. The fifth dictionary is a dictionary of verbs [2] that uses Chafe's semantic classification of verbs [3]. For each verb headword, the dictionary registers the relevant Chafe categories ("state", "action", "process", and "action-process"), its sense definitions and/or synonyms, its grammatical features, its potential argument structures, its selectional restrictions, and sample sentences extracted from corpora.
2.3 The TeP
The RC, the Thesaurus Editor, i.e., the graphical authoring tool created to feed and manage the TeP (see [6] for details), and the strategy of "mining" lexical information from published dictionaries we present in this paper made it possible to compile 44678 word forms that are distributed throughout 19868 synonym sets [7].
3 The "Mining" Strategy and Pitfalls

3.1 "Mining"
First, it is necessary to define the synonymy concept we adopted. The TeP compilers had to agree upon a specific notion of synonymy throughout the compilation process so as to assure the consistency of the synonym sets. Considering that absolute synonyms are rare in language, if they exist at all, Cruse's [4, page 88] synonymy definition was adopted: "X is a cognitive synonym of Y if (i) X and Y are syntactically identical, and (ii) any grammatical declarative sentence S containing X has equivalent truth conditions to another sentence S1, which is identical to S, except that X is replaced by Y." The best way to understand how the compilers "mined" the RC for synonyms is to follow a real example. Let us take, as the starting point of the process, the BP verb lembrar (English: "to remember"). Weiszflog [15] distinguishes seven senses. After collecting the synonyms, and disregarding their definitions, the following synonym sets can be compiled:
1. {lembrar, recordar} (English: {"to remember", "to recall"})
2. {lembrar, advertir, notar} (English: {"to remember", "to warn", "to notify"})
3. {lembrar, sugerir} (English: {"to suggest", "to evoke", "to hint"})
4. {lembrar, recomendar} (English: {"to remember", "to commend"})
After that preliminary analysis, the linguist checks the consistency of the four synonym sets by looking up the dictionary synonym entries for the remaining five verbs: recordar, advertir, notar, sugerir, and recomendar. Accordingly, the linguist, for example, looks up the dictionary entry for the verb recordar. Its first sense is given by the paraphrase trazer à memória (English: "to call back to memory"), and its fourth sense by the synonym lembrar. As these two senses are very close, and the examples confirm the similarity between the two, synonym set 1 is said to be consistent. The very same process is repeated for every verb listed above until the list is exhausted. The analytical cycle then begins again by collecting the synonyms from the next dictionary entry in alphabetical order. It should be pointed out that, when the linguist analyzes the verb esquecer (English: "to forget"), the canonical BP antonym for lembrar, he finds only one synonym for it, the verb olvidar (Vulgar Latin: "oblitare"; English: "to efface"), so, after the consistency analysis, the following synonym set is compiled:
5. {esquecer, olvidar}
The dictionary also registers this antonymy indirectly: lembrar and esquecer are defined by means of the paraphrases trazer à memória and perder a memória de (English: "to stop remembering"), respectively. Thus, the information checked through cross-reference of entries confirms the antonymic pair (lembrar, esquecer), which stresses the importance of examining paraphrases carefully. Just for the record, synonym set 1 and its antonym synonym sets are transcribed below:
6. {amentar2, comemorar, ementar, escordar1, lembrar, memorar, reconstituir, recordar, relembrar, rememorar, rever1, revisitar, reviver, revivescer, ver}
7. {deslembrar, desmemoriar, esquecer, olvidar}
In the next section, some real problems are presented. The examples are occurrences of specific kinds of problems, and reveal the necessity of data checking during the reuse process.
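The cross-reference cycle just described lends itself to partial automation. The sketch below is only illustrative and uses a toy, hand-built stand-in for the RC (the real entries, of course, come from the five published dictionaries); it collects one candidate synonym set per sense and flags the sets whose members do not cross-reference each other, which are exactly the ones the linguist must inspect.

# 'entries' maps a headword to the synonym lists of its senses (toy data).
entries = {
    "lembrar":  [["recordar"], ["advertir", "notar"], ["sugerir"], ["recomendar"]],
    "recordar": [["lembrar"]],
    "esquecer": [["olvidar"]],
    "olvidar":  [["esquecer"]],
}

def candidate_synsets(headword):
    # One candidate synonym set per sense of the headword.
    return [{headword, *syns} for syns in entries.get(headword, [])]

def cross_references(synset):
    # Every member should list some other member as a synonym in at
    # least one of its own senses; otherwise a linguist must decide.
    for word in synset:
        listed = {s for sense in entries.get(word, []) for s in sense}
        others = synset - {word}
        if others and not (others & listed):
            return False
    return True

for synset in candidate_synsets("lembrar"):
    status = "consistent" if cross_references(synset) else "needs manual checking"
    print(sorted(synset), "-", status)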
3.2 Pitfalls
At the heart of the task of compiling dictionaries for the general public is the specification of word sense distinctions. Kilgarriff [11, pages 372–374], on analyzing the LDOCE entries [14], proposed a classification scheme for categorizing widespread word sense distinctions made by lexicographers: "Generalizing Metaphors", i.e., a sense that is the generalization of a specific sense, e.g. martelar ("to hammer") - sense 1: "to hit with a hammer", sense 2: "to insist"; "Must-be-theres", i.e., one sense is a logical consequence of the other sense, e.g. casar ("to marry") - sense 1: "to unite by marriage", sense 2: "to ally"; "Domain Shift", i.e., one sense is the extension of the application of the other sense to another situation, e.g. leve ("light") - sense 1: "not heavy, with little weight", sense 2: "nimble, agile"; "Natural and social kinds", i.e., owing to a non-linguistic fact, the entities or situations identified by the different word senses have distinct denotata, and although the denotata have many attributes in common, they will always remain classes of things, e.g. asa ("wing") - sense 1: "feathered bird's member used to fly", sense 2: "one of the horizontal airfoils on either side of the fuselage of an airplane". This typology aided the TeP team of linguists both in (i) identifying the kind of distinctions the lexicographers had in mind when they took their decisions during the compilation of their dictionaries, and in (ii) avoiding carrying over published dictionary flaws to the TeP.

3.2.1 Three Classes of Problems

In the process of extracting information from the dictionary entries, three categories of problems first identified by Kilgarriff [11] were detected: (a) "necessity"; (b) "consistency"; (c) "centrality". Compiling the TeP synonym sets required reflection on whether a particular semantic feature or grammatical specification was a "necessary" feature for a lexical item in a particular sense. Checking the "consistency" of RC entries implied observing symmetry, an important characteristic of synonymy which, in general, was not observed in the RC: this relation establishes that if A is a synonym for B, B is necessarily a synonym for A. The issue of "centrality" focused on the variation of a particular sense, i.e., how wide the sense variation is before a second sense should be posited instead of only one. With respect to the compilation of the TeP, this problem was pervasive and hard to solve because synonymy is not a transitive relation: if A is a synonym for B, and B is a synonym for C, C is not necessarily a synonym for A [8].

3.2.2 Selected Strategies

As the majority of the problems dealt with in this section are tokens of specific types, one example of each will be presented using (a) the sense distinctions, and (b) the problem types, sketched in the previous sections. As the structure of the lexicon is complex, (a) and (b) alone may not be enough to solve the problems. Although the linguists focused on the specification of synonymy and antonymy, they had to be aware of logical-conceptual relations such as hyponymy, for lexicographers often treat superordinate terms (hypernyms) as synonyms. The first problem to be addressed is the "Generalizing Metaphors". The BP verbs acarar, encarar, arrostar ("to stare") mean ficar face a face ("to be face to face with"), and they also mean enfrentar ("to face"). At first glance, one is tempted to merge the verbs into the same synonym set: {acarar, encarar, arrostar, confrontar, enfrentar}. This sense lumping is mistaken though: although acarar may denote a less specific sense than the other members of its original set, a TeP user would not be able to identify its most specific sense. This example demonstrates how useful the identification of generalizing metaphors is in the resolution of meaning centrality problems.
The cases related to generalizing metaphors which generate two synonym sets with common elements can be easily solved by the insertion of glosses for each sense, a future work. Splitting into two similar senses was adopted: {acarar, encarar, arrostar} (English: {"to stare", "to gaze"}) and {acarar, encarar, arrostar, confrontar, enfrentar} (English: {"to face", "to confront"}). The second category of problems has to do with the "Must-be-theres". Borba [2] distinguishes only one sense for the verb visualizar ("to visualize"): perceber pela visão, conceber (sem ver) uma imagem mental de ("to perceive through vision; to conceive, without seeing, a mental image of"). The first part of the definition (perceber pela visão, "to perceive through vision") is clearly a paraphrase of ver ("to see"), as confirmed by the example Assustei-me ao visualizar à minha frente a imagem de dois homens de clã ("I got scared when I visualized the image of two clansmen before me"): here we can replace visualizar by ver without any change to the sentence sense. But if visualizar is replaced by imaginar ("to imagine"), a synonym of the second part of the definition ("to conceive, without seeing, a mental image of"), illustrated with the sentence podemos talvez alimentar a esperança de visualizar/imaginar todas as novas dimensões da realidade ("perhaps we can hope to visualize/imagine all the new dimensions of reality"), a different sense can be distinguished. Borba [2] precisely identified both senses and illustrated them with clear examples, but did not split the two senses over two different definitions. Maybe the lumping together of the two senses results from the lexicographer's personal judgment that the first sense ("to perceive through vision") is predictable from the sense "vision", explicit in the verb stem. The other RC dictionaries present only the sense "to imagine". Once the occurrence of two distinct senses was clearly identified, two different synonym sets were inserted in the TeP: {ver, visualizar, enxergar, ...} (English: {"to see"}) and {ver, visualizar, imaginar} (English: {"to visualize", "to envision", "to project", "to fancy", "to see", "to figure", ...}). The third problem has to do with "necessity" and is a "Domain Shift", illustrated by exalar ("to exhale"), which is defined as emitir ou lançar de si emanações odoríficas ou fétidas ("to give off odoriferous or fetid emanations"). According to this definition, the verb exalar should be inserted in two different synonym sets related to each other by antonymy: {exalar, feder, catingar} (English: {"to stink", "to reek"}) and {exalar, recender}, i.e., exalar cheiro bom ("to exhale a good smell") – an inconsistent pair. To solve the problem, we point out that exalar needs a specific complement to define its sense. Something similar occurs with the verb cheirar ("to smell"): compare O cadáver já está cheirando ("The corpse is already smelling") with O assado já está cheirando ("The roast is already smelling"). As a solution, an underspecified synonym set was inserted into the TeP: {cheirar, exalar, trescalar, ...} (English: {"to exhale", "to give forth", "to emanate"}), with the sense of "to exhale a strong (either good or bad) smell". The fourth problem has to do with "centrality" and "consistency". Borba [2] considers the verbs urgir, forçar, obrigar, impelir ("urge", "force", "obligate", "exhort") synonyms because they are interchangeable in the following context: Urgiam-nos de todos os lados para que caminhássemos ("They urged us in all possible manners for us to walk").
Weiszflog [15] also registers the same lexical items as synonyms, but exemplifies them with an example whose sense is specified by the verbs empurrar ("to push", "to force") and compelir ("to impel", "to force"). The information checking process (see 3.1), though, showed that the synonym set {urgir, compelir, forçar, obrigar, impelir, ...} could be created, even though the dictionaries did not register empurrar ("to push") with this, or with any other sense of urgir ("to urge") whatsoever. Thus, although Weiszflog discriminated two different senses, the compilers agreed to establish only one. Two kinds of problems can be illustrated by this example: (i) "centrality", because the central question is whether empurrar ("to push") should be inserted in that synonym set; (ii) "consistency", because Weiszflog established two senses where the compilers expected only one. In this case, where the dictionaries registered two different senses while the compilers identified only one, only one sense was inserted: {urgir, compelir, forçar, obrigar, impelir, ...}. The lexical item empurrar ("to push") was not inserted in the synonym set, for no relevant contextual occurrence was found in the RC.
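The "consistency" and "centrality" problems above can also be screened mechanically before the compilers intervene. The fragment below is a hedged sketch on invented data: it lists the asymmetric synonym records (A lists B but B does not list A) and shows why naively taking the transitive closure of the recorded pairs is unsafe, given that synonymy is not transitive [8].

# Directed synonym records as toy dictionaries might register them.
recorded = {("A", "B"), ("B", "A"), ("B", "C")}   # C's entry omits B

def asymmetric_pairs(pairs):
    # 'Consistency': if A lists B, B should list A.
    return {(a, b) for (a, b) in pairs if (b, a) not in pairs}

def transitive_closure(pairs):
    # What blind chaining would add; each added pair needs human validation.
    closure, changed = set(pairs), True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and a != d and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

print(asymmetric_pairs(recorded))                  # {('B', 'C')}: re-check C's entry
print(("A", "C") in transitive_closure(recorded))  # True, yet A~C may be spurious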
4 Final Remarks
As discussed in the introduction, the reuse of published dictionaries for Human Language Technology purposes is a very productive work strategy. As the paper showed, however, care must be taken not to carry over their flaws into machine-tractable lexicons. A sample of inconsistencies from BP dictionaries was presented, and some ways to overcome them were sketched. Despite their imperfections, the dictionaries we selected as our RC proved to be valuable resources of lexical-semantic information. Thanks to them, and to the systematic "mining" process and filtering strategies, the TeP, with its 20000 synonym sets, can be refined and upgraded into the Wordnet.Br. Accordingly, further steps will involve the specification of glosses for each sense, of example sentences and expressions for each word form, and of the logical-conceptual relations of meronymy/holonymy and hyponymy/hypernymy.
References

1. Barbosa, O.: Grande Dicionário de Sinônimos e Antônimos. Ediouro, Rio de Janeiro (1999)
2. Borba, F.S. (coord.): Dicionário Gramatical de Verbos do Português Contemporâneo do Brasil. Editora da Unesp, São Paulo (1990)
3. Chafe, W.: Meaning and the Structure of Language. The University of Chicago Press, Chicago (1970)
4. Cruse, D.A.: Lexical Semantics. Cambridge University Press, New York (1986)
5. Dias-da-Silva, B.C.: Bridging the Gap between Linguistic Theory and Natural Language Processing. In: Caron, B. (ed.): 16th International Congress of Linguists. Pergamon-Elsevier Science, Oxford (1998) 10 p.
6. Dias-da-Silva, B.C., Oliveira, M.F., Hasegawa, R., Moraes, H.R., Amorim, D., Paschoalino, C., Nascimento, A.C.: A construção de um thesaurus eletrônico para o português do Brasil. In: Proceedings of the 5th PROPOR – Encontro para o Processamento Computacional da Língua Portuguesa Escrita e Falada, Atibaia, Brazil (2000) 01–10
7. Dias-da-Silva, B.C., Oliveira, M.F., Moraes, H.R.: Groundwork for the Development of the Brazilian Portuguese Wordnet. In: Ranchhod, E.M., Mamede, N.J. (eds.): Advances in Natural Language Processing. Springer-Verlag, Berlin (2002) 189–196
8. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, Mass. (1998)
9. Fernandes, F.: Dicionário de Sinônimos e Antônimos da Língua Portuguesa. Globo, São Paulo (1997)
10. Ferreira, A.B.H.: Dicionário Aurélio Eletrônico Século XXI (versão 3.0). Lexicon, São Paulo (1999)
11. Kilgarriff, A.: Dictionary Word Sense Distinctions: An Enquiry into Their Nature. Computers and the Humanities 26 (1993) 365–387
12. Kilgarriff, A.: I Don't Believe in Word Senses. Computers and the Humanities 31 (1997) 91–113
13. Kilgarriff, A., Yallop, C.: What's in a Thesaurus? In: Proceedings of the 2nd Conference on Language Resources and Evaluation, Athens, Greece (2000) 8 p.
14. Summers, D. (ed.): Longman Dictionary of Contemporary English. Longman, Essex (1995)
15. Weiszflog, W. (ed.): Michaelis Português – Moderno Dicionário da Língua Portuguesa (versão 1.1). DTS Software Brasil Ltda, São Paulo (1998)
Homonymy in Natural Language Processes: A Representation Using Pustejovsky's Qualia Structure and Ontological Information∗

Claudia Zavaglia¹ and Juliana Galvani Greghi²

¹ Universidade Estadual Paulista, UNESP/IBILCE, São José do Rio Preto, SP, Brazil
[email protected]
² Núcleo Interinstitucional de Lingüística Computacional NILC, USP/São Carlos, Brazil
[email protected]
Abstract. This paper presents a proposal for the semantic treatment of ambiguous homographic forms in Brazilian Portuguese, and offers linguistic strategies for its computational implementation in Systems of Natural Language Processing (SNLP). Pustejovsky's Generative Lexicon was used as the theoretical model. From this model, the Qualia Structure – QS (with its Formal, Telic, Agentive and Constitutive roles) was selected as one of the linguistic and semantic expedients for achieving the disambiguation of homonym forms. So that the analyzed and treated data could be manipulated, we elaborated a Lexical Knowledge Base (LKB) in which lexical items are correlated and interconnected by different kinds of semantic relations in the QS and by ontological information.
1 Introduction
The objective of this paper is to give researchers in computational linguistics, specialists in computational implementation, computational lexicographers – in short, all those involved in sciences that work with and are interested in Natural Language Processing (NLP) – procedures and strategies of a linguistic nature to be used in the elaboration of lexical repertoires for computational treatment and in the construction of Linguistic Resources for Brazilian Portuguese. Our attention is turned to one of the linguistic phenomena present in natural language that becomes a real obstacle to the efficient elaboration and treatment of this type of lexicon: the ambiguity brought about by homonyms. Taking this as the starting point, we propose and suggest a type of computational-linguistic treatment for homonyms in Brazilian Portuguese, specifically for homographs. For the proposed lexical structure, one of the aspects of James Pustejovsky's [1] Generative Lexicon (GL) model was used as the theoretical framework, namely the Qualia Structure with its Formal, Constitutive, Telic and Agentive roles. Based on these aspects we suggest a structural-semantic approach for the homographic forms studied; furthermore, we suggest the use of an ontology of concepts to categorize these forms. By suggesting these tactics for the description of homographic items, our goal is to provide resources to recover the amplitude and multiplicity of meanings, with a view to the disambiguation of the meanings contained in each one of these forms.

∗ Work partially financed by the CNPq – Conselho Nacional de Desenvolvimento Científico e Tecnológico, Brazil.
2 The Qualia Structure
To Pustejovsky [1], different word meanings are associated with distinct lexical items. In his decompositional view, lexical items are minimally decomposed into templates of set features. Thus, the emergence of a generative structure to compose lexical meanings is possible, defining the format of the conditions for a semantic expression of language. The same author proposes a new path for the decompositional view, focused on the generative or compositional aspect of semantics rather than on decomposition into a fixed number of primitives. In this way, a generative lexicon is characterized as a computational system that involves at least four levels of representation: (1) Argument Structure, which specifies the number and type of logical arguments and how they are syntactically expressed; (2) Event Structure, defining the event type of a lexical item and a sentence, and including event types such as STATE, PROCESS and TRANSITION that may have a subevent structure; (3) Qualia Structure, which includes modes of explanation distributed among four roles, FORMAL, CONSTITUTIVE, TELIC, and AGENTIVE; (4) Lexical Inheritance Structure, identifying how a lexical structure is related to other structures and its contribution to the global organization of the lexicon. The Qualia Structure specifies four essential roles of a word meaning (or Qualia): (i) Constitutive, i.e., that which expresses the relation between an object and its constitutive parts; (ii) Formal, that which distinguishes the object within a larger domain; (iii) Telic, that which expresses the objective/scope and function of the object; (iv) Agentive, i.e., that which considers the factors involved in the origin of the object. The Qualia Structure is, in fact, much closer to the structural description of a sentence in a syntactic analysis, inasmuch as it admits something like the transformational operations used to capture or retrieve both the polymorphic behavior and the meaning of a lexical item in the phenomenon of novel word creation. For Pustejovsky, Qualia is, in every way, like a set of property events associated with the lexical item that best explains what that word means. For example, to understand what lexical items like cookie and beer mean, one should recognize that they are, respectively, a type of food and a type of drink. While cookie is a term that describes a specific type of object in the world, the expression "foodstuff" denotes a functional reference of what is "done with" something, i.e., how this same thing is used. In this case, the term is partly defined by the fact that food is something to be eaten. Similar observations can be made for beer. The Telic quale for the noun food encodes the functional aspect of the meaning, represented as [TELIC = to eat]. In the same way, the distinction between semantically related nouns such as novel and dictionary is derived from "what is done with" these objects, which is different. Thus, although these two objects may both be "books" in the general sense, the use made of each one of them is different: while a "novel" is for "reading", a "dictionary" is for "consultation". Consequently, the Qualia values encode the functional information for "novel" and "dictionary" in a distinct form: [TELIC = to read] for "novel" and [TELIC = to consult] for "dictionary". Obviously, the distinction between these two objects is not made only by means of these different roles in the Qualia telic structure. The type of textual structure of each one of them is recovered in the Constitutive role of the Qualia Structure. Whereas a "novel" is characterized as a narrative or story, a "dictionary" is defined as a list of words. Thus, we have the representation [CONST = narrative] for "novel" and [CONST = list of words] for "dictionary" (CONST abbreviates the Constitutive role). These two objects are characterized in an identical form in the Formal role: [FORMAL = book] for "novel" and [FORMAL = book] for "dictionary". On the other hand, they also differ in the Agentive role of the Qualia Structure, that is, in how their "existence" came about: while a "novel" is written, a "dictionary" is compiled, that is, organized: [AGENT = written] for "novel" and [AGENT = organized] for "dictionary" (AGENT abbreviates the Agentive role).
3 The Linguistic Phenomenon of Homonymy
By homonymy we understand a linguistic phenomenon that registers the identity of two words at the expression level, that is, perfectly identical forms distinguished semantically (one signifier for two meanings, at the content level), or the identity of two grammatical constructions that leads to ambiguity. The first case is lexical homonymy and the second structural homonymy. Our specific interest in this paper is lexical homonymy, as defined in detail by Zavaglia [3]: lexical homonyms possess equal graphic or phonetic forms. In the first case, the words retain their graphic identities (homographs) and, in the second, their sound identities (homophones). Thus, we have homographic words that: (i) have distinct meanings and are either grammatically or orally identical, which is then called Semantic Homonymy, as in: banco1: "object made for sitting" X banco2: "place where we make money deposits"; ponto1: "portion of space designated with precision" X ponto2: "degree determined on a scale of values" X ponto3: "each part of a speech, text, or a list of topics of a program" X ponto4: "every extension of the wire between two holes made by needles" [4]; importar1: "bring something from a foreign country" X importar2: "to be significant, to amount to"; (ii) are distinct because they belong to diverse grammatical classes and are identical in pronunciation, in this case called Categorial Homonymy, as, for example: abandono1 (noun) X abandono2 (verb); ameaça1 (noun) X ameaça2 (verb); (iii) are distinct in their etyma and phonetically and graphically identical, in this case named Etymological Homonymy, as, for example: manga1: "fruit" [from Malayalam manga] X manga2: "part of clothing" [from Lat. manica, 'tunic sleeve']; (iv) are phonetically distinct, and are thus named Heterophonous Homonymy (forms possessing identical spelling but different pronunciation), where the noun's vowel "e" is phonetically realized as [e] and the verb's as [ε], as in the following examples: sossego1 (noun) X sossego2 (verb); aperto1 (noun) X aperto2 (verb).
4 Homonymy in Qualia Structure
In homonymous forms, the Qualia Structure plays a decisive part in the verification and distinction of meanings. Let us look at an example of representation based on Homonymous Single-Category Monosemous Forms – HSMF, that is, homographic forms of identical grammatical category, each of which contains only one sense, as with "banco": {banco$0_1 CONST = furniture; FORMAL = object; TELIC = to sit; AGENT = material} and {banco$0_2 CONST = company; FORMAL = institution; TELIC = to negotiate; AGENT = place}.
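For concreteness, the HSMF entries above can be rendered in a few lines of Python. This is only an illustrative sketch with hypothetical names, not the actual LKB implementation described in Section 7:

from dataclasses import dataclass

@dataclass(frozen=True)
class Qualia:
    constitutive: str   # relation between the object and its parts
    formal: str         # what distinguishes it within a larger domain
    telic: str          # objective/scope and function
    agentive: str       # factors involved in the origin of the object

units = {
    "banco$0_1": Qualia("furniture", "object", "to sit", "material"),
    "banco$0_2": Qualia("company", "institution", "to negotiate", "place"),
}

def disambiguate(prefix, role, value):
    # Select the homonymous unit whose given qualia role matches a context cue.
    return [u for u, q in units.items()
            if u.startswith(prefix) and getattr(q, role) == value]

print(disambiguate("banco", "telic", "to negotiate"))  # ['banco$0_2']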
5 Ontology of Concepts for Brazilian Portuguese
For Gruber [5], ontologies share and reuse knowledge of the world. In fact, according to the author, "the term ontology means a specification of concepts, that is, an ontology is a formal description of concepts and of the relations existing between them in a determined domain" [5] apud [6]. According to Ortiz [7], p. 2, ontology-based semantics in Natural Language Processing (NLP) serves: (a) to support the translation of lexical gaps; (b) to support disambiguation, both lexical and structural; (c) to give adequate treatment to the phenomenon of synonymy. At the same time, Tiscornia [8], p. 1, says that for the development of computational applications it is necessary to treat the models of human cognitive mechanisms and the process of knowledge formation individually, and that formal ontology, one of the most recent approaches to knowledge modeling, is, in reality, a revisitation of philosophical and linguistic theories. In this sense, ontological categories are "subdivisions of a classification system used to catalog knowledge, for example, based on data" [8], p. 4. The most common taxonomy of an ontology is of the hereditary type, where classes and sub-classes maintain hierarchical relationships in the shape of trees. The hierarchical taxonomy can be verified from the moment we have axioms of the type: (1) every land animal is an animal, therefore a living entity, a concrete entity and an entity: a dog is an animal, a living being and concrete. The members of the same category or sub-category have some properties in common: in the sub-category "land animal", for example, its members "bull", "dog", "rabbit" have paws, walk, and do not speak; their common properties are therefore inherited upon the insertion of a word into one category or another. In Zavaglia [3], the above-mentioned ontology for Brazilian Portuguese is developed further.
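A tree-shaped hereditary taxonomy of this kind is straightforward to encode; the sketch below uses an invented fragment of such a hierarchy (not the published BP ontology) and shows both the upward chain retrieval used later for forms like CABO(1a) and the property inheritance just described.

# child -> parent links (tree-shaped) and properties attached to categories.
parent = {
    "land animal": "animal",
    "animal": "living entity",
    "living entity": "concrete entity",
    "concrete entity": "entity",
}
properties = {
    "land animal": {"has paws", "walks", "does not speak"},
    "living entity": {"is alive"},
}

def ancestors(category):
    # Chain up to the topmost category, e.g. animal -> ... -> entity.
    chain = []
    while category in parent:
        category = parent[category]
        chain.append(category)
    return chain

def inherited_properties(category):
    # A member inherits the properties of every category above it.
    props = set(properties.get(category, ()))
    for c in ancestors(category):
        props |= properties.get(c, set())
    return props

print(ancestors("land animal"))
print(sorted(inherited_properties("land animal")))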
6 Homonymy in Ontological Structuring
We would like to point out that, at the moment, in the NLP field, especially when dealing with Knowledge-Based Lexical Systems, it is agreed that the inclusion of this type of semantic repository, i.e., of the ontological type for meaning representation, is essential. There is a need to offer, in a structured and organized form, a common lexicon used consistently by a given community. Ontologies have been widely used in the knowledge representation of restricted domains, especially for document information or indexing search systems, where their application can be more efficient because it deals with lexical sets of finite size. In a Lexical Knowledge Base – LKB, for example, an ontology can serve as a support resource to the information contained in the lexical repository of the base, making it possible to retrieve the meaning of a lexical item in an unambiguous form. Indeed, the linguistic-classification resources that the use of an ontology may offer the linguist and/or lexicographer allow them to individualize uniformly, within the various meanings attributable to the same lexical item, the pertinent meaning from within the array of polysemic meanings that the word may contain, and in this way to neutralize the polysemy that characterizes these same homonymous forms.
7 LKB Representation Modules
The proposed Lexical Knowledge Base – LKB contains five modules of representation. In this paper, we will present only the ones pertinent to the Qualia Structure and to ontological information. All these modules are correlated in a way that allows the information contained in them to be linked and interconnected, depending on the type of search the user intends to make in the system. Each one of the modules presents the word, that is, the Semantic Unit [SemU] being searched, along with its characterization, i.e., what type of homonym it is. All the terms used in the modules were designed to be explanatory links, that is, the user will receive information, definitions and explanations about the linguistic phenomenon of homonymy. The objective is to be instructive: the user is expected to "learn" what "homonymy" is, what a "Homonymous Single-Category Monosemous Form" is, what a [SemU] or [HomoU] is. Furthermore, the linguistic phenomenon of homonymy was not the only one studied: the LKB also contains information on polysemy and monosemy. Information about Pustejovsky's Qualia Structure [1] was included as well; consequently, the user may learn the meaning of a "formal role", a "constitutive role", a "telic role" or an "agentive role". Thus every term, acronym or abbreviation designated as a link will be underlined: [SemU], Homonymous Single-Category Monosemous Form, Semantic Homonymy, etc.
7.1 The LKB
With the objective of visualizing some advantages of having information of a diversified nature stored in an electronic database, we created and propose an interface for access to the Lexical Knowledge Base – LKB data (see the prototype of the LKB in [3]). The LKB development process can be divided into two distinct steps: (i) modeling and implementation of the database and (ii) implementation of the data access interface. To build the LKB it was necessary to model the database according to the Relational Data Model. This model was presented for the first time in 1970, by Codd, and has been widely used over the years in the development of applications that use databases [9]. This model uses relations as its fundamental data structure, represented by tables that list the stored data. The real-world categories that must be analyzed and stored are called entities, and each entity must be stored in a register (or row) of a table. The fundamental singularity of this model is that it avoids data redundancy by using the normalization of the tables. In this way, data pertinent to the same entity is stored in different tables, and, at the moment the data is accessed, these tables are carefully analyzed and correlated, as sketched below. In the beginning, the LKB data was stored in text files. So that this data could be automatically transferred to the base, it became essential to develop a computational tool to make the necessary conversions. This tool reads the entry file line by line, adequately separates the data and inserts it into the proper tables, relating the records to one another. The computational language used for the implementation of this tool was Delphi. After the insertion of the data, the next step was to develop the interface to access the data. It should be noted that this project is in progress and the prototypes of the interfaces are gradually being modified, as well as being fed with new linguistic information. One of the objectives of developing an application of such a nature is to make it available to the greatest possible number of users and, to that end, we chose to develop an interface with Web access (World Wide Web). A search is set off by a word in the Portuguese language and, to access the stored data, the user must choose one of the five available modules. As previously mentioned, this paper details the ontological and qualia modules.
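To make the normalization idea tangible, here is a much-reduced, hypothetical two-table schema in the spirit of the description above (the actual LKB schema and its Delphi loading tool are not reproduced here); data about the same Semantic Unit lives in separate tables and is correlated through keys at query time.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE semantic_unit (
    id INTEGER PRIMARY KEY,
    form TEXT,
    homonymy_type TEXT);
CREATE TABLE qualia (
    unit_id INTEGER REFERENCES semantic_unit(id),
    role TEXT CHECK (role IN ('formal','constitutive','telic','agentive')),
    value TEXT);
""")
db.execute("INSERT INTO semantic_unit VALUES (1, 'banco', 'HSMF')")
db.execute("INSERT INTO semantic_unit VALUES (2, 'banco', 'HSMF')")
db.executemany("INSERT INTO qualia VALUES (?, ?, ?)",
               [(1, 'telic', 'to sit'), (2, 'telic', 'to negotiate')])

# Joining the normalized tables reconstructs the full view of each unit.
for row in db.execute("""
        SELECT s.id, s.form, q.role, q.value
        FROM semantic_unit s JOIN qualia q ON q.unit_id = s.id
        WHERE s.form = 'banco'"""):
    print(row)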
7.2 Ontological Module
The Ontological Module contains information about the fundamental Classes and the Domain, that is, the distribution of the homonymous forms in conceptual categories (ontological distribution is, essentially, a manual, human activity, at least at the present state of the art). In this way, the conceptual organization of a homonymous form begins with relations of hyponymy [is a kind of...] and hypernymy [is a superkind of...], the form also being included in a specific world domain [belongs to domain...]. Besides this, with the LKB it is possible to retrieve all the categories with which the homonymous form is correlated, up to its topmost category, as in the word CABO(1a): {→ live entity → concrete entity → entity}.
7.3 Qualia Structural Module
The Qualia Structure module contains information about the semantic relations existing between two Semantic Units, according to their roles (formal, telic, constitutive, agentive) in the Qualia Structure. These semantic relations retrieve the multiple dimensions of meaning of a homonymous form. The Qualia roles were designed as links that provide the user with their meaning.
8 Final Considerations and Future Perspectives
The computational version of this work was initially prompted by two motives: (i) the fact that we could demonstrate that the result of our proposal could be real and not destined only for the "virtual" world – consequently, we established the validity of the linguistic analyses made to build the linguistic framework using homonymous items as entry words, since they were capable of supporting computational implementation; (ii) the fact of being able to demonstrate the advantages of having information of a diversified nature stored in an electronic database. Among these advantages we can highlight: (i) the quick retrieval of varied linguistic information about homonymous items; (ii) specialized search for certain linguistic information, by means of automatically generated lists that may be used in several types of research; (iii) the potential use of the linguistic data contained in the LKB lexical repertoire in Systems of Natural Language Processing, in Search Engines, Semantic Parsers, Disambiguators, Automatic Translation, Taggers, etc. In effect, the fact that we included a varied range of linguistic information of a pluridimensional nature (lexical, morphosyntactic, ontological, qualia, disambiguating) permits us to foresee its diverse applications. Concurrently, the first perspective of future study that causes tremendous enthusiasm is the possibility of enriching and expanding the Lexical Knowledge Base with lexical items other than homonyms. In fact, studies should be made of monosemic and polysemic words. Regarding the homonymy phenomenon, there is still a great deal to be done, since we only dealt with one of its types, the semantic one. We should also work with Categorial Homonyms, Heterophonous Homonyms and Etymological Homonyms, even though we have already considered some cases where homonyms are distinct in regard to their etyma, such as the case of "manga". Finally, there is still the possibility of including new semantic relations, syntactic information, information on the argument structure, and the systematic insertion of synonyms and antonyms for the lexical items, to name a few.
References

1. Pustejovsky, J.: The Generative Lexicon. The MIT Press, Cambridge (1995)
2. Moravcsik, J.M.E.: Sameness and Individuation. Journal of Philosophy 70 (1973) 513–526
3. Zavaglia, C.: Análise da homonímia no português: tratamento semântico com vistas a procedimentos computacionais. Tese de Doutorado. Universidade Estadual Paulista, Araraquara (2002)
4. Biderman, M.T.C.: Dicionário didático de português. 2 ed. Ática, São Paulo (1998)
5. Gruber, T.R.: Toward Principles for the Design of Ontologies Used for Knowledge Sharing. Presented at the Padua Workshop on Formal Ontology, March 1993, to appear in an edited collection by Nicola Guarino.
6. Braga, J.L., Torres, K.S., Botelho, F.C.: Reengenharia e Visualização de Conceitos no WordNet. Universidade Federal de Viçosa.
7.
8.
9.
Using Adaptive Formalisms to Describe Context-Dependencies in Natural Language

João José Neto and Miryam de Moraes

Lab. de Ling. e Tecnologias Adaptativas, Esc. Politécnica da Univ. de S. Paulo
Av. Prof. Luciano Gualberto tr. 3 n. 158, Cid. Universitária, 05508-900 S. Paulo, Brazil
{joao.jose,miryam.moraes}@poli.usp.br
http://www.pcs.usp.br/~lta
Abstract. This text sketches a method based on adaptive technology for representing context-dependencies in NL processing. Building on a previous work [4] dedicated to syntactical ambiguities and non-determinisms in NL handling, we extend it to consider context-dependencies not previously addressed. Although based on the powerful adaptive formalism [3], our method relies on adaptive structured pushdown automata [1] and grammars [2], resulting in simplicity, low cost and efficiency.
1 Introduction
Since low-complexity language formalisms are too weak to handle NL, stronger formalisms are required, most of them resource-demanding, hard to use or impractical. Structured pushdown automata are excellent for representing the regular and context-free aspects of NLs by allowing them to be split into a regular layer (implemented as finite-state machines) and a context-free one (represented by a pushdown store). Such a device accepts deterministic context-free languages in linear time, and is suitable as an underlying mechanism for adaptive automata, allowing it to handle – without loss of simplicity and efficiency – languages more complex than context-free ones. Classical grammars may describe non-trivial interdependencies between and inside sentences: attribute-, two-level-, evolving- and adaptive grammars. Here, context dependency is handled with adaptive grammars (which may be converted [2] into structured pushdown adaptive automata [3]) by executing adaptive actions attached to the rule being used (stating self-modifications – rule additions and deletions – to be imposed). When a context-dependency is detected, one such rule is applied, and the attached adaptive action learns to handle the context dependency by adequately changing the underlying grammar. Starting from an initial grammar, the adaptive device follows its rules until some new context-dependency is handled. Thereafter, its operation follows the modified underlying grammar until either the sentence is fully derived or no matching rule is found. Complex languages, e.g. NLs, may be handled in this way, since adaptive grammars have type-0 power [1], [2]. By converting them into adaptive structured pushdown automata, simplicity and efficiency are achieved through piecewise-regular handling of the language, validating adaptive devices as practical and efficient for NL handling [5].
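Before the formal example, a deliberately tiny sketch may convey the adaptive idea: a recognizer whose rule set is edited while it runs. The toy below accepts a^n b^n c^n, a classic non-context-free language; it is our own analogy in Python, not the authors' structured pushdown formulation.

def accepts(sentence):
    rules = {}                  # learned (state, symbol) -> state transitions
    seen_a, state = 0, "A"
    for symbol in sentence:
        if state == "A" and symbol == "a":
            seen_a += 1
            # adaptive action: for each 'a', learn one b-step and one c-step
            rules[(f"B{seen_a}", "b")] = f"B{seen_a - 1}"
            rules[(f"C{seen_a}", "c")] = f"C{seen_a - 1}"
            continue
        if state == "A":        # first non-'a': enter the learned rules
            state = f"B{seen_a}"
        if (state, symbol) not in rules:
            return False
        state = rules[(state, symbol)]
        if state == "B0":       # all b's consumed: switch to the c-chain
            state = f"C{seen_a}"
    return state == "C0"

print(accepts("aabbcc"), accepts("aabbc"))   # True False

The point of the analogy is the attached action: consuming one 'a' installs exactly the rules needed later, so the recognizer stays piecewise regular at every instant.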
2 Illustrating Example
This example illustrates nominal agreement in Portuguese using an adaptive grammar [2] and considers: attractive agreement for adjectives placed before nouns coordinated with the preposition "e" or a comma (e.g. As antigas mansões e parques) and grammatical agreement for adjectives placed after such nouns (e.g. Os parques e mansões restaurados). Our adaptive grammar is defined as a 3-tuple (G0, T, R0) where:

T = finite set of adaptive functions
G0 = (VN0, VT, VC, PL0, PD0, S), the initial grammar
VN0 = non-empty finite set of non-terminals
VT = non-empty finite set of terminals, with VN0 ∩ VT = Ø
VC = finite set of context symbols
V0 = VN0 ∪ VT ∪ VC, where VN0, VT, VC are disjoint sets
S ∈ VN0, the start symbol of the grammar
PL0 = rules used in context-free situations
PD0 = rules used in context-dependent situations
R0 = PL0 ∪ PD0

The example refers to G = (G0, T, R0) with:

VN0 = {C, C1, C2, C3, C4, C5, C6, C7, D, A, S, C8a, C8l, ESM, ESF, EPM, EPF}
VC = {sm, sf, pm, pf}
VT = {as, e, antigas, mansões, parques, restaurados, praças, ","}

Context symbols sm, sf, pm, pf denote the attributes singular/plural masculine/feminine. D, A, S denote determinants, adjectives and nouns, respectively. The starting symbol is C. Adaptive functions dynamically handle optional elements: further nouns, a determinant, an adjective placed before/after the noun. A context-dependency is handled by an adaptive function when the noun is processed: it checks its agreement with the previous determinant and adjective. Another adaptive function enforces agreement between the adjective and multiple nouns. PL0 and PD0 are:

C → {A1(C, C1)} D C1
sf C1 → {A1c(C1, C, ES, EF)} ESF
pf C1 → {A1c(C1, C, EP, EF)} EPF
sm C1 → {A1c(C1, C, ES, EM)} ESM
pm C1 → {A1c(C1, C, EP, EM)} EPM
C → {A2(C, C2, C1)} A C2
sf C2 → {A1c(C2, C, ES, EF)} ESF
pf C2 → {A1c(C2, C, EP, EF)} EPF
sm C2 → {A1c(C2, C, ES, EM)} ESM
pm C2 → {A1c(C2, C, EP, EM)} EPM
C → S C3
sf C3 → {A3c(C6, C1, C2, C, ES, EF)} ESF
sm C3 → {A3c(C6, C1, C2, C, ES, EM)} ESM
pf C3 → {A3c(C6, C1, C2, C, EP, EF)} EPF
pm C3 → {A3c(C6, C1, C2, C, EP, EM)} EPM
ESM → C    ESF → C    EPM → C    EPF → C
C7 → Ø
C3 → ε
C3 → C6
C3 → "e" C4
C3 → "," C4
C4 → S C5
sm C5 → {A4(C6, ES, EM)} ESM
sf C5 → {A4(C6, ES, EF)} ESF
pm C5 → {A4(C6, EP, EM)} EPM
pf C5 → {A4(C6, EP, EF)} EPF
C3 → {A5(C3, C6, C7, C8a, C8l)} A C7
ESM → C3    ESF → C3    EPF → C3    EPM → C3
96
J.J. Neto and M. de Moraes
Adaptive Actions:

A1(XC, XC1) = /* remove extra determinant */
{ −[XC → {A1(XC, XC1)} D XC1] }

A1c(x1, x2, xn, xg) = /* delete transitions with improper context symbols */
{ −[sm x1 → {A1c(x1, x2, xn, xg)} ESM]
  −[sf x1 → {A1c(x1, x2, xn, xg)} ESF]
  −[pm x1 → {A1c(x1, x2, xn, xg)} EPM]
  −[pf x1 → {A1c(x1, x2, xn, xg)} EPF]
  /* ATK: dummy adaptive action; it memorizes determinant attributes */
  +[x1 → {ATK(xn, xg)} x2] }

A2(XC, XC2, XC1) =
{ −[XC → {A1(XC, XC1)} D XC1]          /* disable further det. or adj. */
  −[XC → {A2(XC, XC2, XC1)} A XC2] }

A3c(xc6, xc1, xc2, xc, xn, xg) =
{ dn, dg, an, ag :
  /* memorize noun attributes */
  A4(xc6, xn, xg)
  /* fix inflection of determinant and adjective before noun */
  −[xc1 → {ATK(dn, dg)} xc]
  −[xc2 → {ATK(an, ag)} xc]
  +[xc1 → {ATK(xn, xg)} xc]
  +[xc2 → {ATK(xn, xg)} xc] }

A4(xc6, xn, xg) =
{ x, s* :
  /* ATKS: dummy adaptive action; it memorizes noun attributes */
  −[x → xc6]
  +[x → {ATKS(xn, xg)} s]
  +[s → xc6] }

A5(xc3, xc6, xc7, xc8a, xc8l) =
{ x, xsm, xsf, xpm, xpf, xn1, xn2, xn3, xn4, xg1, xg2, xf1, xf2, x1, x2, xaux1*, xaux2* :
  /* imposes attractive agreement */
  ?[x → xc6]
  ?[xsm → {ATKS(ES, EM)} x]
  ?[xsf → {ATKS(ES, EF)} x]
  ?[xpf → {ATKS(EP, EF)} x]
  ?[xpm → {ATKS(EP, EM)} x]
  +[sm xc7 → {AT(xsm, xc7, xaux1)} xaux1]
  +[sf xc7 → {AT(xsf, xc7, xaux1)} xaux1]
  +[pm xc7 → {AT(xpm, xc7, xaux1)} xaux1]
  +[pf xc7 → {AT(xpf, xc7, xaux1)} xaux1]
  /* initializes logical agreement */
  ?[xc3 → {ATKS(xn1, EF)} xf1]
  ?[xc3 → {ATKS(xn2, EM)} xm1]
  ?[xf1 → {ATKS(xn3, xg1)} x1]
  ?[xm1 → {ATKS(xn4, xg2)} x2]
  +[pf xc7 → {Z(xf1, x1, xaux2, xc8l)} xaux2]
  +[pm xc7 → {Z(xm1, x2, xaux2, xc8l)} xaux2]
  CL(xf1, xaux2, xc8l, xc7) }

CL(xf1, xaux2, xc8l, xc7) = /* completes logical agreement */
{ xn1, xn2, xm, xf :
  ?[xf1 → {ATKS(xn1, EM)} xm]
  ?[xf1 → {ATK(xn2, EF)} xf]
  +[pm xc7 → {Z(xm, xm, xaux2, xc8l)} xaux2]
  EliminaPF(xm, xc7, xaux2)
  CL(xf, xaux2, xc8l, xc7) }

EliminaPF(xm, xc7, xaux2) = /* removes the plural feminine agreement */
{ x, y, z :
  −[pf xc7 → {Z(x, y, xaux2, z)} xaux2] }

Z(x, y, z, xc8l) = /* inserts a transition to a final state */
{ +[z → xc8l] }

AT(x, xc7, xaux1) =
{ xsmx, xsfx, xpmx, xpfx :
  /* performs transition self-removal */
  −[sm xc7 → {AT(x, xc7, xaux1)} xsmx]
  −[sf xc7 → {AT(x, xc7, xaux1)} xsfx]
  −[pm xc7 → {AT(x, xc7, xaux1)} xpmx]
  −[pf xc7 → {AT(x, xc7, xaux1)} xpfx]
  /* inserts an adequate transition */
  +[sm xc7 → xsmx]
  +[sf xc7 → xsfx]
  +[pm xc7 → xpmx]
  +[pf xc7 → xpfx] }

ATK(xn, xg) = { }
ATKS(xn, xg) = { }
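Read as rule-set editing, the elementary actions above are straightforward to simulate: −[...] removes a rule, +[...] inserts one, and ?[...] queries the current rule set. The following Python sketch is purely illustrative and uses none of the paper's actual machinery; it only mocks the core idea of a device whose rule set is mutable at parse time.

    # Minimal mock-up of an adaptive rule-driven device. Each rule is a
    # pair (lhs, rhs) of symbol tuples; adaptive actions edit the set
    # while input is consumed. All names here are illustrative.
    class AdaptiveGrammar:
        def __init__(self, rules):
            self.rules = set(rules)

        def remove(self, rule):      # the "-[rule]" elementary action
            self.rules.discard(rule)

        def insert(self, rule):      # the "+[rule]" elementary action
            self.rules.add(rule)

        def query(self, lhs=None):   # the "?[...]" elementary action
            return [r for r in self.rules if lhs is None or r[0] == lhs]

    # Sketch of an A1-style action: once a determinant has been read,
    # the rule accepting a determinant is removed, so a second
    # determinant in the same noun phrase is no longer derivable.
    def A1(g, det_rule):
        g.remove(det_rule)

    g = AdaptiveGrammar({(("C",), ("D", "C1")), (("C",), ("A", "C2"))})
    A1(g, (("C",), ("D", "C1")))
    print(g.query(("C",)))   # only the adjective rule survives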
This simple example illustrates NL processing through adaptive formalisms. The following is a simplified derivation of the sentence "as antigas praças, parques e mansões restaurados" in our grammar.
C ⇒0 {A1(C, C1)} D C1 ⇒1 as pf C1 ⇒1 as {A1c(C1, C, EP, EF)} EPF
⇒2 as C ⇒2 as {A2(C, C2, C1)} A C2 ⇒3 as antigas pf C2 ⇒3 as antigas {A1c(C2, C, EP, EF)} EPF
⇒4 as antigas C ⇒4 as antigas S C3 ⇒4 as antigas praças pf C3 ⇒4 as antigas praças {A3c(C6, C1, C2, C, EP, EF)} EPF
⇒5 as antigas praças C3 ⇒5 as antigas praças, C4 ⇒5 as antigas praças, S C5 ⇒5 as antigas praças, parques pm C5 ⇒5 as antigas praças, parques {A4(C6, EP, EM)} EPM
⇒6 as antigas praças, parques C3 ⇒6 as antigas praças, parques e C4 ⇒6 as antigas praças, parques e mansões pf C5 ⇒6 as antigas praças, parques e mansões {A4(C6, EP, EF)} EPF
⇒7 as antigas praças, parques e mansões C3 ⇒7 as antigas praças, parques e mansões {A5(C3, C6, C7, C8a, C8l)} A C7
⇒8 as antigas praças, parques e mansões restaurados pm C7 ⇒8 as antigas praças, parques e mansões restaurados {Z(S2, S2, xaux2, C8l)} xaux21
⇒9 as antigas praças, parques e mansões restaurados C8l ⇒9 as antigas praças, parques e mansões restaurados.
Remark. "⇒i", i ∈ N, denotes derivation over the rules PLi ∪ PDi; these rule sets become available after the execution of the adaptive actions.
3 Conclusion
Many formalisms are reported in the literature [5] for the representation of NLs and their processing by computer. Extending the results achieved in our previous works, this paper reports a proposal for the implementation of a method for modeling, representing and handling context-dependencies in NLs by means of adaptive devices [3]. The incremental, dynamic nature of our device makes it an attractive, low-cost option with good time and space performance, both static and dynamic.
References

[1] Neto, João José: Contribuição à metodologia de construção de compiladores. São Paulo, 1993, 272p. Thesis (Livre-Docência), Escola Politécnica, Universidade de São Paulo. [In Portuguese]
[2] Iwai, M.K.: Um formalismo gramatical adaptativo para linguagens dependentes de contexto. São Paulo, 2000, 191p. Doctoral Thesis, Escola Politécnica, Universidade de São Paulo.
[3] Neto, J.J.: Adaptive rule-driven devices – general formulation and case study. CIAA'2001 Sixth International Conference on Implementation and Application of Automata. Lecture Notes in Computer Science, Springer-Verlag, Pretoria (2001)
[4] Neto, J.J., Moraes, M.: Formalismo adaptativo aplicado ao reconhecimento de linguagem natural. Anais da Conferencia Iberoamericana en Sistemas, Cibernética e Informática, 19–21 de Julio, 2002, Orlando, Florida (2002)
[5] Taniwaki, C.Y.O.: Formalismos Adaptativos na Análise Sintática de Linguagem Natural. MSc Dissertation, Escola Politécnica da Universidade de São Paulo (2002)
Some Regularities of Frozen Expressions in Brazilian Portuguese

Oto Araújo Vale

Faculdade de Letras, Universidade Federal de Goiás
Caixa Postal 131, 74001-970 Goiânia, GO, Brazil
[email protected]
Abstract. In this paper we examine a class of 125 frozen expressions in Brazilian Portuguese. This class is one of the ten classes established in a typology of 3,550 verbal expressions, according to the distribution of the fixed and free components of each expression. Some regularities could be observed in this class, which allowed the construction of a graph to identify the expressions of this class in large corpora.
1 Introduction
Frozen expressions constitute a great problem in natural language processing. Gross [1] has shown that, for French, the number of verbal frozen expressions (VFE) is much larger than that of simple verbs. In this paper we present a class of 125 VFE, which belongs to a typology of Brazilian Portuguese VFE we established [2]. This typology was constructed in Lexicon-Grammar tables [3]. Lexicon-Grammar tables are binary matrices with the expressions in the rows and the syntactic and semantic properties in the columns. This kind of matrix is useful for visualizing the most frequent properties and is also easily used for natural language processing: some software packages – Intex [4] or Unitex [5] – use these matrices to construct graphs and finite-state automata to search large corpora [6].

In Vale [2], a typology of ten classes containing 3,550 verbal frozen expressions was established. This typology was set up according to the distribution of the frozen and free elements in each expression; only expressions with a free subject were classified. It can be observed in Table 1 that the simplest constructions have the largest numbers of VFE. The PB-C1 class, with only one frozen element without preposition, has a significant number of expressions, whereas PB-CPP, with two frozen elements, each introduced by a preposition, presents a reduced number of VFE. This classification allows us to observe many regularities in the constitution of VFE. Those regularities are interesting for the theoretical approach and also for NLP.
Table 1. Ten classes established by Vale [2] (for the notation, see footnote 1)

Class       Structure                    Example                            Quantity
PB-C1       N0 V C1                      Rui bateu as botas                 1206
PB-CP1      N0 V Prep C1                 Rui entrou pelo cano               660
PB-CDH      N0 V (C de Nhum)1            O filme encheu o saco de Rui       157
PB-CDN      N0 V (C de N)1               O aviso acendeu o pavio da crise   100
PB-C1PN     N0 V C1 Prep N2              Ana arrasta uma asa por Rui        321
PB-CP1PN    N0 V Prep C1 Prep N          Rui pisou no calo de Ana           127
PB-CNP2     N0 V N Prep C2               Rui colocou Ana para escanteio     341
PB-C1P2     N0 V C1 Prep C2              O governo pôs as cartas na mesa    423
PB-CPP      N0 V Prep C1 Prep C2         Rui mudou da água para o vinho     90
PB-C1P2DN   N0 V C1 Prep (C de N)2       Rui pôs água na fervura da CPI     125

2 The Regularities of a Class
The class chosen for presentation here has regularities that illustrate some of the procedures we think necessary for the treatment of VFE in NLP. The class, named PB-C1P2DN, was constructed after the observation that a set of expressions from the PB-C1P2 class accepts the unfolding of the last frozen element into an NP constituted of a frozen element and a free element:

1. As explicações de Rui puseram lenha na fogueira
2. A descoberta dos documentos pôs lenha na fogueira da CPI

It was observed that most expressions like (1) could be considered a substructure of (2), with the omission of the free element on the right. We considered this kind of regularity sufficient to constitute a new class of expressions. In constructing this new class, it was noted that 80% of its expressions were built with the following verbs: pôr, botar, colocar, enfiar, jogar, and meter, which belong to the same semantic field. In fact, many expressions may present an alternation of these verbs:

3. Rui (pôs+botou+colocou+enfiou+jogou+meteu+deitou) lenha na fogueira da CPI

In spite of this, it cannot be concluded that all the expressions of this class have this property: many expressions accept the alternation of some of these verbs, but reject others:

4. FHC (pôs+botou+colocou+enfiou+meteu) a mão no bolso do contribuinte
5. * FHC (jogou+deitou) a mão no bolso do contribuinte

Thus, even in a relatively homogeneous class like this one, it is necessary to verify each property of each expression case by case. This means that, for all classes of VFE, an exhaustive study of the properties of each expression must be accomplished, and quick generalizations must be avoided.¹
¹ The current notation of Lexicon-Grammar theory is used here: Ni (i = 0, 1, 2, 3) is a free NP (the zero index denotes the NP in subject position); Nhum is a human NP; V is the verb; C is the frozen nominal element.
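Such a Lexicon-Grammar table is, at bottom, a binary matrix. The sketch below is our own toy illustration in Python: it encodes only the verb-alternation judgements of examples (3)–(5) above, not the actual 125-row table.

    # Toy Lexicon-Grammar matrix: one row per frozen expression, one
    # boolean column per accepted verb variant. The two rows below
    # encode only the judgements of examples (3)-(5).
    VERBS = ["pôr", "botar", "colocar", "enfiar", "jogar", "meter", "deitar"]

    MATRIX = {
        "lenha na fogueira (de N)": [1, 1, 1, 1, 1, 1, 1],
        "a mão no bolso (de N)":    [1, 1, 1, 1, 0, 1, 0],
    }

    def accepted_variants(expression):
        """Return the verb variants licensed for a frozen expression."""
        row = MATRIX[expression]
        return [v for v, ok in zip(VERBS, row) if ok]

    print(accepted_variants("a mão no bolso (de N)"))
    # ['pôr', 'botar', 'colocar', 'enfiar', 'meter'] -- jogar/deitar rejected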
Fig. 1. Graph describing the class PB-C1P2DN
In the specific case of the PB-C1P2DN class, it was realized, a posteriori, that the verbs jogar and deitar cannot be used when the first frozen element is a part of the body. The regularity of this class allows the creation of an FS graph without the procedure proposed by Senellart [7]. In fact, that procedure is useful for processing large classes, but has the inconvenience of creating one FS graph for each expression. The graph editor of Unitex was used to build a graph that assembles most expressions of the PB-C1P2DN class. This procedure allows one to visualize the distribution of the elements; it is possible due to the small number of verbs in this class and the alternation shown in (3). The graph in Fig. 1 is constructed to present all the details of these expressions; neither the passive forms of the expressions nor the possibility of modifier insertion is shown. It can be seen, for example, that an expression like meter a mão em vespeiro has some constraints on the determiner. In fact, the indefinite determiner can only be used without the free NP on the right.

6. Rui meteu a mão num vespeiro
7. Rui meteu a mão no vespeiro da partilha de bens
8. * Rui meteu a mão num vespeiro da partilha de bens

Unitex was used to locate the graph's expressions in the whole 1997 text of Folha de S. Paulo (about 94 million words), and the concordance was obtained. It is interesting to observe, in that concordance, the strings identified and those which constitute a VFE. In a first approach, it becomes important to separate the compositional occurrences from the frozen ones. Five of the nine occurrences of the string
of the VFE (
3 Conclusion
The VFE phenomenon needs a truly detailed approach. It should be kept in mind that we have approached a small class of VFE, one which presents a certain homogeneity in its distribution. For the other classes, it will be interesting to observe the number of strings that appear only with the frozen sense, and those that have both compositional and frozen occurrences. For the latter kind of expression, it will be necessary to build a set of local grammars [7] to eliminate the ambiguity.
References

1. Gross, M.: Une classification des phrases "figées" du français. Revue québécoise de linguistique, Vol. 11, n. 2 (1982) 151–185
2. Vale, O.A.: Expressões cristalizadas do português do Brasil: uma proposta de tipologia. Tese (Doutorado), Universidade Estadual Paulista, Araraquara (2001)
3. Gross, M.: Méthodes en syntaxe. Hermann, Paris (1975)
4. Silberztein, M.: Dictionnaires électroniques et analyse automatique de textes: le système INTEX. Masson, Paris (1993)
5. Paumier, S.: Manuel Unitex. http://www-igm.univ-mlv.fr/~unitex/manuelunitex.pdf (2002)
6. Senellart, J.: Reconnaissance automatique des entrées du lexique-grammaire des phrases figées. Travaux de linguistique, n. 37 (1998) 109–125
7. Laporte, E., Monceaux, A.: Elimination of lexical ambiguities by grammars: the ELAG system. Lingvisticae Investigationes XXII (1998–1999) 341–367
Selva: A New Syntactic Parser for Portuguese

Sheila de Almeida, Ariadne Carvalho, Lucien Fantin, and Jorge Stolfi

Institute of Computing, State University of Campinas
Cx. Postal 6176, 13084-971 Campinas (SP), Brazil
{sheila.almeida,ariadne,lucien.fantin,stolfi}@ic.unicamp.br
Abstract. We describe an ongoing project whose aim is to build a parser for Brazilian Portuguese, Selva, which can be used as a basis for subsequent research in natural language processing, such as automatic translation and ellipsis and anaphora resolution. The parser is meant to handle arbitrary prose texts in unrestricted domains, including the full range of coordination and subordination constructs. The parser operates separately on each sentence, and outputs all its syntactically valid derivation trees.
1 Introduction
Here we describe Selva, a new syntactic parser for Brazilian Portuguese, designed to deal with arbitrary prose text, without domain or context restrictions, and allowing the full range of coordination and subordination constructs. The parser operates separately on each sentence (as delimited by periods or other full stops), and outputs all its syntactically valid parsings. We consider only plain running text, excluding styles with special structures like poetry and headlines.

The main obstacles to the robust parsing of real-world text are the enormous number of linguistic constructions that must be handled, and the numerous syntactic ambiguities. The latter can only be resolved by semantic analysis, which requires an intelligent understanding of the whole text and of its origins – which in turn requires an impossibly vast and complex world model, and powerful logical/probabilistic inference methods. This difficulty is especially serious for unrestricted text, where even green ideas may sleep furiously on occasion.

A parser that uses only syntactic criteria, such as word categories and word order, will be unable to choose the correct parsing among all possible parsings for the same sentence; it will have to guess, based on some heuristic or statistical rules. Considering that a typical sentence of moderate length may require dozens of such choices, the chances of making the right guess at every one will be very small. The only way that a parser can approach a 100% success rate is by listing all, or nearly all, syntactically valid parsings for each sentence. Even though an n-word input may admit thousands of such parsings, they are only a tiny fraction of all possible trees with n leaves. We concluded that a tool that found all valid parsings would be very useful for language processing, e.g. as a front-end syntactic filter for a restricted-domain semantic analyzer.
Having chosen to generate all possible parsings, we found it best to avoid many traditional rules which are inspired by semantic criteria – such as transitivity – as being too unreliable. We also decided against statistically-based filters, since rule usage probabilities are strongly dependent on semantic context. The Selva parser assumes that the input is syntactically correct, and makes no effort to reject ungrammatical sentences. Finally, we do not try to flag meta-syntactic constructs such as passive voice or cleft predicatives (Ela é quem fez o bolo). On the other hand, in order to keep the size of the output within tolerable limits, and to avoid generating invalid parsings for correct sentences, we found it necessary to enforce certain syntactic restrictions, such as person and gender agreement, and to exclude some phrase structures which, although formally valid, are too rare to be worth considering. For instance, we do not recognize clauses with untopicalized object-subject-verb order (as in O carro Maria comprou) – even though they occasionally occur, even in prose.
2 Related Work
There are surprisingly few projects involving syntactic analysis of Portuguese [10]. Moreover, some of them are commercial projects unavailable for research, while others were developed for very limited domains. We are aware of only two accessible parsers that can be compared to ours: Curupira and VISL.

The Curupira parser of Martins, Hasegawa, and Nunes [4] was developed as part of ReGra, a commercial grammar checker [5]. Like our parser, Curupira assumes that the sentence is correct and generates all possible parse trees. The parser is still under development, and its source has not been released.

The VISL parser was developed by a team led by Eckhard Bick within the Visual Interactive Syntax Learning project [1]. Apparently the source code is not available, but the parser itself can be used for demonstrations through the Internet. The parser is very robust and produces generally good results; however, it only provides one parsing for each sentence.
3 The Grammar
We encode the syntax by a context-free grammar with markers. The syntax is loosely based on standard Portuguese grammars [7,9,6,2], but we were forced to deviate from them in many points, chiefly due to the absolute lack of semantic information. (Even grammarians who take pains to separate syntax from semantics, like Perini, occasionally define syntactic categories by semantic tests.) Another reason was the need to handle complex coordination phenomena which are ignored by most grammars.

Categories and Functions. As usual, we assign two labels to each phrase in a sentence: its syntactic category, and its syntactic role or function within the immediately containing unit. The most important syntactic categories are
sentence, clause, verb, noun, adverb, adjective, and prepositive (prepositional phrase). (Other categories like preposition and article occur as constituents of these.) These categories are not disjoint; for example, in eu quero correr, the word correr is classified as a verb at a lower level of the parse tree, as a clause at a higher level, and as a noun further up.

The top-level syntactic category is the sentence, which we define as a sequence of words delimited by strong punctuation (full stop, colon, semicolon, question, or exclamation marks). The clause category applies to sentences, or parts thereof, that consist of a verb and all its syntactically bound operands and modifiers, including subordinate clauses. (Most sentences are clauses, but occasionally one finds verb-less sentences such as in Ele demorou. Bastante.)

Markers. Each category is further subdivided into sub-categories, characterized by certain parameters (markers) which can assume a finite range of values. Thus, for example, the noun class comprises twelve main sub-categories, characterized by three markers: gender (mas or fem), number (sin or plu), and person (1, 2, or 3). These markers are used to implement agreement constraints, and are strictly syntactical, not semantical; many phrases, such as lápis or que saiu, are ambiguous with respect to them. Nouns generally have person 3, except for pronoun-based ones such as tu or apenas eu e você.

Adjective phrases have the same sub-categories as nouns. Here the person marker is significant only for adjectives built from subordinate clauses: que fomos contratados has person 1, que foram contratados has person 3. Adverbs and prepositives do not have gender, person, or number markers.

Verb phrases are sub-classified by four main markers: mood, person, number, and gender. The mood can have seven possible values, such as indicative (ind), subjunctive (sub), past participle (par), etc. Other markers identify verb phrases which include oblique (clitic) pronouns. We found no compelling reason to mark verb forms for tense (past, present, etc.), or to distinguish copular verbs like ser from ordinary ones. We also did not find it useful to mark verbs for transitivity, since every transitive verb may be used with an elided object, and many supposedly intransitive verbs can have special meanings which are transitive.

Clauses have several markers, such as the mood of the clause's main verb. An important one is incompleteness, which characterizes clauses that have an elided noun constituent, like Maria comprou [] or eu disse que [] saíram. Such clauses are used to build adjective phrases (see below).
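As an illustration of how such markers support agreement checking, consider the sketch below. It is our own simplification, not Selva code (Selva is written in Prolog): ambiguity is modelled by letting a word carry several alternative marker bundles, and agreement holds if the sets share a reading.

    # Agreement as feature-set intersection: an ambiguous word such as
    # "lápis" carries several (gender, number) alternatives, and two
    # items agree if their alternative sets intersect. Illustrative only.
    WORD_FEATURES = {
        "lápis": {("mas", "sin"), ("mas", "plu")},  # ambiguous in number
        "o":     {("mas", "sin")},
        "os":    {("mas", "plu")},
        "verde": {("mas", "sin"), ("fem", "sin")},  # gender-invariable
    }

    def agree(w1, w2):
        """True if some common (gender, number) reading exists."""
        return bool(WORD_FEATURES[w1] & WORD_FEATURES[w2])

    print(agree("o", "lápis"))    # True  -- singular reading available
    print(agree("os", "verde"))   # False -- "verde" has no plural reading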
3.1 Clause Structure
A normal clause consists of a single verb phrase, surrounded by optional noun, adverb, adjective, or prepositive phrases, possibly with some punctuation. Each of these constituents has a function in the clause, which can be topic (T), subject (S), object (O), object complement (C), or clause modifier (M). There may be one or more topics T at the beginning of the clause. Each topic is a noun, adverb, adjective, or prepositive phrase, or an expletive, which
includes vocatives and interjections, always followed by a comma, such as O gato, ninguém o viu, or Branco, vermelho, qualquer cor serve. Other than topics and appositives (see below), at most three constituents of a clause can be noun phrases; these are assigned the functions S, O and C. The object complement C may also be a non-clausal adjective. The complement occurs, for example, in the sentences ele nomeou Pedro ministro (noun) and ele deixou os eleitores frustrados (adjective). There may be any number of clause modifiers M, inserted before, between, or after the S, V, O, and C constituents. Each M may be either an adverb, an adjective, or a prepositive.

The subject S must agree with the verb in person and number, but there is no such constraint between them and the object O. When the complement C is an adjective, it must agree with the object O in gender and number. Any clause modifiers M which are adjectives must agree with the subject. (Note that these constraints often allow us to distinguish an adjectival C from an adjectival M, as with the word frustrados in the previous example.)

If we ignore T and M phrases, we have 24 potential orders for the constituents S, V, O, and C, which become 49 considering that S, O, and C may be absent. Although all these combinations are theoretically valid and occur in special contexts, we found that almost all sentences in our corpus used only the following alternatives: SVOC, SVCO, SOVC, SVO, SVC, SOV, VSO, VS – and their variants without the subject S, e.g. Achei o livro chato (VOC). Moreover, some of these orders are possible only under certain constraints; for instance, SVCO is not allowed if the complement C is a noun: *o presidente nomeou ministro Pedro.

Noun Phrases. The typical noun phrase has the structure M* D Q* H Q* M*. All parts may be omitted (under some conditions), except the head word H, which is either a noun, adjective, prepositive, or pronoun. The qualifiers Q may be adjectives or (after the head) prepositives. The determiner D may be an article or demonstrative pronoun; the modifiers M may be adverbs or prepositives. Cf. the example [somente]M [o]D [maior]Q [livro]H [de exercícios]Q [verde]Q [que comprei]Q. A noun phrase may also be formed from a subordinate clause, in a number of ways, e.g. [dançar samba] é bom, vi [os cavalos bebendo água]. A major source of structural ambiguity is the fact that a single adverb, adjective, or prepositive may often be parsed as a constituent of several categories at several levels. The noun phrase a caixa de madeira sem a alça da tampa admits many different tree structures besides the semantically correct one [[a]D [caixa]H [de madeira]Q [sem [[a]D [alça]H [da tampa]Q]]Q].

Adjectives, Adverbs, and Prepositives. An adjective is either a single word (the headword, H), or one of several constructs with subordinate clauses, such as que+clause (que Maria comprou) or cujo+noun+clause (cujo carro Maria comprou). A prepositive consists of a preposition (P) followed by a noun (the body B) – which may be a subordinate clause, e.g. máquina
[de [fazer macarrão]], viaje [sem [sair de casa]]. Non-clausal adjectives and prepositives may be modified by adverbials (M), as in bem branco or muito de confiança. The subordinate clause of que-adjectives must be an incomplete clause (see below), as in o carro [que [Maria comprou []]] and as pessoas [que [eu disse que [] saíram]]. The second example, where pessoas must agree in person and number with saíram, shows that markers of clause incompleteness and noun type must sometimes be propagated through several levels of the parse tree.

Adverbs can be either a single word or one of several subordinate constructions, such as quando eu nasci, se não chover; or arbitrary phrases isolated from the sentence by parentheses, dashes, or paired commas – which include vocatives, expletives, etc. Adverbs too can be modified by adverbs and prepositives. Many simple adjectives, such as rápido, can also be classified as adverbs, and there seems to be no reliable syntactic criterion to distinguish a prepositive used as an adjective (qualifying nouns) from one used as an adverb, except, occasionally, by agreement constraints or similar contextual information.

Indirect Object. We found it impossible to distinguish between a clause modifier and an "indirect object" (except a pronominal one). The distinction traditionally depends on the verb being tagged in the lexicon as "indirect transitive." However, this approach fails too often to be of much use. In fact, there are examples which are inherently ambiguous, such as O menino gosta de verdade – de verdade may be either what the boy likes, or how much the boy likes it. We also did not find it helpful to mark verbs with their "usual" prepositions (regência), since it seems that any preposition can be used to modify any verb. We were forced to conclude that the concept of "indirect object" is largely a matter of semantics, not syntax. Therefore, we parse prepositives like de verdade, and the verb-attached weak "indirect" pronouns like lhe and se, as clause modifiers.

Compound Verbs. The traditional parsing of clauses like ele vai fazer isso (or ela tinha feito isso) classifies vai fazer (resp. tinha feito) as a "compound verb", and labels isso as the object of the top-level clause. We found this approach problematic in view of coordinations like ele vai fazer isto e pensar naquilo, or inserted terms like ele vai apenas me encontrar. Therefore, we chose to parse the main verb in such constructions, together with any direct object and modifiers, as a subordinate noun phrase which is the direct object of the auxiliary verb. Under this view, we must interpret the auxiliary ir as a special sense of the verb, which is transitive – the object being the action that is going to happen. (However, we still have special treatment for clauses where a weak pronoun – direct object or clause modifier – belonging to the main verb is displaced in front of the auxiliary, as in ele me quer ver.)
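Before turning to coordination, the order restrictions stated earlier in this section can be made concrete as a small table-driven filter. The attested-order list below is copied from the text; everything else (function names, the derived subject-less variants) is our illustrative assumption, not Selva's implementation.

    # Filter clause skeletons by the attested constituent orders.
    ATTESTED = {"SVOC", "SVCO", "SOVC", "SVO", "SVC", "SOV", "VSO", "VS",
                # subject-less variants, e.g. "Achei o livro chato" (VOC):
                "VOC", "VCO", "OVC", "VO", "VC", "OV", "V"}

    def plausible(skeleton, c_is_noun=False):
        """Reject unattested orders, plus SVCO with a nominal complement
        (cf. *o presidente nomeou ministro Pedro)."""
        if skeleton not in ATTESTED:
            return False
        if skeleton == "SVCO" and c_is_noun:
            return False
        return True

    print(plausible("SVOC"))                  # True
    print(plausible("SVCO", c_is_noun=True))  # False
    print(plausible("OSV"))                   # False -- untopicalized OSV excluded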
3.2 Coordination
Coordination is a major difficulty in syntactic analysis. In typical text, besides coordination of whole clauses or major phrase categories (nouns,
adjectives, verbs, and adverbs), one finds problematic examples such as Tenho camisas com e sem gola, Quero este mas não aquele livro. Coordination may also occur between multi-phrase sequences, as in ele está proibido de entrar em ou sair de casa, Maria comprou o carro e vendeu a casa simultaneamente, Ele nomeou Pedro diretor e Maria assistente, etc.

A popular solution is to view such constructs as coordination between clauses, with many elided parts: for instance, Maria compra e vende imóveis is parsed as [Maria compra []] e [[] vende imóveis], where [] stands for an omitted constituent. However, this interpretation is problematic because it requires forward anaphoric references (cataphora) and the elision of parts which should not be elided, and it also breaks the semantics of adverbials like mutuamente and simultaneamente. Therefore, coordination must be handled as a general phenomenon that can operate on two or more phrases of almost any category X, to produce a phrase of the same category X [8]; or even on groups of phrases which do not form a single syntactic unit, e.g. between two subject-verb pairs as in João vendeu e Maria comprou o carro.
4 The Pre-processor
Before each input clause is submitted to the parser, it goes through a pre-processor, which breaks the text into words, obtains the word categories (from a large dictionary, which can be supplemented by the user, and from some simple heuristics for numbers and proper names) and finally turns it into a list of Prolog clauses. Contractions like dele and lhos are flagged as such in the main dictionary, and are expanded by the pre-processor. Since some contractions are ambiguous, the result is no longer a sequence of tagged words but rather a branching directed graph. For instance, the clause vamos nos encontrar becomes:

    •0 → vamos → •1 → nos → •3 → encontrar → •4
                 •1 → em → •2 → os → •3
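Such a lattice is easy to build programmatically. The following Python sketch only illustrates the idea; the real pre-processor emits Prolog clauses, and its contraction dictionary is far larger.

    # Build a word lattice in which an ambiguous contraction contributes
    # two paths: the surface form and its expansion. Edges are triples
    # (from_node, to_node, word). Illustrative only.
    CONTRACTIONS = {"nos": ["em", "os"]}   # "nos" = pronoun, or "em"+"os"

    def lattice(tokens):
        edges, node = [], 0
        for tok in tokens:
            if tok in CONTRACTIONS:
                exp = CONTRACTIONS[tok]
                mid, end = node + 1, node + 2
                edges.append((node, end, tok))      # unexpanded reading
                edges.append((node, mid, exp[0]))   # expanded reading
                edges.append((mid, end, exp[1]))
                node = end
            else:
                edges.append((node, node + 1, tok))
                node += 1
        return edges

    for e in lattice(["vamos", "nos", "encontrar"]):
        print(e)
    # (0,1,'vamos'), (1,3,'nos'), (1,2,'em'), (2,3,'os'), (3,4,'encontrar')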
5 The Parser
The Selva parser is implemented in Prolog [3]. Rather than using the DCG parsing facilities built into Prolog, we use a separate translator to map the source grammar into plain Prolog rules. The translator adds an interval constraint to each term of each rule, and reorders the terms so as to avoid infinite recursion. Even though the parser depends on Prolog's top-down search with backtracking, the extra parameters and rule reordering give it some of the robustness and efficiency expected from bottom-up parsers. A typical rule in the source grammar

    sentence(MOOD, ...) →
        subject(PERSON, _, NUMBER),
        *verb(MOOD, PERSON, NUMBER),
        object(_, _, _).                                  (1)
gets translated into the following Prolog code:

    sentence(NI, NS, INF, SUP, MOOD, ..., T) :-
        verb(N1, N2, INF, SUP, MOOD, PERSON, NUMBER, T2),
        subject(NI, N1, INF, SUP, PERSON, _, NUMBER, T1),
        object(N2, NS, INF, SUP, _, _, _, T3),
        buildtree('sentence_1', T, [T1, T2, T3]).

Each predicate matches directed paths in the input graph with certain properties. The added parameters INF and SUP will be instantiated with node numbers, and specify lower and upper bounds for the nodes in the matched path. Parameters NI and NS specify the actual initial and final node numbers of the path, and are elsewhere required to satisfy INF ≤ NI ≤ NS ≤ SUP. The predicate sentence is satisfied by every path in the input graph that begins at node NI, ends at node NS, and consists of three sub-paths matched by subject, verb, and object, in that order. If the match succeeds, the predicate buildtree defines T as a tree node, whose label 'sentence_1' identifies the syntax rule, and whose subtrees are the parse trees T1, T2, T3 of the constituent phrases.

Note that the translator moved the verb term – the starred item in (1) – to the beginning of the Prolog rule. (The proper order of the sentence's constituents is still ensured by the NI/NS arguments.) This feature was introduced to avoid infinite loops in syntax rules that begin with recursive optional terms – such as subject, which may be elided, and may be a subordinate clause beginning with a subject. Each recursive attempt to match a subject will have to go through the verb term, which cannot be elided and therefore will consume at least one word. Therefore the subject's NI and N1 arguments will be constrained to a strictly smaller range of indices at each recursive call.
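The effect of the interval constraints can be illustrated outside Prolog as well. The sketch below is a schematic Python rendering of the idea, not Selva code: the obligatory verb is matched first, and any (possibly recursive) subject search is confined to the strictly smaller interval before it, so the recursion is well-founded.

    # Schematic version of the interval-constraint trick: a constituent
    # may only be sought inside [inf, sup), and matching the obligatory
    # verb first shrinks the range left for subject attempts.
    def parse_sentence(words, inf, sup):
        results = []
        for v in range(inf, sup):           # position of the obligatory verb
            if words[v] != "V":
                continue
            # the subject must fit strictly before the verb: [inf, v)
            for subj in parse_subject(words, inf, v):
                results.append(("sentence", subj, ("verb", v)))
        return results

    def parse_subject(words, inf, sup):
        # an elided subject is allowed; any recursive attempt would now
        # work on a smaller interval, so it cannot loop forever
        yield ("subject", "elided") if inf == sup else ("subject", (inf, sup))

    print(parse_sentence(["S", "V"], 0, 2))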
6 Evaluation and Comparison
In order to evaluate Selva's performance on real-world texts, we created a small corpus of 80 inputs, by taking the first sentence from the second paragraph of newspaper and magazine articles. The entries averaged 14.7 words each (minimum 5, maximum 37); most of them used subordination, and many had nontrivial coordinations. These clauses were run through a preliminary version of our parser and grammar. In 28 cases, the parser failed to terminate, or did not find the correct parsing (the structure that a human parser would give, using all syntactic and semantic information available). For the remaining 52 cases where the correct structure was found, it produced 51.4 parsings per sentence, on average (minimum 2, maximum 480). Since this test was performed, we have introduced the interval-constraint mechanism and re-written large parts of the grammar. Therefore, the above results should be viewed only as an indication that the basic premise – that it is feasible to generate all syntactically valid parsings – is quite realistic.

For comparison, we ran the same 80 clauses through the VISL parser. In almost all cases, the single derivation tree that was returned was at least syntactically possible. However, in about half of the cases, it did not match the correct
tree, as it would be defined by a human parser – typically because of incorrect guesses about the nesting of prepositional phrases and subordinate clauses.

We also tried our corpus on the Curupira parser (version 1.0). That version seems configured to return only the first 4 parse trees found, in the order implied by the ordering of the rules in the grammar. According to the few tests we have run so far, its coverage seems to be still incomplete. However, the parser is still under development, and we expect its performance to improve considerably in future releases.
7 Future Work
We plan to continue to improve the grammar in light of systematic tests. In particular, we plan to tune it, by adding or excluding rules, so as to minimize the number of spurious derivation trees returned while improving the success rate. We also intend to use a more compact representation for the output, namely a single tree with OR nodes to encode the multiple choices allowed for each syntactic node. Such an encoding would allow us to generate and represent exponentially many parse trees for any n-word sentence, at polynomial cost.

Acknowledgments. We wish to thank Graça V. Nunes and the team at NILC – Núcleo Interdepartamental de Lingüística Computacional of the University of São Paulo, in São Carlos – for kindly providing us a pre-release of the Curupira parser and allowing us to use the ReGra tagged dictionary. This work was supported in part by CAPES and CNPq.
References

1. E. Bick. The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. PhD thesis, Aarhus University, 2000.
2. P. Cipro Neto. Gramática da Língua Portuguesa. Scipione, 1997.
3. W.F. Clocksin and C.S. Mellish. Programming in Prolog. Springer, 1994.
4. R.T. Martins, R. Hasegawa, and M.G.V. Nunes. CURUPIRA: Um parser funcional para a língua portuguesa. Technical Report NILC-TR-02-26, NILC-ICMC, Universidade de São Paulo, December 2002.
5. R.T. Martins, R. Hasegawa, M.G.V. Nunes, G. Montilha, and O.N. Oliveira Jr. Linguistic issues in the development of ReGra: A grammar checker for Brazilian Portuguese. Natural Language Engineering, 4(4):287–307, December 1998.
6. R.M. Mesquita. Gramática da Língua Portuguesa. Saraiva, São Paulo, 1999.
7. A.M. Perini. Gramática Descritiva do Português. Ática, 1996.
8. R. Quirk, S. Greenbaum, G. Leech, and J. Svartvik. A Comprehensive Grammar of the English Language. Longman, 1985.
9. L.A. Sacconi. Nossa Gramática. Atual Editora, São Paulo, 1984.
10. D. Santos. Um centro de recursos para o processamento computacional do português. Datagrama Zero – Revista de Ciência da Informação, 3(1), February 2002.
An Account of the Challenge of Tagging a Reference Corpus for Brazilian Portuguese*

Sandra Aluísio¹,², Jorge Pelizzoni², Ana Raquel Marchi², Lucélia de Oliveira², Regiana Manenti², and Vanessa Marquiafável²

¹ ICMC – DCCE, University of São Paulo, CP 668, 13560-970 São Carlos, SP, Brazil
{sandra,jorgemp}@icmc.usp.br
² Núcleo Interinstitucional de Lingüística Computacional (NILC), ICMC-USP, CP 668,
13560-970 São Carlos, SP, Brazil
{raquel,lucelia,regiana,vanessam}@nilc.icmc.usp.br
Abstract. This article identifies and addresses the major linguistic/conceptual, as opposed to logistic, issues faced in the morphosyntactic tagging of MAC-Morpho, a 1.1 million word Brazilian Portuguese corpus of newspaper articles that has been developed in the Lacio-Web Project. Rather than simply presenting the annotated corpus and describing its tagset, we elaborate on the criteria for establishing the tagset and analyze some interesting cases amongst the linguistic problems we faced in this work.
1 Introduction

Annotated reference corpora, such as Susanne, the Penn Treebank or the BNC, have helped both the development of English computational linguistics tools and English corpus linguistics. Manually-annotated corpora with part-of-speech (POS) and syntactic annotation are costly, but allow one to build and improve sizeable linguistic resources, such as lexicons or grammars, and also to develop and evaluate most computational analyzers. Usually, such treebank projects follow the Penn Treebank (http://www.ldc.upenn.edu/Catalog/docs/treebank2/cl93.html) approach, which distinguishes a POS tagging and a parsing phase, each comprising an automatic annotation step followed by manual revision. Recently, there have been several efforts to build gold standard annotated corpora for languages other than English, such as French, German, Italian, Spanish and Slavic languages (http://treebank.linguist.jussieu.fr). For Brazilian Portuguese (BP), however, the picture is not so bright. With regard to manual morphosyntactic annotation, to the best of our knowledge, there are only two small Brazilian corpora which were used to train statistical taggers: (i) the 20,982-word Radiobras Corpus [1, 2], and (ii) the 104,966-word corpus built from NILC's corrected text base spanning 3 genres (news, literature and textbooks) [3]. There are, however, several (Brazilian and European) Portuguese corpora automatically annotated by Bick's [4] syntactic parser PALAVRAS (http://visl.hum.sdu.dk), which are part of the AC/DC project (http://www.linguateca.pt).
* This project is partially funded by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq). We are grateful to E. Bick for parsing MAC-Morpho.
In order to make freely available both corpora and computational linguistic tools which learn from raw and annotated corpora, such as POS taggers, parsers and term extractors, we have started the Lacio-Web project. Lacio-Web (LW), a two-year project launched at the beginning of 2002, tries to fill the gap with regard to linguistic resources and tools for BP. In this paper we present the rationale for building a 1.1 million-word corpus with manually validated morphosyntactic annotation (the results of the inter-annotator agreement evaluation and further logistic/historical detail have been published in [5]), including the criteria for establishing the tagset (Section 2), some linguistic problems we faced in this work (Section 3) and directions for further work (Section 4). This corpus was taken from a text collection from Folha de São Paulo (http://www.folha.uol.com.br/folha), which gives us high-quality contemporary Brazilian Portuguese from different authors and domains. The resulting annotated corpus (named MAC-Morpho) will be available in two versions: in annotators' format (one word per line followed by its tag) and in the XML-compliant format proposed by EAGLES [6] (www.cs.vassar.edu/XCES).
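For concreteness, the annotators' format can be consumed with a few lines of code. In the sketch below, the underscore separator between word and tag is an assumption made for illustration; the paper itself does not fix the separator.

    # Read the "one word per line followed by its tag" annotators' format.
    # The "_" separator and the file layout are our assumptions.
    def read_annotated(path):
        pairs = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                # rpartition keeps "=" compounds and inner "_" intact
                word, _, tag = line.rpartition("_")
                pairs.append((word, tag))
        return pairs

    # e.g. a file containing "devido=a_PREP" yields [("devido=a", "PREP")]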
2 Designing the Tagset

We analyzed the EAGLES recommendations for the Morphosyntactic Annotation of Corpora (http://www.ilc.pi.cnr.it), two of the more important tagsets designed for English (the 36-tag Penn Treebank Tagset and the BNC project's¹ 61-tag C5 and 140-tag C7), and three other tagsets for Portuguese (NILC², PALAVRAS and Tycho Brahe [7], with 36, 14 and 48 tags respectively). Although there are already two tagsets for Portuguese (PALAVRAS and NILC) whose purpose is similar to ours, neither fulfills all the criteria we consider essential to our project. These criteria have been employed by and large by the Penn Treebank and Tycho Brahe projects. Even though the latter project also tackles Portuguese, it has been specifically designed to support diachronic research and, perhaps due to this, ends up with a conceptually different tagset from ours.
2.1 Criteria, Features, and Previous Work
Recoverability. Exploiting recoverability refers to avoiding tagging (morphological) details that can otherwise be easily recovered by querying a lexicon on the basis of the word and its tag alone. For example, the decision of having a unified “article” tag – instead of two or more, such as “definite/indefinite singular masculine article” – takes advantage of the automatic recoverability of any further features of interest, provided articles are not ambiguous with each other. This criterion ultimately leads to minimal tagsets with the sole purpose of disambiguation, i.e., a tagset suffices as long as every possible pair (word, tag) resolves to at most one single lexical entry (whatever an entry may be) or set of morphologically equivalent entries. NILC Tagset fails to exploit, for instance, the recoverability of the traditional Portuguese pronoun classes,
¹ http://www.hcu.ox.ac.uk/BNC/what/ucrel.html
² http://www.nilc.icmc.usp.br/nilc/TagSet/ManualEtiquetagem.htm
ending up with 10 distinct pronoun tags. Were we to satisfy recoverability alone, 2 simple tags ("relative" and "non-relative pronoun") would have exactly the same effect.

Syntactic Function (and Actuality). Notwithstanding, recoverability and its related morphological disambiguation efficiency are not enough, since we strictly understand that the ideal tagset should be optimal for supporting a subsequent full syntactic parsing step. In other words, it should entail as much syntactic inference as possible while not requiring its tagger to be a full-fledged parser, paradoxical though it may seem. Thus, recoverability is but a lower-bound measure, ever second to syntactic function, an eminently tag-multiplying factor. The referred paradox is not trivial, and the pitfall of reaching a fully syntactic, or simply overcrowded, tagset may seem unavoidable at first sight. Fortunately, we believe we managed to develop a twofold, sound compromising criterion, namely:

• intra-word syntactic Distinctness preservation (or D-preservation): any two syntactically distinct occurrences of a word should never receive the same tag;
• inter-word syntactic Likeness preservation (or L-preservation): reciprocally, any two syntactically equal occurrences of different words should receive the same tag as long as morphological recoverability is left unharmed.

The application of D-preservation to our former two-tag treatment of pronouns ("relative" vs. "non-relative") leads to LW Tagset's five pronoun tags, namely PROPESS (personal pronoun, of whatever grammatical case), PRO-KS-REL (relative subordinating pronoun), PRO-KS (non-relative subordinating pronoun, introducing noun clauses, such as "who" in "Please identify who the murderer is."), PROSUB (non-subordinating, non-personal pronoun as a nucleus, such as "who/this" in "Who/This is the murderer?") and PROADJ (non-subordinating, non-personal pronoun as a modifier, such as "this" in "This man is the murderer."). In these examples, and in accordance with the stated criterion, two syntactically distinct occurrences of "who/this" receive accordingly distinct tags. It is worth noticing that, by properly exploiting recoverability and syntactic encoding, our five-tag treatment of pronouns is more informative than that of NILC Tagset, despite the latter having twice as many pronoun tags.

Incidentally, syntactic function implies syntactic actuality, i.e., tags should clearly reflect the syntactic function of words in the clauses and phrases they belong to, which sometimes means departing from traditional (usually untenable) treatment. One such example is the introduction of the tag ADV-KS-REL (relative subordinating adverb) to account for relative "(P) onde // (En) where", "quando // when" and "como // how" (the latter is never relative in English, but arguably so in Portuguese), traditionally regarded as pronouns. That is not an unheard-of position, since PALAVRAS also treats these words as adverbs. But maybe a bit too eagerly: according to its POS tagset, e.g. "quando // when" is always an adverb, whereas we understand it may fall into four categories, namely KS (subordinating conjunction, in adverbial clauses), ADV-KS-REL (relative subordinating adverb, in relative clauses), ADV-KS (non-relative subordinating adverb, e.g. in indirect interrogative sentences) and plain ADV (non-subordinating adverb, e.g. in direct interrogative sentences).
To do PALAVRAS justice, however, we should notice that it is a parsing system, not a POS tagger, and its performance seems not at all hindered by such simplifications, exactly because (i) it is not based on the more common tagger-parser pipeline architecture and (ii) it avails itself of a host of secondary morphosyntactic tags.
The application of L-preservation is exemplified while discussing the immediately following criteria.

Consistency and Indeterminacy. A tagset is worth nothing if it does not provide for consistency, i.e. if its users (not only corpus annotators) are not likely to agree (including with themselves!) on how and when to use each tag. Even if we employed one single all-consistent, all-efficient annotator, users must be able to evaluate, understand and ultimately replicate their work. The pursuit of consistency is paramount, even if to the detriment of other requirements. In particular, consistency is not usually very partial to refinement, which here means syntactic or morphological detail. One such example is the contrast between past participles in adjectival position (e.g. "(P) a casa pintada // (En) the house (that has been) painted") and adjectives proper zero-derived from past participles (e.g. "(PBr) uma moça muito falada // (En) a young woman very much gossiped about")³, whose annotation was intended by the Lacio-Web team at first, but eventually had to be abandoned due to low inter-annotator consistency. The solution here was to resort to indeterminacy, introducing the (indeterminate) PCP tag, standing for "past participle or adjective zero-derived therefrom". Indeterminate tags are created by collapsing inconsistency-mongering tags, thus leading to smaller tagsets.

Nonetheless, it is not always the case that indeterminate tags are the best solution for inconsistency problems. Sometimes, sound application of other criteria might come to one's rescue. One everlasting source of debate and inconsistency in Portuguese has been the contrast between nouns and adjectives. Unlike their English counterparts, most Portuguese nouns and adjectives can be used interchangeably, making it hard to determine the actual morphological specification of these words and whether nominalization is really taking place, so used to this operation are we native speakers. By simply prioritizing syntactic function, or rather, by upholding L-preservation, we were able to circumvent this delicate problem, the result being thus: every open-/closed-class occurrence happening to be the nucleus of a noun phrase is tagged N/PROSUB; and every open-/closed-class occurrence happening to modify a noun, ADJ/PROADJ or ART (article, whether definite or not). Even the words traditionally called "numerals" usually fall into either N or ADJ, again according to the syntactic function of each occurrence. Only cardinal numerals and all inflections of the word "(P) meio // (En) half" may receive the tag NUM (numeral), and do so only when occurring as noun modifiers, due to their remarkably distinct syntactic behavior in such cases. Therefore, those "numerals" never happening to be real noun modifiers (e.g. "bilhão/milhão // billion/million", "dezena // ten", "terço // third", "quarto // quarter") will never be tagged NUM.

Learnability. Finally, we cannot fail to mention that a most limiting factor on how syntactic LW Tagset could get was, at all times, the assumption of a machine learning technology to apply to (a version of) the annotated corpus, namely that usual in POS taggers, blind to all but a very few words contiguously surrounding the current target word. Therefore, it seemed only fair to avoid all refinement that was really not likely to be learnt, such as NILC Tagset's annotation of verb transitivity.
³ Notice that, unlike English "gossiped", Portuguese "falada" cannot be accounted for by productive passive voice processes. That is exactly why the latter is regarded as a zero-derived adjective proper.
It is worth noticing at this point that it has never been our aim to deliver a ready-to-use training corpus, but rather one providing for (i) rapid (i.e. automatic) deployment of variously tagged (e.g. for various levels of refinement) training versions of itself and thus (ii) extensive and comprehensive experimentation. Just by way of illustration of how not ready to use our corpus is, it should suffice to mention that some of its tokens are actually groupings of contiguous tokens in the original, resulting in what we call "compounds" (morphosyntactic units made up of two or more words, such as "(P) devido=a // (En) due=to"), which are tagged regularly as if they were but one single word. Rather more training-friendly, in contrast, NILC Tagset also employs multiword morphosyntactic units, but tags each of their tokens separately with the same tag. Naturally, contiguous multiword units having the same tag will pose a segmentation problem to NILC Tagset's users.
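The recoverability criterion of Section 2.1 can also be made concrete with a toy lexicon: the coarse tag plus the word suffice to restore the elided morphological features. The entries below are our own illustrations, not taken from the LW lexicon.

    # Toy illustration of recoverability: the unified ART tag drops
    # gender, number and definiteness, but (word, ART) recovers them,
    # because articles are not ambiguous with each other.
    LEXICON = {
        ("as", "ART"):  {"definite": True,  "gender": "fem", "number": "plu"},
        ("uma", "ART"): {"definite": False, "gender": "fem", "number": "sin"},
    }

    def recover(word, tag):
        """Restore the morphological features elided by the coarse tag."""
        return LEXICON.get((word.lower(), tag))

    print(recover("as", "ART"))
    # {'definite': True, 'gender': 'fem', 'number': 'plu'}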
2.2 The Current Tagset
Since the beginning of its development, in July of 2002, LW Tagset (Tables 1 and 2) has undergone cyclic revisions, being currently in its ninth version.

Table 1. Regular tags

Tag          Definition
ADJ          open-class noun modifier
ADV-KS-REL   relative subordinating adverb
ADV-KS       non-relative subordinating adverb
ADV          non-subordinating adverb
ART          article
KC           coordinating conjunction
KS           subordinating conjunction
IN           interjection
N            open-class noun phrase nucleus
NPROP        proper noun
NUM          numeral as a noun modifier
PCP          past participle or adjective
PDEN         emphasis/focus
PREP         preposition
PROPESS      personal pronoun
PRO-KS-REL   relative subordinating pronoun
PRO-KS       non-relative subordinating pronoun
PROSUB       non-subordinating pronoun as a noun phrase nucleus
PROADJ       non-subordinating pronoun as a modifier
VAUX         auxiliary verb
V            non-auxiliary verb
CUR          currency symbol

Table 2. Complementary tags

Compl. Tag       Definition
|EST             foreign
|AP              apposition
|+               contraction/enclitic
|!               mesoclitic
|[ , |... , |]   beginning, middle part, and end of discontinuous compound (further discussed in Section 3)
|TEL             phone number
|DAT             date
|HOR             time
|DAD             formatted data not falling into the above categories
At present it comprises 22 regular POS tags along with nine orthogonal complementary tags. The latter are thus called because they add to the information of the POS tags, to which they are optionally appended by means of the “|” symbol.
3 Some Emblematic Linguistic Challenges

NPROP – Proper Noun. In most respects, proper nouns are but nouns, especially in the relation they bear to noun phrases. What sets them apart is the prerogative to refer to one single entity of the real world; if X is a proper noun, X might even be shared by more than one entity (e.g. homonymous people), but that sharing would imply no common properties whatsoever among the sharers. Consequently, we should tag NPROP all those words that would otherwise be tagged N but happen to have strictly unitary extensions/indeterminate intensions. Such is our criterion for identifying proper nouns, which, clear though it may seem, makes plenty of room for inconsistency. Problematic cases usually fall into the following categories:

• motivated NPROPs, or rather, those obtained by zero-derivation, e.g. "(PBr) Nordeste (Brazilian geopolitical unit) // (En) the Northeast", "Congresso // the Congress";
• metonymical NPROPs, e.g. "(PBr) gillette // (En) (brand of) razor blade", "band-aid", "danone // (brand of) yogurt", "fusca // a specific make of car or car of this make";
• NPROPs with context-dependent cardinality extensions, e.g. "(P) sol // (En) sun", "lua // moon" (cf. "A lua está bonita! // The moon is beautiful!" and "Quantas luas tem Júpiter? // How many moons does Jupiter have?"), "Congresso // Congress";
• NPROPs with apparently (and arguably) unitary extensions, e.g. "(P) xadrez // (En) chess", "HIV", "gripe // flu".

Compounds. The treatment of groups of words as morphosyntactic units (resulting in compounds, marked by replacing the spaces between their elements with the "=" symbol) is at once imperative and dangerous. It is imperative because, otherwise, how could one tag e.g. "apesar/acerca/cerca" apart from the preposition "de", as in "apesar/acerca/cerca de"? It is also dangerous because it is always difficult to establish clear criteria to decide whether to treat a given group as a compound. We chose the following ones:

• non-analyzability, which has already been implied, applying to "(P) apesar=de // (En) in=spite=of", "devido=a // due=to" and suchlike, and sanctioning compounds (i) whose part-wise tagging is impossible or much too artificial, generating syntactically exceptional sequences of tags, or (ii) whose (semantic) value seems not to be computable from the individual values of their elements;
• trade-off, recommending e.g. the consideration of many compound prepositions ("(P) antes=de // (En) prior=to", "depois=de // after", "perto=de // close=to", "longe=de // away=from", etc.) which could even be tagged as pairs of adverb plus preposition (the latter introducing a complement of the corresponding adverb). However, we believe the latter possibility imposes an unnecessary cost on a subsequent syntactic analysis, since these are highly co-occurring items, expressing basic semantic relations (of time/space, among others) and generally behaving like any other one-word preposition;
• non-productivity, strongly correlating with non-analyzability and avoiding groups that, in fact, contain a currently productive syntactic-semantic structure, or rather, that are actually open-class. This criterion, for example, sanctions
"(P) a=cavalo // (En) on=horseback" and "a=pé // on=foot" while banning "de carro/ônibus/trem/etc. // by car/bus/train/etc." As one can see, our criteria are tenable, though a bit fuzzy, resulting in some of our highest inter-annotator inconsistency rates [5], in spite of some consistency-assurance devices we have devised (such as a central repository of compounds and candidates thereof). It is worth noticing that nearly half the inconsistency is related to the creation of compound proper nouns, which is small wonder if one considers (i) how frequent proper nouns are in journalistic texts and (ii) how difficult it is to determine how many proper nouns (only one, or more) should be found in e.g. the following phrases: "(P) Departamento de Computação do Instituto Tecnológico da Aeronáutica // (En) Department of Computation of the Airforce Technology Institute"; "Safári do Quênia // Kenya Safari"; "GP da Austrália de F1 // Australia's Formula One Grand Prix"; "o SESC de São Carlos // São Carlos SESC".

Discontinuity. One important, perhaps novel feature of LW Tagset is the possibility of expressing discontinuity of morphosyntactic units, or rather, of handling discontinuous occurrences of compounds, whether occasionally or necessarily so. That is realised by means of the complementary tags "|[", "|..." and "|]" (respectively denoting the beginning, inner part and end of a discontinuous unit) and seemed to be a good solution for two serious problems, namely:

• "o mais ADJ/ADV possível": in Portuguese, structures like "(P) o(a) mais rápido(a) possível // (En) as soon as possible", "o mais eficiente(s) possível // as efficient as possible", "o mais à vontade possível // as at one's ease as possible" are hardly susceptible, if at all, to analysis on a word-by-word basis (it is vital to notice that both "o" and "possível" are invariable, while the inner adjectives are not). Even if we were to group "o mais" into a compound, how should we tag it and "possível" as independent entities? It seemed all the more appropriate to treat the whole "o=mais=possível" as a compound adverb and enable compound discontinuity. Hence the problematic structure can now be tagged thus: "o=mais_ADV|[ ADJ/ADV possível_ADV|]";
• Compound Disruption: perfectly eligible compounds sometimes have their usual continuity disrupted by extraneous elements inserted for emphasis or to prevent repetition of terms. Take e.g. the compounds "(P) apesar/antes=de_PREP // (En) in=spite=of/prior=to". They may well happen to occur as "apesar/antes até mesmo de // even in spite of/prior to", which can now be tagged thus: "apesar/antes_PREP|[ até=mesmo_PDEN de_PREP|]". One interesting example coming from our corpus is the following: "(P) ...atingem níveis internacionais devido tanto à valorização interna quanto à valorização... // (En) ...reach international levels due not only to internal valorization but also to..." tagged thus: "...atingem níveis internacionais devido_PREP|[ tanto_KC|[ a_PREP|]|+ a_ART valorização interna quanto_KC|] a_PREP|]|+ a_ART valorização... // ...reach international levels due_PREP|[ not=only_KC|[ to_PREP|] internal valorization but=also_KC|] to_PREP|]...".

It is worth noticing that this device seems quite suitable to represent diverse binary coordinating structures ("(P) tanto ... quanto / não só ... mas também // (En) not only ... but also", "nem/já/ora ... nem/já/ora // either ... or / now ... now", among others).
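A downstream consumer can stitch such discontinuous units back together by scanning for the |[ and |] complementary tags. The following sketch is our own illustration; it assumes (word, tag) pairs as input and handles one open unit at a time.

    # Re-join a discontinuous compound marked with the |[ , |... , |]
    # tags, e.g.: apesar_PREP|[ até=mesmo_PDEN de_PREP|]  ->  apesar=de.
    def rejoin(tokens):
        """tokens: list of (word, tag) pairs; returns words, with the
        pieces of a discontinuous compound fused into one "=" token and
        the intruding material left in place."""
        out, pieces = [], None
        for word, tag in tokens:
            if tag.endswith("|["):
                pieces = [word]            # open a discontinuous unit
                out.append(None)           # placeholder for the fusion
            elif pieces is not None and (tag.endswith("|]") or tag.endswith("|...")):
                pieces.append(word)
                if tag.endswith("|]"):     # close and fuse the unit
                    out[out.index(None)] = "=".join(pieces)
                    pieces = None
            else:
                out.append(word)
        return out

    print(rejoin([("apesar", "PREP|["), ("até=mesmo", "PDEN"), ("de", "PREP|]")]))
    # ['apesar=de', 'até=mesmo']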
4 Current and Future Work

We have developed MAC-Morpho, a 1.1-million-word Brazilian Portuguese reference corpus, which shall be freely available on the Lacio-Web Project page (http://www.nilc.icmc.usp.br/nilc/projects/lacio-web.htm). The total cost of tagging this huge corpus (including research on tagsets and tagging projects, corpus creation, writing the tagset manual, annotators' training, converting from Bick's tagset to ours, weekly meetings with the annotators, and revision) was 11 months and 7 person-months of effort, 4 of them spent annotating the corpus. We ran two experiments to estimate inter-annotator agreement, which presented kappa values in the .81–1.00 interval, namely 0.944 and 0.955, showing almost perfect agreement. The next steps will be a finer-grained correction phase of MAC-Morpho, tackling the problems observed in the experiments, and a tagset evaluation following [8].
References

1. Marques, N.C., Lopes, J.G.P.: A Neural Network Approach to Portuguese Part-of-Speech Tagging. Anais do II Encontro para o Processamento Computacional de Português Escrito e Falado (1996) 1–9
2. Villavicencio, A., Viccari, R.M., Villavicencio, F.: Evaluating Part-of-Speech Taggers for the Portuguese Language. Anais do II Encontro para o Processamento Computacional de Português Escrito e Falado (1996) 159–167
3. Aires, R.V.X., Aluísio, S.M., Kuhn, D.C.S., Andreeta, M.L.B., Oliveira Jr., O.N.: Combining Multiple Classifiers to Improve Part of Speech Tagging: A Case Study for Brazilian Portuguese. Proceedings of SBIA'2000 (2000) 20–22
4. Bick, E.: The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Aarhus University Press, Aarhus (2000)
5. Aluísio, S. et al.: An Account of the Challenge of Tagging a Reference Corpus of Brazilian Portuguese. Technical Report 188, ICMC-USP (2003). Also available at http://www.nilc.icmc.usp.br/~lacio_web/
6. Macleod, C., Ide, N., Grishman, R.: The American National Corpus: Standardized Resources for American English. Proceedings of the Second Language Resources and Evaluation Conference (LREC) (2000) 831–836
7. Galves, C., Britto, H.: A Construção do Corpus Anotado do Português Histórico Tycho Brahe: O sistema de anotação morfológica. Proceedings of PROPOR 99 (1999) 81–92
8. Déjean, H.: How to Evaluate and Compare Tagsets? A Proposal. Proceedings of the Second Language Resources and Evaluation Conference (LREC) (2000). Also available at http://www.sfb441.uni-tuebingen.de/~dejean/
Multi-level NER for Portuguese in a CG Framework

Eckhard Bick
Institute of Language and Communication, Southern Denmark University
[email protected]
http://visl.sdu.dk
Abstract. This paper describes and evaluates a linguistically based NER system for Portuguese, based on lexico-semantic information, pattern matching and morphosyntactic, context-driven Constraint Grammar rules. Preliminary F-scores for cross-domain news texts, when distinguishing six different name types, were 91.85 (raw) and 93.6 (subtyping of ready-chunked proper nouns).
1 Introduction

Named entity recognition (NER) in running text is a complex task with a number of obvious applications: semantic corpus annotation, summarisation and text indexing, to name but a few. This work focuses on Portuguese, and strives to distinguish between 6 main name type categories by linguistic rather than statistical means.

1.1 Previous Work

In recent years, a number of different approaches have been carried out and evaluated by the NLP community. Thus, at the MUC-7 conference (1998), competing systems were tested on broadcast news. The best performing system (LTG, Mikheev et al. 1998) combined sgml-manipulating hand-crafted symbolic rules with a Hidden Markov Modeling (HMM) POS-tagger, name lists, partial probabilistic matching and semantic suffix classes, achieving an overall F-measure¹ of 93.39, with recall/precision rates of 95/97, 91/95 and 95/93 for person, organisation and location, respectively. HMM results in isolation ("learned-statistical") can be regarded as a kind of baseline against which more sophisticated systems should be measured. A well performing example is the Nymble system (Bikel et al. 1997), which achieved F-scores of 93 and 90 for English and Spanish news text, respectively. Also, results for English were shown to be fairly stable down to a training corpus size of 100,000 tokens, indicating the cost/performance efficiency of the approach. Another automatic learning method, based on maximum entropy training (MENE), is described by Borthwick et al. (1998). This system showed a clear increase in performance with growing training corpora, with in-domain F-scores of 89.17 for 100,000 tokens and 92.20 for 350,000
¹ Defined as 2 × Recall × Precision / (Recall + Precision)
tokens. The authors also stress the potential of hybrid systems, with a maximum F-score of 97.12 when feeding information from other MUC-7 systems into MENE. One possible weakness of trained systems is indicated by the fact that in MUC's cross-domain formal test, F-scores dropped to 84.22 and 92 for pure and hybrid MENE, respectively. Another interesting baseline is human annotators' F-score, which at MUC-7 was reported as 96.95 and 97.60 (Marsh & Perzanowski, 1998).

1.2 Methodological and Data Framework

In this paper, I shall present a mostly linguistic approach to NER, combining different lexical, pattern and rule based strategies in a multi-level Constraint Grammar (CG) framework. This approach has previously been successfully carried out for Danish (Bick, 2002) within the Scandinavian NER project Nomen Nescio. For Portuguese, the system builds on a pre-existing general Constraint Grammar parser (PALAVRAS, Bick 2000) with a high degree of robustness and a comparatively low percentage of errors (less than 1% for word class). Tag types are word based and cover part of speech, inflexion and syntactic function, as well as dependency markers. The language data used in this article are drawn from the CETEM Público news text corpus (Rocha & Santos, 2000), which has been grammatically annotated in a joint venture between the VISL project at Southern Denmark University and the Linguateca-AC/DC project at SINTEF, Norway (Santos & Bick, 2000).

1.3 Discussion of Name Categories

For this project, proper nouns are defined as upper case non-inflecting words or word chains with upper case in the first and last parts. Simplex names in lower case (e.g. pharmaceuticals) are treated as nouns, as are nouns with upper case initial in mid-sentence, though the latter may be marked as <prop> with a secondary tag for later filtering by corpus users. In agreement with the general Nomen Nescio strategy, 6 core categories were used (human, place, organisation, event, title, and brand/object):

[Table of the six name categories with examples; only the first row label, "Human: Personal Names", survived extraction.]
Buildings which can metaphorically offer, invite or earn are given a separate subcategory, institution.
2 System Architecture and Strategies The system treats NER as a distributed task, matching the progressive level architecture of the parser itself, applying different techniques at different levels.
2.1 Preprocessing

Besides more "ordinary" preprocessing tasks like sentence separation, this first module creates '='-linked name chains based on pattern matching (Edvard=Munch). Upper case is the obvious trigger (with some sentence-initial ambiguity), but certain lower case particles (da, di, von, ibn) are allowed in the chain, as are numericals in certain patterns (version numbers, car names). A particular problem is the recognition of in-name punctuation (initials, Sta., H.M.S., jr., & Co., d'Ávila, web-urls, e-mails). Though the preprocessor does check its chunking against full name entries in the lexicon, proper nouns are a very productive class, and heuristic patterns may lead to overchunking (diretor de marketing para a Europa da Logitech). Here, a second lexicon lookup checks for recognizable parts of a name chain candidate, and re-splits it. Palmer & Day (1997) compared the coverage of inter-corpus name vocabulary transfer in 6 languages and found the second highest transfer rate for NEs in Portuguese (61.3%), after Chinese (73.2%) and way above English (21.2%), suggesting the importance of a lexicon module and gazetteer lists in Portuguese NER.

2.2 The Name Type Predictor

Some frequent names receive a semantic type tag already from the lexicon-based morphological analyzer module (which otherwise handles lemmatizing, inflexion and derivation). However, most proper nouns have to be typed later. The name type predictor is a semi-heuristic module, which has its own lexicon (ca. 16,000 entries), enabling it to match parts of names, for instance recognizing person names by looking up Christian names as first parts. Similarly, Shell=Portuguesa is typed as an organisation name.
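The chaining heuristic can be pictured with a minimal sketch. The following is an illustration only, reduced to the cues mentioned above (the upper-case trigger plus a few lower-case particles); the real preprocessor additionally handles in-name punctuation, numericals, sentence-initial ambiguity and lexicon-driven re-splitting.

import re

PARTICLES = {"da", "di", "von", "ibn"}    # lower-case in-name particles
CAP = re.compile(r"^[A-ZÀ-Ý][\w.&'-]*$")  # rough upper-case-initial test

def chain_names(tokens):
    # Join name-chain candidates with '=' (Edvard Munch -> Edvard=Munch).
    out, chain = [], []
    for tok in tokens:
        if CAP.match(tok) or (chain and tok in PARTICLES):
            chain.append(tok)
        else:
            if chain:
                out.append("=".join(chain))
                chain = []
            out.append(tok)
    if chain:
        out.append("=".join(chain))
    return out

print(chain_names("o pintor Edvard Munch visitou a Europa".split()))
# -> ['o', 'pintor', 'Edvard=Munch', 'visitou', 'a', 'Europa']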
Of course, there may be interferences and contradictions between the patterns, so matchings are ordered, conditioned and iterated, and they do allow some ambiguity. Finally, the type predictor uses non-alphabetic characters, paired upper case, function words etc. to assign <non-hum> tags, preventing over-usage of this most common category in the CG-based part of the system.

2.3 The CG Modules

CG adds, selects or removes tags from words, i.e. performs word based annotation by mapping and resolving ambiguities. Rules use sentence-wide context and have access to word class, inflexion, verbal and nominal valency potential as well as, in the Portuguese system, semantic prototype information for nouns and some verbal selection patterns. The "ordinary" (pre-existing) morphological and syntactic CG levels consist of about 7000 rules. Though only a small part of these tackles proper nouns, it is much safer to contextually disambiguate, say, sentence-initial imperatives from heuristic proper nouns here than at the pattern matching stages. Of course, proper nouns can for their part also form valuable context for the disambiguation of other classes, and besides functioning as subjects and objects like other NPs, they can fill certain more specific syntactic slots:

@N< (valency governed nominal dependents): o presidente americano George Bush
@APP (identifying appositions): uma moradora do palácio, Júlia Duarte, ...
@N
Coordination Based Type Inference. Drawing on and matching syntactic tags from the syntactic CG module, the name type-mapper first establishes a secondary tag for "close/safe coordinators" (&KC-CLOSE), with one rule for each matched syntactic function, and then uses it for disambiguation:

REMOVE %non-h (0 %hum-all)
  (*-1 &KC-CLOSE BARRIER @NON->N
   LINK *1C %hum OR N-HUM BARRIER @NON-N<);
SELECT (
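To make the mechanics concrete, here is a toy rendering of how such SELECT/REMOVE rules operate on ambiguous readings. It illustrates the rule semantics only: the tags and the coordination test are simplified stand-ins, not the actual PALAVRAS rule compiler.

def apply_rule(cohorts, action, tag, test):
    # cohorts: list of (word, set_of_readings); a rule only fires on a
    # reading that is still ambiguous and whose context test succeeds.
    for i, (word, readings) in enumerate(cohorts):
        if tag in readings and len(readings) > 1 and test(i, cohorts):
            if action == "REMOVE":
                readings.discard(tag)
            elif action == "SELECT":
                readings.intersection_update({tag})
    return cohorts

# Coordination-inspired test: drop a <place> reading when the left
# conjunct across a coordinator is unambiguously <hum>.
cohorts = [("Júlia", {"<hum>"}), ("e", {"KC"}), ("Pedro", {"<hum>", "<place>"})]
apply_rule(cohorts, "REMOVE", "<place>",
           lambda i, c: i >= 2 and c[i - 2][1] == {"<hum>"})
print(cohorts)   # Pedro is left with the <hum> reading only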
3 Evaluation

Performance. Though the NER module of PALAVRAS is an unfinished project, a pilot evaluation study was performed on a 45,000-word chunk from the CETEM Público corpus, containing 2672 name chains (4764 tokens). In the light of the above mentioned difference between MUC F-scores for same-domain and cross-domain tests, it has to be stressed that the Público sample was domain-mixed ("politics", "culture", "social issues", "economy", "opinion" and sports). The table below indicates the relative weight of two different types of errors, one preprocessor- and PoS-based (2.2% chunking and proper name errors as such), the other name typing errors themselves (6%), which together make for an error rate of 8.2%. More than half of all name tokens had no lexicon entries, and, as might be expected, recall was lower for these names (89.1% as opposed to 94.9%). However, the 5% typing error rate even for lexicon-based names² indicates problems with ambiguity resolution, lexicon errors and a certain price for the fact that contextual rules are allowed to override the lexicon. Interestingly, few errors occur between subcategories of the same major category.

² One reason is that the name lexicon was in large parts compiled automatically from a variety of text sources, and manual checking so far has been incomplete at best.

Table 1. Público (jan. 2003), 45,099 words
                                         all PROP (5.9% of words),    heuristic PROP (no lexicon
                                         4764 tokens                  entry), 52.7%
                                         cases         percent        cases        percent
correct found (6 classes)                2453          91.8%          1255         89.1%
  of these, non-heuristic alone          1198          94.9%
wrong major class (6 classes)            168           6.3%           118          8.4%
wrong subclass (same major)              14            0.5%           14           1.0%
false positive PROP reading
  (incl. "overchunking")                 10+23=33      1.2%           5+14=19      1.3%
false negative (missing) PROP
  (incl. "underchunking")                13+14=27      1.0%           0+12=12      0.9%
all evaluated proper nouns³              2672
of these: not in lexicon                 1409
Recall and Precision. Since types were almost completely disambiguated, and since false positive and false negative chunking errors were of similar frequency, recall and precision were similar, 91.8% and 91.9%, respectively, resulting in an F-score of 91.85. For subtyping alone (of correctly chunked and prerecognized proper nouns), an F-score of 93.6 was measured. The table below shows distribution and performance for the 6 super-categories used.

Table 2. Distribution and performance per name type (row labels after "Person" were lost in extraction; they are restored here following the category order of Sect. 1.3)

Name type      distribution   Recall    Precision   F-Score
Person         35.5%          97.9%     87.7%       92.5
Place          22.6%          93.3%     95.4%       94.3
Organisation   34.5%          93.3%     96.9%       95.1
Event          2.2%           82.1%     96.5%       88.7
Title          5.4%           84.3%     84.3%       84.3
Brand/Object   1.1%           53.8%     60.9%       57.1
It is interesting that major categories outperformed minor ones, suggesting a systematic/heuristic rule bias towards the former. Especially
³ Ignoring 17 cases of garbled corpus input with upper case.
4 Conclusion

This paper has presented a linguistic approach to NER, showing how lexical, pattern and rule based name typing tools can be integrated into a multi-level CG system for Portuguese. Though still immature, the name recognizer module has demonstrated promising results on mixed-domain news texts, with an overall F-score of 91.85 for a 6-way category distinction. An F-score of 93.6 for name type recognition of correctly chunked proper nouns, and a 2% chunking error rate, suggest that improvements in lexicon-enhanced preprocessing might improve overall performance. Future work will also focus on tuning CG name rules to Portuguese language data, and on proofreading the name lexicon. Since recall and precision varied significantly across categories, rules should concentrate on precision for person names, and on recall for the minor categories, in particular.
References

1. Bick, Eckhard: The Parsing System 'Palavras': Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Aarhus University Press, Århus (2000)
2. Bick, Eckhard: Named Entity Recognition for Danish. In: Årbog for Nordisk Sprogteknologisk Forskningsprogram 2000–2004. Forthcoming (2003)
3. Bikel, Daniel M., Miller, Scott, Schwartz, Richard, Weischedel, Ralph: Nymble: a High-Performance Learning Name-finder. In: Proc. of the Conf. on Applied Natural Language Processing (1997)
4. Borthwick, Andrew, Sterling, John, Agichtein, Eugene, Grishman, Ralph: NYU: Description of the MENE Named Entity System as Used in MUC-7. In: Proc. of the 7th Message Understanding Conf. (MUC-7), April 29th–May 1st, Fairfax (1998)
5. Demiros, Iason et al.: Named Entity Recognition in Greek Texts. In: Proc. of the 2nd Int. Conference on Language Resources and Evaluation (LREC) (2000)
6. Marsh, E., Perzanowski, D.: MUC-7 Evaluation of IE Technology: Overview of Results. In: Proc. of the 7th Message Understanding Conf. (MUC-7), April 29th–May 1st, Fairfax (1998)
7. Mikheev, Andrei, Grover, Claire, Moens, Marc: Description of the LTG System Used for MUC-7. In: Proc. of the 7th Message Understanding Conf. (MUC-7), April 29th–May 1st, Fairfax (1998)
8. Palmer, David D., Day, David S.: A Statistical Profile of the Named Entity Task. In: Proc. of the Fifth Conference on Applied Natural Language Processing, March 31st–April 3rd (1997)
9. Rocha, Paulo A., Santos, Diana: CETEMPúblico: Um corpus de grandes dimensões de linguagem jornalística portuguesa. In: Maria das Graças Volpe Nunes (ed.): Actas do V PROPOR, Nov. 19th–22nd, Atibaia (2000) 131–140
10. Santos, Diana, Bick, Eckhard: Providing Internet Access to Portuguese Corpora: the AC/DC Project. In: Gavriladou et al. (eds.): Proc. 2nd International Conf. on Language Resources and Evaluation, LREC 2000, Athens (2000) 205–210
11. Stevenson, Mark, Gaizauskas, Robert: Using Corpus-derived Name Lists for Named Entity Recognition. In: Proc. of the Sixth Conf. on Applied Natural Language Processing, Seattle (2000)
HMM/MLP Hybrid Speech Recognizer for the Portuguese Telephone SpeechDat Corpus

Astrid Hagen¹ and João P. Neto¹,²

¹ L2F Spoken Language Systems Lab, INESC-ID, Rua Alves Redol 9, Lisbon, Portugal
{Astrid.Hagen,Joao.Neto}@l2f.inesc-id.pt
² Instituto Superior Técnico, Portugal
Abstract. In this article, we describe an automatic speech recognizer developed for Portuguese telephone speech. For this, we employed the Portuguese SpeechDat database, which will be described in detail, giving its recording conditions, speaker characteristics and contents categories. The automatic recognizer is a state-of-the-art HMM/MLP hybrid system employing different kinds of robust acoustic features. Training and testing were carried out on the clean digits and numbers part of the database. The recognition results show competitive performance compared to similar systems developed for other languages.
1 Introduction
SpeechDat is a series of projects to collect speech data, funded by the European Union¹. The aim of the SpeechDat data collections is to establish spoken language resources for the development of voice operated teleservices and speech interfaces. Spoken language resources are speech databases including annotations, pronunciation lexica, and material for the creation of language models, which are needed for the development and use of speech recognition (and synthesis) technology. During the recording of speech data, the type of microphone (and its position) can already drastically influence the speech signal. Even more importantly, recordings over the telephone line introduce severe distortions due to the telephony transmission channel. With the large variety of telephone gadgets and transmission line characteristics which exist today, such attenuation distortions are hard to predict. The limited bandwidth of the transmission channel of 200/300–3200/3400 Hz additionally restricts the quality of the speech representation. For these reasons, using a speech recognizer over the telephone line which had been trained on data not recorded over a telephone line, or sometimes even only on a different transmission channel, can lead to a severe degradation in recognition accuracy. Thus, the availability of large, telephone-recorded databases is important to the research and development of competitive, state-of-the-art speech recognizers for teleservice applications. Such a database is now
¹ http://www.speechdat.org
also available for (European) Portuguese and we present its main characteristics in the following section. In Sect. 3, we describe our HMM/MLP hybrid recognizer developed on this database and give first test results in Sect. 4.
2 Database Description
The Portuguese SpeechDat database² has been developed within the SpeechDat project to address current and future requirements in the field of telecommunication, spoken language technology and research [8]. It has been recorded in two phases over the public telephone network, involving a large set of speakers, recording conditions and tasks. In the first phase (SpeechDat 1), 1,000 speakers were involved; in the second phase (SpeechDat 2), 4,000 speakers. The Portuguese SpeechDat database was collected by Portugal Telecom via digital line (ISDN). The design and post-processing of the database, including linguistic annotation, was carried out by INESC³. The design of the collection platform and the recording of the speech data itself were carried out by INESCTEL. Speech signals are recorded at 8 kHz, in 8-bit A-law format. The database comprises 14 CDs (3 CDs for SpeechDat 1 and 11 CDs for SpeechDat 2).

2.1 Call Description
Each telephone call included in the database comprises two parts: a first part in which the caller was asked to provide spontaneous answers to nine questions (cf. Table 1), and a second part in which he/she should read a list of 33 items. The answers about "name" and "telephone number" were not included in the CDs due to privacy restrictions.

Table 1. The nine SpeechDat categories used in the first part of each call to produce spontaneous speech

Está pronto a começar?                              Are you ready to start?
Por favor, diga o seu nome.                         Please say your name.
Diga o seu número de telefone.                      Say your telephone number.
Qual a data do seu nascimento?                      What is your birthday?
Qual a cidade (ou distrito) em que passou a maior   In which city (or district) have you spent the
parte da sua infância?                              largest part of your childhood?
É do sexo masculino?                                Are you male?
Está a usar um telemóvel?                           Are you using a mobile phone?
Está a usar um telefone sem fios?                   Are you using a cordless phone?
Que horas são?                                      What time is it?
² http://www.l2f.inesc.pt/resources/spdat/speechdat.html
³ http://www.l2f.inesc-id.pt
After responding to the spontaneous part, the caller is asked to read the sheet number and is then prompted with 32 items to read, which correspond to this sheet. The sheet number consists of a 4-digit number which the caller is asked to read as a digit sequence. Some callers, however, did not stick to this guideline but preferred to read it as a natural number. The prompted items contain: an isolated digit, three natural numbers, a credit card, a telephone and a PIN number, two money amounts, two dates, one time indication, six application words, three spelled words, three word spotting phrases and nine phonetically rich sentences.

1. The isolated digits are the 10 digits zero (zero) to nove (nine), and the female forms of "one" and "two": uma and duas.
2. The natural numbers include the digits and all multiples of 10 and 100, mil and the word e (and).
3. The credit card numbers consist of 4 times 4 digits, e.g. 8654 3374 1250 6017, whereas the telephone numbers comprise 6 to 7 digits, approximately corresponding to the distribution of the telephone numbers in Portugal at that time (40% with 6 digits, 60% with 7 digits).
4. The money amounts contain small (< 10,000$00) and large (> 10,000$00) amounts, as well as the Portuguese words for the former Portuguese currency escudos, centavos (cents) and contos (1,000 escudos).
5. The time phrases include five different types:
   – meio-dia (midday),
   – (meia-noite) e um quarto (a quarter past midnight),
   – (uma) e meia (half past one),
   – um quarto para (meia-noite) (a quarter to midnight), and
   – (duas) e um|dois|três...,
   as well as the following days: ontem (yesterday), hoje (today), amanhã (tomorrow).
6. The dates have the following form:
10. The phonetically rich sentences were created in such a manner as to include in each sentence at least two examples of each phone and as many different triphones as possible.

These items were presented in an alternating fashion in order to avoid fatigue of the speaker.

2.2 Speaker Recruitment and Characteristics
The speakers were recruited among the employees of Portugal Telecom. As the company has a wide geographical coverage, a good representation of many regional accents was guaranteed. The distribution of male and female speakers amounts to 45% and 55%, respectively, with ages ranging from fourteen years to over sixty. Most speakers are from continental Portugal, but some speakers from other areas, such as the Açores (28 speakers), Madeira (8), Africa (32), Macau (1) and others (9), were also included. Most of the speakers born in Africa (Angola, Moçambique, Guiné, Cabo Verde, São Tomé and Príncipe) have been living in continental Portugal for many years, so that their original accents have been reduced.

2.3 Database Annotation
For each available speech signal file there exist a corresponding description file and a comments file, in which e.g. the gender and origin of the speaker are stored, as well as the transcription of what was uttered. The utterances were annotated on the word level by three experienced annotators. The speech data of the first phase (SpeechDat 1) was labeled for start and end point of 13 different noise cases. In the second phase (SpeechDat 2) the noise cases were merged into four remaining classes and roughly marked in every utterance. These four noise classes are:

– filled pauses: [ah], [eh], [hum], ...,
– speaker noise: loud breath intake, throat clearing, coughing, ...,
– non-speaker noise: line noise, radio playing, background voices, ...,
– other events: truncated or mispronounced words, background noise, ....
The lexicon of the entire database consists of approximately 19,744 words. The broad phonetic transcription was carried out using the SAMPA symbols. Only one pronunciation is indicated per word and corresponds to the pronunciation used in the region of Lisbon and usually in the media. The transcription was automatically generated and then manually corrected by a phonetician.
3 ASR System Setup
In this section, we describe the training and test sets, the acoustic modeling used in this work, as well as the vocabulary and language models employed.
3.1 Training and Test Set Definition
Given the size of the corpus, with its many different contents categories, we decided to concentrate on the digits and numbers part of the database, more precisely the categories B1, C1–4 and I1 described in Table 2. These categories are especially important to such application domains as credit card and account number validation, automated dialing, user identification via PIN codes, and others. In this work, sentences whose transcription contained markers for truncated speech, mispronunciations, or unintelligible speech, or noise markers for speaker noise or line/background noise, were disregarded, and only the clean utterances were used. The training and cross-validation set for the digits and numbers part of the SpeechDat 1 and 2 database comprises 9981 clean⁴ utterances (13h 24min of speech), roughly equally distributed in terms of utterances over the six numbers categories, as shown in Table 3. The test set consists of 929 clean utterances (1h 14min of speech), distributed as shown in Table 3. The sets correspond to the defined partitioning of the speakers into training and test set as given on the SpeechDat CDs, so that each speaker was only used in either of the sets.

Table 2. Illustration of the digits and numbers classes of the SpeechDat database

Class ID  Class contents      Example to read  As has been read
B1        10 isolated digits  0965423871       "zero nove seis cinco quatro dois três oito ..."
C1        Sheet number        33546            "três três cinco quatro seis"
C2        Telephone number    090981696        "zero noventa nove oito um seis nove seis"
C3        Credit card number  4585 4567 ...    "quatro mil quinhentos e oitenta e ..."
C4        PIN code            159.160          "cento e cinquenta e nove mil cento e ..."
I1        1 isolated digit    6                "seis"
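For classes B1, C1 and I1, the expected "as read" transcription is a plain digit-by-digit rendering, which a few lines suffice to generate. The sketch below is illustrative only; it uses the standard Portuguese digit names and ignores the female variants uma/duas that callers may also produce.

DIGIT_NAMES = {"0": "zero", "1": "um", "2": "dois", "3": "três",
               "4": "quatro", "5": "cinco", "6": "seis",
               "7": "sete", "8": "oito", "9": "nove"}

def read_as_digits(number_string):
    # Digit-by-digit reading, as expected for sheet numbers (class C1).
    return " ".join(DIGIT_NAMES[d] for d in number_string if d.isdigit())

print(read_as_digits("33546"))   # -> "três três cinco quatro seis"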
Table 3. Distribution of the utterances in the training and cross-validation set (left) and in the test set (right) over the six classes of the SpeechDat database used here

       Training   Test
B1     1461       110
C1     1606       144
C2     1770       179
C3     1621       180
C4     1566       117
I1     1957       199
SUM    9981       929

⁴ "Clean" here signifies no speaker or background noise, though moderate noise introduced by the telephone network is a natural consequence of the recording conditions.
3.2 Acoustic Modeling
An alignment was created with Gaussian models, using a flat start, and then refined with the MLPs, using the clean utterances of the better labeled first part of the corpus (SpeechDat 1). These MLPs were then used to align the clean utterances of the second part (SpeechDat 2). The whole set of training utterances was then re-aligned several times. The use of reliable features is a key issue in the design of an automatic speech recognition system. In this work we investigate 3 feature streams: (i) 12 PLP cepstra and the log energy, (ii) 12 RASTA-PLP cepstra and the log energy, and (iii) 28 Modulation Spectrogram (MSG) features, extracted on windows of 20 ms with a frame shift of 10 ms. The PLP and RASTA-PLP features are extracted from the auditory spectrum after filtering the power spectrum with trapezoidally shaped filters applied at roughly Bark intervals, equal loudness pre-emphasis and cube root compression. The following cepstral analysis calculates the 13 cepstral coefficients [5]. For the RASTA-PLP features, an additional filtering is applied after decomposition of the spectrum into critical bands. This RASTA filter suppresses the low modulation frequencies, which are supposed to stem from channel effects rather than from speech characteristics. The PLP and RASTA-PLP streams were augmented by their delta features. For the extraction of the MSG features, the frequency domain is divided into 1/4-octave bands, resulting in 14 bands, each of which is filtered with two modulation frequency pass-bands, the first ranging from 0–8 Hz, the second from 2–8 Hz. The two sets of 14 coefficients are then concatenated to give the MSG feature vector of 28 coefficients. We work in the framework of HMM/MLP hybrid systems, where the posterior probabilities at the output of the MLP are, after division by the priors, used as scaled likelihoods in the HMM for decoding [2,7]. The MLP uses 7 frames of context information (except for the MSG features, where 9 frames are used) in order to better account for coarticulation effects and to model the phone changes in more detail. The hidden layer consists of 2,000 nodes (2,770 for MSG), and the number of output nodes corresponds to the number of speech units in the digits and numbers part of the SpeechDat corpus. We investigated the use of two different phone sets: (i) context-independent (CI) monophone models and (ii) context-dependent (CD) triphone models. The MLPs trained to estimate context-independent observation probabilities use 31 output nodes (one for silence), as only 30 monophones occur in the numbers part of the corpus. The remaining 7 nodes, which usually correspond to the remaining monophones, were not trained, as these monophones did not occur in the digits and numbers part of the database. It is advantageous to model phonetic units with a sequence of probability distributions rather than with a single distribution only, in order to capture some of the dynamics of the phonetic segments. For this reason, the HMM state of each monophone model is repeated three to six times, depending on the respective monophone. In order to better exploit the large acoustic input available to the MLP, context-dependent triphone models were investigated next. The use of triphones implies enlarging the output layer of the MLP. More (speech unit) classes at the
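The posterior-to-likelihood conversion at the core of the hybrid approach is a one-liner in practice. The sketch below, with dummy shapes and a dummy uniform prior, only illustrates the division by the priors described above (cf. [2,7]); it is not the actual decoder.

import numpy as np

def scaled_likelihoods(posteriors, priors, floor=1e-8):
    # MLP posteriors P(q|x) divided by class priors P(q) yield scaled
    # likelihoods p(x|q)/p(x), usable as HMM emission scores.
    return posteriors / np.maximum(priors, floor)

T, Q = 5, 31                                     # frames x speech units
post = np.random.dirichlet(np.ones(Q), size=T)   # dummy MLP outputs
priors = np.full(Q, 1.0 / Q)                     # dummy uniform priors
print(scaled_likelihoods(post, priors).shape)    # (5, 31)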
output of the MLP render the MLP more difficult to train and increase the need for more training data. For the digits and numbers task, this is still feasible, as the number of occurring triphones is limited and the size of the MLP's output layer will not increase too much. To train the MLPs to output posterior probabilities for triphone models, we need a frame-level alignment for the triphones. For this, we substituted in the monophone-based alignment each monophone label by a new label which depended on both the monophone's left and right context: e.g. the monophone transcription of the word 'dois' (two), 'd of y ch', will result in the triphone transcription '?-d-of d-of-y of-y-ch y-ch-?' (the '?' marks the beginning and end of a word). This gave us a set of 151 triphone labels (word-internal only), used at the output of the context-dependent MLPs. This alignment was then used to train these context-dependent neural nets. The triphone HMM models use 3 states for duration modeling. Only the silence model uses just one state, without duration modeling.
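The label substitution itself is mechanical; the following sketch reproduces the 'dois' example above exactly (word-internal triphones with '?' at word boundaries).

def word_internal_triphones(phones):
    # Expand a word's monophone sequence into word-internal triphones.
    padded = ["?"] + list(phones) + ["?"]
    return ["-".join(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]

print(word_internal_triphones(["d", "of", "y", "ch"]))
# -> ['?-d-of', 'd-of-y', 'of-y-ch', 'y-ch-?']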
3.3 Vocabulary and Language Modeling
The vocabulary consists of 51 words for which an internal transcription was available. These words cover the 10 isolated digits, the 2 female forms “uma” and “duas”, and the natural numbers as described in Sect. 2.1. Only 30 of the Portuguese phones actually occur in the digits and numbers, so that the phone set could be restricted to these 30 phones. The language model (LM) was set up on the training utterances, using the CMU-Cambridge Language Modeling Toolkit V2.05. The Good-Turing method was used to estimate the closed-vocabulary, back-off bigram LM which contains 2601 bigrams. Missing bigram combinations which did not occur in the training data were manually added. The perplexity of the LM on the test set is 10.73.
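For reference, test-set perplexity of such a bigram LM is computed as sketched below. The model here is a dummy stand-in: the function is assumed to return log2 P(w2|w1), with Good-Turing discounting and back-off weights already folded in.

import math

def bigram_perplexity(sentence, bigram_logprob):
    # Perplexity = 2^(-average log2-probability per predicted word).
    words = ["<s>"] + sentence.split() + ["</s>"]
    ll = sum(bigram_logprob(w1, w2) for w1, w2 in zip(words, words[1:]))
    return 2 ** (-ll / (len(words) - 1))

# Dummy model: every bigram equally likely over the 51-word vocabulary.
print(bigram_perplexity("três três cinco quatro seis",
                        lambda w1, w2: -math.log2(51)))   # -> 51.0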
4 Experiments and Results
Experiments were carried out with HMM/MLP hybrid recognizers employing PLP [5], RASTA [6] or MSG [3] features. The results of the three systems are given in Table 4, for both the context-independent (CI) monophone models and the context-dependent (CD) triphone models.

Table 4. % Word error rates (WERs) of each of the three feature streams as employed in our HMM/MLP hybrid recognizer

        CI models   CD models
PLP     7.2         6.6
MSG     7.3         6.8
RASTA   8.0         8.4
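The WERs above follow the usual definition, computable via word-level edit distance; a minimal reference implementation is sketched below for clarity.

def wer(reference, hypothesis):
    # WER = (substitutions + insertions + deletions) / reference length,
    # obtained from the Levenshtein distance over word sequences.
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (r[i - 1] != h[j - 1]))
    return d[len(r)][len(h)] / len(r)

print(wer("zero nove seis cinco", "zero nove sete cinco"))   # -> 0.25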
The recognizers employing PLP or MSG features resulted in the lowest word error rates (WERs) for both CI and CD modeling. The RASTA features, which are also PLP-based but include a further filtering during feature extraction, gave significantly worse results. RASTA filtering is usually necessary for noise-corrupted speech and speech recorded over very different telephone lines. Although the SpeechDat database was collected over a large set of different telephone connections, the clean part of the SpeechDat corpus which we chose for these experiments seems to be rather homogeneous and does not need any additional noise filtering. For the PLP and MSG feature sets, it is the context-dependent triphone modeling which enhances recognition performance, due to its modeling of larger contexts and coarticulation effects. In the case of the PLP features, the improvement in WER is significant. For the MSG features, the difference is not significant at a confidence level of 97.5%. These results can e.g. be compared to results reported on the telephone-recorded OGI Numbers 1995 database [1], where authors report WERs of 6.8% using RASTA-PLP features and 9.8% using MSG features [9], and about 7.1% using PLP features [4].
5 Conclusions
In this article, we presented the Portuguese SpeechDat database, which is the first telephone-based, large-vocabulary speech database available in (European) Portuguese. We described its main features, such as recording conditions, speakers, vocabulary, and linguistic annotations, and the development of a speech recognizer trained and tested on the (clean) digits and numbers part of this corpus. The results show competitive performance with respect to state-of-the-art (numbers) recognizers developed on telephone databases in other languages. We plan on extending our work on the Portuguese SpeechDat database to large-vocabulary recognition. Moreover, we want to use the noise annotations of the better labeled first part of the database in order to investigate and develop noise models which will help us to better annotate the second part of the corpus as well. The final goal is to create speech recognizers robust to various kinds of different speaker, line and background noises.

Acknowledgments. Astrid Hagen was supported by the Portuguese FCT (Fundação para a Ciência e a Tecnologia) scholarship SFRH/BPD/6757/2001. Additionally, this work was partially funded by the FCT project POSI/33846/PLP/2000. INESC-ID Lisbon had support from the POSI Program of the "Quadro Comunitário de Apoio III".
References

1. Center for Spoken Language Understanding, Department of Computer Science and Engineering, Oregon Graduate Institute: Numbers Corpus, Release 1.0 (1995)
2. Bourlard, H., Morgan, N.: Connectionist Speech Recognition. A Hybrid Approach. Kluwer Academic Publishers, Norwell, Massachusetts (1994)
3. Greenberg, S., Kingsbury, B.E.D.: The modulation spectrogram: In pursuit of an invariant representation of speech. Proc. Int. Conf. on Acoustics, Speech and Signal Processing (1997) 1647–1650
4. Hagen, A.: Robust speech recognition based on multi-stream processing. PhD thesis, Département d'informatique, École Polytechnique Fédérale de Lausanne, Switzerland (2001)
5. Hermansky, H.: Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America 87(4) (April 1990) 1738–1752
6. Hermansky, H., Morgan, N., Bayya, A., Kohn, P.: RASTA-PLP speech analysis technique. IEEE Trans. on Signal Processing 1 (1992) 121–124
7. Morgan, N., Bourlard, H.: Continuous speech recognition. IEEE Trans. on Signal Processing (1995) 25–41
8. SPEECHDAT: European speech databases for telephone applications (EU project LRE-633140). In: Proc. Int. Conf. on Acoustics, Speech and Signal Processing (1997)
9. Wu, S.L., Kingsbury, B., Morgan, N., Greenberg, S.: Incorporating information from syllable-length time scales into automatic speech recognition. Proc. Int. Conf. on Acoustics, Speech and Signal Processing 1 (1998) 721–724
Managing Linguistic Resources and Tools

David M. de Matos, Joana L. Paulo, and Nuno J. Mamede

L2F – Spoken Language Systems Laboratory
INESC-ID Lisboa/IST, Rua Alves Redol 9, 1000-029 Lisboa, Portugal
{david.matos,joana.paulo,nuno.mamede}@inesc-id.pt
http://www.l2f.inesc-id.pt/
Abstract. We present Galinha, a system that integrates multiple linguistic resources and tools. Galinha enables easy module integration and testing of prototypical configurations, thereby reducing the effort and backtracking usual in the construction of modular applications. Moreover, it has a soft learning curve, enabling new users and developers to use it successfully.
1 Introduction
Large R&D groups are often presented with the problem of reusing existing resources and tools. These may have been produced in-house or they may be third-party modules. In either case, the task of managing them is not simple: for instance, some tool may be available but may be deemed too hard to reuse for a particular task, causing the redevelopment of a similar tool. This makes application construction more difficult and diverts productive effort to tasks that have already been carried out. If reuse is a problem, the contact between old tools and new users is also a critical issue. The problem here is often the time required to acquire the necessary expertise to fully and productively use some resource. To address the above issues, we present a web-based user interface for building modular applications. The interface has proved to be quite useful in allowing new users and non-specialists to assemble and test complex prototypes: the only requirement is a clear understanding of the meaning of the data used by each module – a requirement much less stringent than understanding the modules themselves. This document is organized as follows: Sect. 2 briefly presents the underlying support system. The interface itself is presented in Sect. 3; design and usage issues are also covered there. Section 4 discusses the development and deployment of new modules, as well as the integration of existing ones, in our framework. Finally, a few remarks about related systems and architectures are presented (Sect. 5), as well as directions for future work (Sect. 6).
2 Infrastructure
The infrastructure used to support the interface is a partial implementation¹ of the theoretical interconnection model proposed in [4]. The Galaxy Communicator system [7]

¹ Currently, the main capabilities have been implemented, but things like message type checking are still in their infancy.
Fig. 1. The support infrastructure: Galaxy system parts and dedicated servers
was selected to provide messaging support for the infrastructure’s message exchanges. Galaxy has a distributed hub-and-spoke message-based architecture optimized for constructing spoken dialogue systems. It was selected because it was already being used by us for other purposes and because the new purpose did not in any way affect existing uses. This conjunction of factors allowed us to easily migrate/integrate existing modules to the new framework with minimal repercussions. Figure 1 shows the interconnection infrastructure along with two custom control servers: the MultipleWebClient and the hub controller. Figure 2 shows the infrastructure within the interfacing system. The MultipleWebClient is a gateway that routes interface calls to the underlying system, allowing multiplexed communication with external clients: a dispatcher receives requests from the web interface and spawns workers to handle them. This layer exists to enable the infrastructure to serve more than one request at a time. Due to Galaxy design options, a dedicated connection would otherwise be needed and it would have to adhere to the system’s event handling methodology, something that would not be to our advantage. The problem was solved by resorting to undocumented Galaxy features (support was kindly provided by the Galaxy team). The required functionality may become available in future Galaxy versions. The hub controller server, a meta-information manager, ensures correct system behavior: the first request from the user interface is for a description of the infrastructure itself. The hub controller sends this description to the upper levels, allowing the interface to present a list of servers and programs.
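The dispatcher/worker pattern just described can be pictured with a few lines of code. The sketch below is a generic illustration (the socket details and the handler are invented), not the MultipleWebClient itself.

import socket, threading

def dispatch(handler, port=9000):
    # Accept requests and spawn one worker thread per connection, so
    # that the gateway can serve more than one request at a time.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", port))
    srv.listen(5)
    while True:
        conn, _addr = srv.accept()
        threading.Thread(target=handler, args=(conn,), daemon=True).start()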
3 The Interface

Galinha (Galaxy Interface Handler) simplifies access to modules, applications, and library interfaces: it enables users to access and compose modules using a web browser. The application server is one of the interface's key components. It provides the runtime environment for the execution of the various processes within the web application. Moreover, it maintains relevant data for each user, guarantees security levels and manages access control. The interface also uses the application server as a bridge between the interface's presentation layer (HTML [15]/JavaScript [5], at the browser level) and the infrastructure. The presentation layer consists of a set of Java classes, servlets, and server pages (JSPs). It is built from information about the location (host and port) of the MultipleWebClient that provides the bridge with the Galaxy system the user wants to contact; and
Fig. 2. The Galaxy infrastructure, control servers, and user-side levels
from XML [14] descriptions of the underlying Galaxy system (provided by the hub controller – see Sect. 2). Besides allowing execution of services on user-specified data, the interface allows users to create, store, and load service chains. Service chains are user-side sequences of services or programs provided by the servers connected to the infrastructure: each service is invoked according to the user-specified sequence. Service chains provide a simple way for users to test sequences of module interactions without having to actually freeze those sequences or build an application. The interface allows inspection not only of the end results of a service chain, but also of its intermediate results. Service chains may be stored locally, as XML documents, and may be loaded at any time by the user. Even though, from the infrastructure's point of view, service chains simply do not exist, selected service chains² may be frozen into system-side programs and become available for general use.

Using the Web Interface

To use the interface, users must first specify the location – host and port – of the back-end system. Then, the interface presents the main view, divided into four areas (see Fig. 3): the top one provides a general control menu; this frame is always available. The left frame presents the back-end system's description and allows access to servers and programs at any time, each of which in turn has additional subdivisions. The frame to the right presents the service chain currently active, if any. The main frame (also the main input area) presents various states of the system or of its interaction with the user. When a service is selected, its description is presented to the user, stating the input and output ports, as well as a description of its actions. Service selection also presents the user with the list of possible operations on that module. On the right hand frame, the active service chain's name appears at the top, followed by a collapsible free-text description. The rest of the frame contains the list of modules in the service chain, as well as the state of their interconnections.
² Tested and approved by the infrastructure's administration, according to relevant criteria.
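As a concrete picture of a locally stored chain, the sketch below parses a hypothetical chain description; the element and attribute names are invented for illustration, since the actual schema is not given here, while the module names are those of the morphological tools integrated later (Sect. 4).

import xml.etree.ElementTree as ET

CHAIN = """<chain name="morpho-pipeline">
  <service name="SMorph" output="s1"/>
  <service name="PAsMO" input="s1" output="s2"/>
  <service name="MARv" input="s2" output="s3"/>
</chain>"""

root = ET.fromstring(CHAIN)
for svc in root.findall("service"):
    print(svc.get("name"), svc.get("input"), "->", svc.get("output"))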
Fig. 3. Execution of a simple service chain. Main window shows all partial results
Each service in a chain can be in one of two states: complete, i.e., all input and output connections have been specified; or incomplete (the default). The number to the left of the service’s name also indicates this state: a light box indicates a completed service whereas a dark one indicates that at least one of the service’s ports remains unconnected. A chain becomes executable when all of its services are complete. After execution of a service chain, resulting data (both partial and final) may be viewed in the web browser or saved to a file.
4 Module Definition
Modules are included within the infrastructure in two ways: the first is to create the module anew or to adapt it so that it can be incorporated into the system; the second is to create a capsule for the existing module – this capsule then behaves as a normal Galaxy server would. Whenever possible or practical, we chose the second path. Favoring the second option proved a wise choice, since almost no changes to existing programs were required. In truth, a few changes, mainly regarding input/output methods, had to be made, but these are much simpler than rebuilding a module from scratch. Mainly, changes were caused by the requirement that some of the modules accept/produce XML data in order to simplify the task of writing translations. This is not a negative aspect, since the use of XML as intermediate data representation language also acts as a normalization measure: it actually makes it easier for future users to understand modules’ inputs and outputs.
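A capsule of this kind can be as small as the sketch below: run the existing command-line module on XML input and hand back its XML output. The command name is hypothetical; the Galaxy server plumbing around it is omitted.

import subprocess

def run_wrapped_module(command, xml_input):
    # Invoke an external tool as a black box over stdin/stdout,
    # exchanging XML so that translations stay easy to write.
    result = subprocess.run(command, input=xml_input, text=True,
                            capture_output=True, check=True)
    return result.stdout

# e.g. run_wrapped_module(["smorph", "--xml"], "<text>ontem</text>")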
Fig. 4. The ATA system: once defined, the chain implementing linguistic enrichment may be reused by other, possibly unrelated, applications. The symbol ⊗ marks connections performing data format translations
The first modules included in the system were simply for test purposes and were not, in any case, very complex. The first practical test took place when production modules had to be incorporated into the system. Several modules, namely SMorph [1] (morphological analysis), PAsMO [11] and MARv [13] (morphological processing), and SuSAna [2], based on AF [6] (syntactic analysis), were added. Writing general adapter modules (such as data format converters) was also required. In both cases, the work to be done proved to be simple (only data format manipulations were required). As an example, one of our co-workers, with no previous contact with the system, was able to successfully have new modules incorporated into the infrastructure. In addition to existing servers (the ones mentioned above), new servers were used to build the ATA system [12], an automatic term extractor that uses linguistic and statistic information (Fig. 4). The first of the new modules enriches the text with statistical information about words and noun-phrases. The second decides whether each candidate term is in fact a term (taking into account corpora-based statistical information). Existing modules for morphological analysis and processing were reused, but, since they accept/produce different data formats, two other modules were added to provide data format conversion through XSLT [16] templates. All that was needed to integrate an existing XSLT processor into the system was the coding of a wrapper to call the external application. The wrapper was simple enough that we were able to generalize it so that other similar applications could be incorporated in this way into the infrastructure. This line of work simplifies the overall integration process and enables users to add new modules knowing only how they are activated. Building the ATA system had a beneficial side-effect: the creation of a reusable chain. This chain may now be used for other purposes (in parallel with ATA), by other applications running on top of the same infrastructure. This freedom when making new connections allows users to explore new applications for “old” chains.
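Such an XSLT-based converter is conceptually tiny. The sketch below uses the third-party lxml library as a modern stand-in for the XSLT processor that was wrapped; the stylesheet path is a placeholder.

from lxml import etree

def make_converter(stylesheet_path):
    # Compile an XSLT template into a reusable data-format converter,
    # analogous to the translation connections marked in Fig. 4.
    transform = etree.XSLT(etree.parse(stylesheet_path))
    return lambda xml_text: transform(etree.fromstring(xml_text))

# e.g. to_susana = make_converter("pasmo-to-susana.xsl")  # hypothetical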
5 Related Work
Although our work is not directly related to the field of data modeling, we can take advantage of data and metadata descriptions, such as UML [9] specifications. These specifications can be represented using the XML Metadata Interchange [10] format and, thus, easily processed. Also, they can be used to specify the schemata for the data being sent/received by infrastructure modules. UML thus allows for graphical module and module interconnection descriptions and, by extension, the description of complete applications. The work presented here, using the Galaxy infrastructure, is a simple way of delivering messages from one module to another. Others, such as CORBA-based [8] communication systems, could be used, as long as the basic interchange model is respected, much as Galaxy does (it is not the reference implementation). Note that, unlike most software infrastructures for language engineering research and development, e.g. GATE [3], the model underlying the infrastructure does not say anything about any module's function and does not impose any restrictions on their interfaces. Thus, the entire framework is application- and domain-independent.
6 Conclusions and Future Directions
The interface is useful for application development, since it focuses exclusively on the data flowing to/from each module, without regard for module internals, including the implementation language or internal data representation. Significant dependency reductions can be achieved and module reuse boosted. Another consequence is that modules can be almost anything and run almost anywhere, as long as a communication channel can be established between them. Also, the use of text frames as a communication medium allows for flexible module deployment. As shown for the ATA system, only two of the six processing modules had to be included in the infrastructure (the others were reused). In addition, the infrastructure is now richer, since the two new modules become available for use in other contexts. The problems encountered concerned data format translation and were easily solved through the inclusion of general mapping steps, using XSLT (as described in Sect. 4). Note that the whole ATA system was built by someone who had no previous experience with the infrastructure. The learning curve was both fast and comfortable (the whole integration task took approximately three days). The interface, by hiding most of the complexities of the underlying system and of the modules attached to it, empowers non-expert users to play with various scenarios and to investigate possible differences. This is particularly important in environments such as schools, in which students have to become acquainted with the tools relevant for a particular field, and whenever it is desirable to have a fast learning curve, e.g. when new people join a project team. The advantages for expert users lie in simplified prototype construction, as well as in application configurations more amenable to change. Also, most irrelevant aspects (those outside the application's domain) may be safely ignored, i.e., the modules may be treated as real black boxes.
The underlying model is useful not only in helping application construction, but also as a guide to thinking about modular applications: the interface materializes this aspect, since it allows non-expert users to successfully design applications and application components. Regarding future developments in the interface: while the one described was aimed only at module users, we envision the development of another interface, this time aimed at helping module developers integrate their work for use in the system. This interface can be developed as an extension to the current one, or as a completely independent one. Another development is to allow multiple chains on the client side. This would make the system more flexible in reusing chains without these having to be transformed into infrastructure-resident programs (reusable, but less amenable to changes). As a means of efficiently reusing the result sets of previous computations, we intend to include caching capabilities on the client side of the interface handler, thus avoiding unnecessary server/infrastructure calls and improving bandwidth usage.

Acknowledgements. We would like to acknowledge the work by João Graça and Alexandre Mateus on the web interface.
References

1. Aït-Mokhtar, S.: L'analyse présyntaxique en une seule étape. Thèse de doctorat, Université Blaise Pascal, GRIL, Clermont-Ferrand (1998)
2. Batista, F.: Análise Sintáctica de Superfície e Coerência de Regras. MSc thesis, Instituto Superior Técnico, UTL, Lisboa (2003)
3. Cunningham, H., Wilks, Y., Gaizauskas, R.J.: GATE – a General Architecture for Text Engineering. In: Proc. of the 16th Conf. on Computational Linguistics (COLING96), Copenhagen (1996)
4. de Matos, D.M., Mateus, A., Graça, J., Mamede, N.J.: Empowering the user: a data-oriented application-building framework. In: Adj. Proc. of the 7th ERCIM Workshop "User Interfaces for All", Chantilly, France, European Research Consortium for Informatics and Mathematics (October 2002) 37–44
5. ECMA International, Geneva, Switzerland: Standard ECMA-262 – ECMAScript Language Specification, 3rd edition (December 1999). See also: http://www.ecma.ch/
6. Hagège, C.: Analyse syntaxique automatique du portugais. Thèse de doctorat, Université Blaise Pascal, GRIL, Clermont-Ferrand (2000)
7. Massachusetts Institute of Technology (MIT), The MITRE Corporation: Galaxy Communicator (DARPA Communicator). See: http://communicator.sf.net/
8. Object Management Group (OMG): Common Object Request Broker Architecture (CORBA). See: www.corba.org
9. Object Management Group (OMG): Unified Modelling Language. See: www.uml.org
10. Object Management Group (OMG): XML Metadata Interchange (XMI) Specification (2002). See: www.omg.org/technology/documents/formal/xmi.htm
11. Paulo, J.L.: PAsMO – Pós-Análise Morfológica. Technical report, L2F – INESC-ID, Lisboa (2001)
12. Paulo, J.L., Correia, M., Mamede, N.J., Hagège, C.: Using Morphological, Syntactical, and Statistical Information for Automatic Term Acquisition. In: Ranchhod, E., Mamede, N. (eds.): Advances in Natural Language Processing, 3rd Intl. Conf., Portugal for Natural Language Processing (PorTAL), Faro, Portugal. Springer-Verlag, LNAI 2389 (2002) 219–227
13. R. Ribeiro, L. Oliveira, and I. Trancoso. Morphossyntactic Disambiguation for TTS Systems. In Proc. of the 3rd Intl. Conf. on Language Resources and Evaluation, volume V, pages 1427–1431. ELRA, 2002. ISBN 2951740808. 14. World Wide Web Consortium (W3C). Extensible Markup Language (XML). See: www.w3.org/XML. 15. World Wide Web Consortium (W3C). HyperText Markup Language (HTML). See: www.w3.org/MarkUp. 16. World Wide Web Consortium (W3C). The Extensible Stylesheet Language (XSL). See: www.w3.org/Style/XSL.
Using Morphossyntactic Information in TTS Systems: Comparing Strategies for European Portuguese

Ricardo Ribeiro¹, Luís Oliveira², and Isabel Trancoso²

¹ INESC-ID Lisboa/ISCTE
² INESC-ID Lisboa/IST
Spoken Language Systems Lab, R. Alves Redol, 9, 1000-029 Lisbon, Portugal
{Ricardo.Ribeiro,Luis.Oliveira,Isabel.Trancoso}@inesc-id.pt
Abstract. To improve the quality of the speech produced by a Text-to-Speech (TTS) system, it is important to obtain the maximum amount of information from the input text that may help in this task. This covers a wide range of possibilities, from the simple conversion of non-orthographic items to more complex syntactic and semantic analysis. In this paper, we present the development of a morphossyntactic tagging system and analyze its influence on the performance of a TTS system for European Portuguese.
1 Introduction
The information obtained by a morphossyntactic tagging system can be relevant in several areas of natural language processing. For example, knowing the part-of-speech of a given word allows us to predict which words (or word types) can occur in its neighborhood. That kind of information is useful in the language models used for speech recognition. Morphossyntactic information can also be used by automatic term acquisition systems or information retrieval systems to select special words (or word types) or to know which affixes a given word can take. In the same way, a morphossyntactic tagger can help a Text-to-Speech (TTS) system improve the quality of the produced speech. The first stage of a TTS system is a Text Analysis module, whose purpose is to generate tagged text that will be submitted to the Phonetic Analysis module. The next module is the one responsible for the Prosodic Analysis: pitch and duration information are attached in this phase, and the controls for the Speech Synthesis module are generated. The Speech Synthesis module then renders the appropriate voice sound. There are three basic phases in the Text Analysis module: document structure detection; text normalization; and linguistic analysis. The one that concerns us in this paper is the inclusion of a morphossyntactic tagger in the linguistic analysis. The information obtained by a morphossyntactic tagging system is relevant to the Phonetic and Prosodic Analysis modules. Concerning the Phonetic Analysis module, in Portuguese, as in other languages, the pronunciation of a word
can depend on the word class (or part-of-speech, lexical tag, morphossyntactic class, etc.). For example, the word "almoço" is pronounced "almoço" (close "o") if used as a noun, and "alMOço" (open "o") if used as a verb. The same happens with the word "object" in English: "OBject" if used as a noun and "obJECT" if used as a verb. Thus, knowing the part-of-speech may help the system produce correct pronunciations for some homograph words. Furthermore, it may also help identify special classes of vocabulary for which specific pronunciation rules are needed. Morphossyntactic information may also influence the performance of the Prosodic Analysis module, contributing to prosodic phrasing and accentuation. Usually, words are spoken continuously until some linguistic phenomenon introduces a discontinuity, which can take various forms. Although it is commonly agreed that prosodic structures are not fully congruent with syntactic structures, morphossyntactic information can help to predict where these discontinuities can occur and of what type they can be [13]. In terms of accentuation, a very basic method to decide whether a word is accentable may be based on the part-of-speech category of that word, accenting "all and only the content words" [7]. The content words belong to major open-class categories such as noun, verb, adjective, adverb, and certain closed-class words such as negatives and some quantifiers. The next section describes the part-of-speech tagging system developed for Portuguese. Section 3 describes the corpus and the tagset we have used for developing the system, and the lexicons involved. Before concluding, we compare the results obtained by the developed system with those achieved by other taggers based on different approaches, considering the effects of the different classes of errors on the performance of the complete TTS system.
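Before doing so, here is a toy sketch of how a tagger's output could drive homograph pronunciation downstream. The lexicon format and the phone strings are invented for illustration; they are not Festival's actual entries or the system's actual code.

# Hypothetical homograph lexicon keyed on (word, POS).
# Phone strings are simplified invented transcriptions.
PRONUNCIATIONS = {
    ("almoço", "Nc"): "a l m o s u",   # common noun: closed "o"
    ("almoço", "V="): "a l m O s u",   # verb: open "o"
}

def pronounce(word, pos):
    # The tagger's decision (e.g. "Nc" vs. "V=") selects the pronunciation.
    return PRONUNCIATIONS.get((word, pos), "<out-of-lexicon>")

print(pronounce("almoço", "V="))  # -> a l m O s u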
2 Morphossyntactic Tagging System
The morphossyntactic tagging process we have implemented consists of the two sequential steps illustrated in figure 1. The separation between morphological analysis and ambiguity resolution was motivated by the fact that Neo-Latin languages, such as Portuguese, are highly inflectional when compared with English. In this sense, morphological analysis can be relevant. In fact, on the one hand, linguistically oriented systems are usually based on the elimination of the ambiguity previously introduced by a lexical analysis process, and, on the other hand, in data-driven approaches, information is derived from corpora and, due to data sparseness, word forms may not appear with all possible tags or may not occur at all [8,10]. The morphological analysis module adopted is Palavroso, a broad-coverage morphological analyzer [9] developed to address specific problems of the Portuguese language such as compound nouns, enclitic pronouns, and adjective degree. As a result, it gives all possible part-of-speech tags for a given word. If a word is not known, it tries to guess possible part-of-speech tags, always giving an answer. The disambiguation module, developed in the context of this work, is MARv (Morphossyntactic Ambiguity Resolver). MARv's architecture comprehends two
Fig. 1. Architecture of the morphossyntactic tagging system
modules: a linguistically oriented disambiguation rules module and a probabilistic disambiguation module. The ambiguity is first reduced by the disambiguation rules module, and then the probabilistic module produces a fully disambiguated output. The disambiguation rules module is based on a set of contextual rules developed specifically for Portuguese. The rules have the following structure: an input trigger section; an if-condition; and an action section.
Input: AMB = "A= Nc V="
If (-1/TAG = "S=") then SELECT "Nc"

Fig. 2. Disambiguation rule
As shown in figure 2, the input trigger consists of a simple condition that checks whether the observed input matches an ambiguity class (AMB) or a given word. If the rule is triggered, the if-condition is evaluated. The terms involved have the following format:

(position relative to the observed input/keyword [ = | ≠ ] value)
where keyword can be TAG, AMB, or WORD. The actions to be performed may be of two types: a selection (SELECT) of a single tag or a removal (REMOVE) of a set of tags. The current set comprises 35 rules [6]. The probabilistic disambiguation module is based on Markov models and uses the Viterbi algorithm to find the most likely sequence of tags for the given sequence of words, and the forward algorithm, presented in [1], to compute the lexical probabilities. The forward probability $\alpha_i(t)$ is the probability of producing the word sequence $w_1, \cdots, w_t$ and ending in the state $w_t/T_i$, where $T_i$ is the $i$-th tag of the tagset:

$$\alpha_i(t) = P(w_t/T_i, w_1, \cdots, w_t)$$

Then we can derive the probability of a word $w_t$ being an instance of lexical category $T_i$ as

$$P(w_t/T_i \mid w_1, \cdots, w_t) = \frac{P(w_t/T_i, w_1, \cdots, w_t)}{P(w_1, \cdots, w_t)}$$

Estimating the value of $P(w_1, \cdots, w_t)$ by summing over all possible sequences up to any state at position $t$, we obtain:

$$P(w_t/T_i \mid w_1, \cdots, w_t) \cong \frac{\alpha_i(t)}{\sum_{j=1,N} \alpha_j(t)}$$

An in-depth description of this system can be found in [11].
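A minimal sketch of this computation follows. It is our own illustration, not the MARv code; the tag set and all probability values are invented toy numbers.

# Sketch: forward algorithm for an HMM POS tagger.
# Returns, for each position t, P(T_i | w_1..w_t) = alpha_i(t) / sum_j alpha_j(t).
def forward_tag_probabilities(words, tags, init, trans, emit):
    posteriors = []
    alpha = {}
    for t, w in enumerate(words):
        new_alpha = {}
        for tag in tags:
            if t == 0:
                prev = init[tag]
            else:
                prev = sum(alpha[p] * trans[p][tag] for p in tags)
            # Small floor for unseen (word, tag) pairs to avoid zeroing out.
            new_alpha[tag] = prev * emit[tag].get(w, 1e-9)
        alpha = new_alpha
        total = sum(alpha.values())
        posteriors.append({tag: a / total for tag, a in alpha.items()})
    return posteriors

# Toy example (invented numbers): disambiguating "a" (article vs. noun reading).
tags = ["Td", "Nc"]
init = {"Td": 0.6, "Nc": 0.4}
trans = {"Td": {"Td": 0.05, "Nc": 0.95}, "Nc": {"Td": 0.5, "Nc": 0.5}}
emit = {"Td": {"a": 0.9}, "Nc": {"a": 0.01, "casa": 0.3}}
print(forward_tag_probabilities(["a", "casa"], tags, init, trans, emit))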
3 Linguistic Resources

3.1 Corpus
The corpus used for training and testing was developed in the LE-PAROLE European project [2], in which harmonized reference corpora and generalist lexica were built according to a common model for the 12 European languages involved. The corpus used in the present work is a subset of about 290,000 running words of the collected 20 million running words corpus for European Portuguese. This subset was morphossyntactically tagged using Palavroso and manually disambiguated. The tagset had about 200 tags with information that varied from grammatical category to morphological features that could be combined to form composed tags (resulting in about 400 different tags). The information coded by the tagset is presented in Table 1. The tagset was fully harmonized between all the languages involved. Each tag is an array, and each position of the array codes one of the features presented in Table 1, reserving the first for the grammatical category and the second for the subcategory. When a position (category, subcategory, or feature) is not used, its code is replaced by an equal sign. For example, R=n means adverb with no subcategory, in normal degree. This corpus was subdivided into training and test subsets. The training corpus has about 230,000 running words, and it covers about 25,000 different word forms.
Table 1. Morphossyntactic information

Category      Subcategory      Features                                   Tag
Noun          proper           gender and number                          Np
              common           gender and number                          Nc
Verb          main, auxiliary  mood; tense; person; gender and number     V=
Adjective                      degree; gender and number                  A=
Pronoun       personal         person; gender; number;                    Pp
              demonstrative    case and formation                         Pd
              indefinite                                                  Pi
              possessive                                                  Po
              interrogative                                               Pt
              relative                                                    Pr
              exclamative                                                 Pe
              reflexive                                                   Pf
Article       definite         gender and number                          Td
              indefinite       gender and number                          Ti
Adverb                         degree                                     R=
Adposition                     formation; gender and number               S=
Conjunction   coordinative                                                Cc
              subordinative                                               Cs
Numeral       cardinal         gender and number                          Mc
              ordinal                                                     Mo
Interjection                                                              I
Unique        mediopassive                                                U
Residual      foreign                                                     Xf
              abbreviation                                                Xa
              acronym                                                     Xy
              symbol                                                      Xs
Punctuation                                                               O
The test corpus has about 60,000 running words, of which about 900 are words marked as errors, 21,000 are ambiguous (34.6%), and the remaining 38,000 are non-ambiguous. It includes around 10,000 different word forms, with 1.73 tags per word on average and 30.69% different ambiguous word forms. The tagset used by the taggers was obtained by down-sizing the LE-PAROLE tagset to 54 tags. Only the information about the grammatical category and subcategory was retained.

3.2 Lexica
The lexicon used by the probabilistic module of the disambiguation system has about 25,000 entries with associated probabilities. All the information in the lexicon was obtained from the above training corpus. In order to analyze the influence of the taggers on the Phonetic Analysis module, we used the main lexicon of the Portuguese version of Festival. This lexicon contains about 79,000 different entries, each characterized by morphossyntactic tags and the corresponding pronunciation. It includes 76 different types of ambiguities, the most frequent being adjective/common noun, adjective/verb, and common noun/verb. However, only 16 of these ambiguities influence the Phonetic Analysis module by causing different pronunciations. Table 2 presents them, together with the percentage of different word forms of the lexicon showing each kind of ambiguity.

Table 2. Ambiguities that influence the Phonetic Analysis module

Ambiguity      Different word forms (%)
A= Nc V=       0.876%
A= Np V=       0.009%
A= V=          2.957%
Cc Nc          0.001%
I R= V=        0.001%
Mc Mo          0.005%
Mc Mo Nc       0.001%
Mo Nc          0.001%
Mo V=          0.005%
Nc Np V=       0.051%
Nc Pd Pp Td    0.003%
Nc R= V=       0.007%
Nc V=          3.936%
Np Xf          0.023%
R= V=          0.013%
S= V=          0.017%

Table 3. Evaluated taggers

Identification   Description                                        Approach
A                Markov models tagger integrated in the Festival    Probabilistic
                 speech synthesis system [3]
B                Transformation-based tagger, developed by [5]      Symbolic learning/Rule-based

Table 4. Overall success rates

System   Success rate
A        92.05%
B        95.17%
C        94.23%
4 Experimental Results
To analyze the performance of the developed system, two other taggers were adapted for European Portuguese (Table 3) and a comparative evaluation was made. The following tables present the success rates achieved by the taggers; the system presented in Sect. 2 is identified by the letter C. Table 4 shows the overall success rates, and Table 5 discriminates the success rates for morphossyntactic descriptions (MSD) that comprehend content words. The best overall success rate was achieved by the transformation-based tagger (B).
Table 5. Success rates achieved in identifying content words

MSD           A        B        C
Proper noun   76.84%   93.69%   89.19%
Common noun   94.73%   95.24%   97.07%
Verb          90.38%   96.11%   96.93%
Adjective     89.11%   86.99%   85.23%
Adverb        93.12%   96.52%   95.06%

Table 6. Error rates obtained for the ambiguities presented in Table 2

Ambiguity     A        B        C
A= Nc V=      9.96%    13.03%   10.34%
A= Np V=      0.00%    0.00%    0.00%
A= V=         14.37%   11.00%   10.70%
Cc Nc         0.19%    0.02%    0.10%
I R= V=       18.03%   4.92%    13.11%
Mc Mo         1.35%    0.00%    1.35%
Mc Mo Nc      0.40%    0.08%    0.40%
Mo Nc         0.05%    0.05%    0.14%
Mo V=         1.50%    0.00%    2.40%
Nc Np V=      6.86%    1.96%    9.80%
Nc Pd Pp Td   4.53%    2.47%    6.96%
Nc R= V=      18.18%   1.82%    7.27%
Nc V=         4.85%    3.24%    2.82%
Np Xf         0.00%    0.00%    0.00%
R= V=         0.48%    0.00%    0.00%
S= V=         0.79%    0.32%    0.16%
Concerning the identification of content words, the differences for proper nouns are not really significant, since adding new entries to the lexicon will improve this rate. The lower rate obtained for adjectives may be explained by the relatively large percentage of adjective/verb ambiguity in the past participle. In order to stress the influence of the taggers on the performance of the TTS system, the values presented are error rates. Table 6 further discriminates these error rates in terms of the different kinds of ambiguity relevant for homograph disambiguation. Concerning the influence of part-of-speech tagging on prosodic processing, we conducted several preliminary studies in the context of the different phrasing methods evaluated in [13]. Our first experiment consisted of computing the percentage of errors in content/function word classification, to which the phrasing algorithms are most sensitive. System A made 1.18% errors, the developed system (C) had an error rate of 0.64%, and the best result was obtained by system B. Our second experiment consisted of verb classification, since it is relevant for correctly assigning the pitch contour. The best result was achieved by system C, which failed to identify a verb in 3.07% of the cases, whereas the system with the best overall success rate (B) had an error rate of 3.89%.
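For illustration, the content/function decision underlying the first experiment can be sketched as a simple check on the tags of Table 1. The tag set below is our reading of the open-class categories listed in Sect. 1, not the authors' actual code.

# Sketch: content vs. function word, keyed on PAROLE-style tags (Table 1).
# Our reading of "all and only the content words" [7]; negatives and some
# quantifiers would also need to be whitelisted in a fuller version.
CONTENT_TAGS = {"Np", "Nc", "V=", "A=", "R="}  # nouns, verbs, adjectives, adverbs

def is_content_word(tag):
    return tag in CONTENT_TAGS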
5 Conclusions
This paper reported the work done in the development of a morphossyntactic tagging system for the Portuguese language, an area where scarce resources still demand new contributions ([4,12]). The developed system was compared with other taggers implementing different approaches to this problem, and the results were positive (an analysis of some of the available systems for Portuguese can be found in [11]). This study also allowed us to understand which disambiguation errors influence the performance of the TTS system and which are the most relevant ambiguity classes.
References

1. J. Allen. Natural Language Understanding. The Benjamin/Cummings Publishing Company, Inc, 1995.
2. F. Bacelar, J. Bettencourt, P. Marrafa, R. Ribeiro, R. Veloso, and L. Wittmann. LE-PAROLE – Do corpus à modelização da informação lexical num sistema multifunção. In Actas do XIII Encontro da APL, Portugal, 1997.
3. A.W. Black, P. Taylor, and R. Caley. The Festival Speech Synthesis System. University of Edinburgh, 1999.
4. A.H. Branco and J.R. Silva. EtiFac: A facilitating tool for manual tagging. In Actas do XVII Encontro Anual da APL, pages 1427–1431, Lisboa, Portugal, 2002. APL e Colibri.
5. E. Brill. Transformation-based error-driven learning and natural language processing. Computational Linguistics, 21(4), 1995.
6. Caroline Hagège. Personal communication, 2001.
7. X. Huang, A. Acero, and H. Hon. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall, 2001.
8. É. Laporte. Tratamento das Línguas por Computador, chapter Resolução de ambiguidades. Caminho, 2001.
9. José Carlos Medeiros. Processamento morfológico e correcção ortográfica do português. Master's thesis, Instituto Superior Técnico, Portugal, 1995.
10. C. Oravecz and P. Dienes. Efficient stochastic part-of-speech tagging for Hungarian. In Proc. of the Third LREC, pages 710–717, Las Palmas, Spain, 2002. ELRA.
11. Ricardo Ribeiro. Anotação morfossintáctica desambiguada do português. Master's thesis, Instituto Superior Técnico, Portugal, 2003.
12. T.B. Sardinha. Compilação e anotação de um corpus de português de linguagem profissional. The Especialist, 21(1):111–147, 2000.
13. M.C. Viana, L.C. Oliveira, and A.I. Mata. Prosodic phrasing: human and machine evaluation. In Proc. of the 4th ISCA Workshop on Speech Synthesis, Scotland, 2001.
Timber! Issues in Treebank Building and Use

Diana Santos

Linguateca, SINTEF Telecom & Informatics, Pb 1124 Blindern, 0314 Oslo, Norway
[email protected]
Abstract. We discuss several treebank conceptions in the literature and show that their requirements may be incompatible, and then describe the options taken in the construction of a Portuguese treebank as regards human vs. automatic intervention. Use cases are then listed in connection with a Web search tool (Águia), whose philosophy and implementation are presented.
1 Introduction
Treebank building has become fashionable lately, with the number of treebank projects growing exponentially. However, there are quite different ways to conceive both the end result and the way to go about achieving it. As far as treebank purpose is concerned, one can identify at least the following different views (an example of each is provided, with no claim to exhaustiveness):

1. a treebank is a resource for the building of automatic processing tools [1]
2. a treebank is an evaluation resource to compare the performance of different parsers [2]
3. a treebank is a linguistic resource to fix and display the syntactic analysis of complex text (and can consequently be used for teaching purposes) [3]
4. a treebank is a proof of the qualities of a given theory¹

Even though most papers on treebanks so far declare that they expect the treebank to be used for (almost) all these purposes, a closer analysis shows that the requirements to achieve these different goals are incompatible or, at least, difficult to harmonize. For example, if one wants to train computer programs on the treebank, one had better only revise and clean information for which there is some understanding of how it can be programmed or achieved. In other words, information added by a human drawn from sources such as world knowledge or cognitive processing difficulties, as well as the result of complex inferences based on a distant context, is not, in general, reproducible automatically and is therefore of no interest for goal 1.
¹ This is rarely stated but it often constitutes an additional motive to engage in treebank building.
In fact, desirable features for a treebank of type 1 are: consistency, few information pieces, and enough occurrences of each feature (so that systems have enough examples from which to learn). On the other hand, if one wants to create a gold standard for ensuing evaluation endeavours, it is possible that one chooses not to annotate, or not to decide, in cases where consensus was not reached. The result may not be consistent or complete, but it is empirically adequate. If one wants to use a treebank for linguistic investigation, one would value most of all the information that only linguists could add, and actually almost "despise" the sort of low-level information that satisfaction of goal 1 would require (like correct morphological information). Consistency would be a platonic goal, but naturalness of the annotation and relevance to linguistic concerns would be features of such a treebank of type 3. Finally, a treebank of type 4 should maximize diversity (although keeping consistency) in order to prove the expressiveness of the theory, and would therefore again fail to be useful for goal 1. Our treebank project, Floresta Sintá(c)tica [4], aimed (eventually) at building a type 3 treebank, given that we had an underlying symbolic parser which provided a lot of information, and it was unrealistic to expect that a parser could be trained to learn it all. Reducing it would be a bold decision, which was not taken.
2 Annotation Schemes
Wilson et al. [5] describe a set of desiderata for an annotation scheme, where they emphasize that it should reflect distinctions a human could be expected to reliably annotate ("naturalness"). It is easy to find huge numbers of information tags that are not easy to annotate reliably (even though they may be used liberally by parsers); it is also the case that many of the categories that are easy for humans to annotate are, so far, never even attempted automatically.

2.1 Can Our Treebank Type 3 Be Turned into an Evaluation Treebank (Type 2)?

How can one create a treebank that allows one to actually evaluate different parsers without forcing the linguistic view of the present treebank upon them? Although we, as creators, might wish that it took the same role as the Penn Treebank [6] for English, used as a de facto standard, we are fully convinced of the need and advantage of cooperatively agreeing on a standard. We believe that the present treebank can be used for experimentation and evaluation, and to make problems and disagreements explicit, but that one should try to build from scratch (or from a much stricter set of rules, using the present treebank as a point of departure) a real evaluation resource that allows one to test given aspects of syntactic parsers for Portuguese, probably following Gaizauskas et al.'s proposal [7] for creating evaluation resources quickly, and using some manual analysis as in [8].
We are, in any case, convinced that it is totally unrealistic to expect that one can list parsers' outputs and try to harmonize or agree on the meaning of the different labels. This was already an enormous task for a field as (comparatively) simple as Portuguese morphological analysis, for which an unexpectedly high degree of disagreement has been reported [9,10]. It is also enough to browse several different Portuguese grammar books to see that they deal with different subjects. Incidentally, it is also quite rare that they define their primitives.

2.1.1 Decisions as to the Process

Let us give a concrete example of one of the many things that are far from trivial: the underlying parser – thoroughly described in [11] – assigns the two following syntactic categories to noun phrases attached to noun phrases: N< and APP.
It has proved, no matter the many heuristics or rules of thumb proposed², an extremely difficult decision to make in practice, when one leaves the idealized landscape and comes to real utterances. Time and again there was uncertainty about which classification to assign. Examples are:

No final do jogo, adeptos do Sporting lançam garras e pedras para a tribuna de honra, onde estavam Manuela Ferreira Leite, ministra da Educação, e Vítor Vasques, presidente da FPF.

Na mesma zona em que foi encontrado o templo, a Alcáçova, a caminho das Portas do Sol, foram ainda descobertas cisternas romanas que estão também a ser objecto de escavações e estudos arqueológicos.

Several solutions for how to proceed concerning the assignment of these labels have been proposed, each of them showing, in fact, different conceptions of what a treebank should be for:

1. mark/revise the clear cases and leave the parser's output when there is no clear opinion
2. create a new non-committal label (let us call it here npstack) and
   a. transform all cases of either label into it, or
   b. use it only for the unclear cases

Even though no final decision was (so far) taken, this micro-controversy allows us to illustrate the consequences of each option with respect to the treebank goals mentioned in the beginning of the present paper. The first option was aimed at improving the parser, so that it agreed with human reasoning when humans had something to say. The result would probably not be consistent, and definitely not reflect human performance, but was obviously ideal for parser improvement.
² Such as: when an abbreviation follows what it is an abbreviation for, tag it APP: Partido da Terra (PT); APP implies an identity relation, while N< does not.
The second one was, on the contrary, aimed at describing human interpretation (and not a parser's). Option a) had the goal of making the task of building (and consequently revising) the treebank simpler, implicitly taking the view that this is probably not a human task – when we see two NPs following each other, it is not relevant to understand whether the second is APP or N<.
³ In the present discussion, we are assuming a dependential framework where features are assigned to words (and functional roles are assigned to head words). The need for upwards and downwards marking remains in a more populated phrase structure formalism; we would just have to say "the clause headed by que" or "the phrase headed by pele".
how language works and find out what cannot be predicted from the lexicon, as in surpresa above.
3 Águia
Let us present a Web query tool that has been designed with two considerations in mind:

1. to furnish a higher-level query language (in the sense of being as separate as possible from the encoding realities and the actual treebank syntax);
2. to be based on a powerful general-purpose corpus system (the IMS CWB) instead of writing a particular treebank-specific query system from scratch.

This tool is available on the Web (http://www.linguateca.pt/Floresta/) together with a guided tour that tries to give a feeling for the sort of possible queries – as high-level as possible. Águia's most radical (or unusual) feature is that its output is simple text, although the whole treebank is publicly available in its two internal coding formats, and therefore users can, if they want, see and use the tree structure at will. The basis for this feature is that we believe that a treebank user is not (or should not be) primarily concerned with trees, but with the information conveyed by these trees, in order to get at text, to get at language (which comes in the format of words in the written medium). In addition, we are not yet sure about which are the most interesting questions users really want to ask a treebank. Therefore, we also implemented an open window where people can input questions in natural language, and we help them to formulate their questions, with the proviso that they are answerable by the actual treebank.

3.1 Kinds of Queries
We can distinguish the following kinds of primary uses for a query tool for people (not for programs):

The user wants quantitative information about the treebank, such as: What kinds of clauses are most frequent? What kinds of syntactic objects (phrases) have the function "question", and in what relative weight? What is the most frequent verb in each kind of clause? What is the most common function of a finite clause? In how many cases do adverbs occur in relative clauses?

The user wants to inspect some combinations or categories a little better – because s/he suspects they are wrongly assigned, or because they contradict her/his own beliefs about the language. Some (random) examples: How often can cross-categorial conjunctions be found? Are there subject complements with relative clauses?

The user may simply want to look for specific examples of special cases, related to his or her field of interest: Find clauses including an adverb as immediate constituent; find noun phrases including relative clauses in which the pronoun has the subject (or object, or dative) role; find finite clauses starting with the verb, etc.

The user may also want to look at the underlying generative grammar, according to the examples attested in the treebank: What is the generative grammar of a noun phrase? What is the generative grammar of a particular function?
Or the user may be more interested in the lexicon, and want to determine the grammatical properties of a lexical item: What is the valency grammar of a particular lexical item (verb, preposition)? Given a particular class of adverbs, in which patterns do they occur? When a given lexical item occurs as premodifier of a phrase, which functions does this phrase typically show?

Above, we showed a variety of different questions, each of which could be answered by a single query with Águia. There is obviously no limit to the complexity of the interaction an experienced user may have with the treebank! We list here other questions that require more than one query but should not be too complicated to answer: What is the deepest embedding? (Find finite clauses under finite clauses.) How many prepositional phrases are not directly attached to the preceding phrase? How many noun phrases exhibit a potential attachment ambiguity?

Still other metalinguistic questions, at the moment not catered for by Águia, but encoded in the treebank, can be answered: Which sentences were considered ambiguous in the treebank? Which utterances required world knowledge for disambiguation? (See examples in [14].) Which clauses involve ellipsis, or required insertion of additional material in order to be parsed and represented by the human team?

3.2 Use of IMS CWB
The use of the underlying IMS CWB [15–17] is an obviously sound engineering decision, since it offers a well-developed and tested set of capabilities, a powerful query language, and several utilities. In addition, we believe that there should be, at least from a user point of view, a smooth transition between POS-tagged and annotated corpora, and the fact that the codification of the latter may pose complex problems to the language engineer should be transparent to the user. The way we used the IMS CWB was straightforward but somewhat imaginative: we created several different physical corpora from the manually edited output, which code the treebank in different ways. Depending on the query, the right corpus is used. This is, however, perfectly transparent for the user, who can only distinguish between the manually revised part (Bosque, the treebank proper) and the larger automatically produced part (Floresta Virgem, "the treebank to be"). For example, we present an extract of one of the corpora in Fig. 1, having words as terminals and phrases as structural attributes, and therefore appropriate for looking for words inside phrases, while the corpus of Fig. 2 has phrases as terminals and words as attributes. While it is outside the scope of the present paper to dwell on technicalities, this small section should be read as a plea for using already existing powerful tools for dealing with large amounts of linguistically analysed text, instead of reinventing the wheel and creating new treebank search tools from scratch, as was e.g. done in the TIGER project [18]. We conclude the present paper by asking everyone interested in Portuguese syntax to look at Floresta Sintá(c)tica and try out Águia on the questions they are most interested in, so that we can have a representative idea of the shortcomings and the main user needs, and may be able to develop a tool that can be generally used, also
later on, for different treebanks for Portuguese (and even other languages, if the concept turns out to be pertinent).
Fig. 1. One of the views of the treebank encoded in the IMS-CWB

vp P 'v-fin v-pcp ' 'AUX MV ' "for firmado " 2
fcl ADVL 'conj-s vp2 ' 'SUB P ' "Se for firmado " 3
acl KOMP< 'conj-s pron-pers ' 'COM SUBJ ' "do_que nós " 2
ap SC 'adv adj acl2 ' '>A H KOMP<' "mais contente do_que nós" 4
fcl STA 'fcl1 pron-indp v-fin ap1 ' 'ADVL SUBJ P SC ' "“ Se for firmado , ninguém ficará mais contente do_que nós . " 12
Fig. 2. Another view of the treebank encoded in the IMS-CWB
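To give a feeling for how such encodings are queried, the following is a hypothetical CQP query against a phrase-terminal corpus like the one in Fig. 2. The attribute names form (phrase category) and func (syntactic function) are our assumptions about the indexing, not necessarily the names used internally by Águia.

# Hypothetical CQP query: find finite clauses (fcl) functioning as statements (STA)
[form = "fcl" & func = "STA"];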
References

1. Marcus, Mitchell, Kim, Grace, Marcinkiewicz, Mary Ann, MacIntyre, Robert, Bies, Ann, Ferguson, Mark, Katz, Karen, Schasberger, Britta: The Penn treebank: Annotating predicate argument structure. In: Proceedings of the 1994 Human Language Technology Workshop (ARPA) (1994) 110–115
2. Xia, Fei, Palmer, Martha, Xue, Nianwen, Okurowski, Mary Ellen, Kovarik, John, Chiou, Fu-dong, Huang, Shizhe, Kroch, Tony, Marcus, Mitch: Developing Guidelines and Ensuring Consistency for Chinese Text Annotation. In: Gavriladou, M. et al. (eds.): Proceedings of LREC 2000 (2000) 3–10
3. Skut, Wojciech, Brants, Thorsten, Krenn, Brigitte, Uszkoreit, Hans: A Linguistically Interpreted Corpus of German Newspaper Text. In: Rubio, A. et al. (eds.): Proceedings of LREC 1998 (1998) 705–711
4. Afonso, Susana, Bick, Eckhard, Haber, Renato, Santos, Diana: "Floresta sintá(c)tica": a treebank for Portuguese. In: Rodríguez, M.G., Araujo, C.P.S. (eds.): Proceedings of LREC 2002 (2002) 1698–1703
5. Wilson, G., Mani, I., Sundheim, B., Ferro, L.: A multilingual approach to annotating and extracting temporal information. In: Proceedings of the Workshop for Temporal and Spatial Information Processing (Toulouse, July 7th 2001) (2001) 81–87
6. Marcus, Mitchell P., Santorini, Beatrice, Marcinkiewicz, Mary Ann: Building a large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19 (1993) 313–330
7. Gaizauskas, R., Hepple, M., Huyck, C.: Modifying Existing Annotated Corpora for General Comparative Evaluation of Parsing. In: Workshop on Evaluation of Parsing Systems, at LREC'98 (1998)
8. Carroll, John, Minnen, Guido, Briscoe, Ted: Corpus annotation for Parser Evaluation. In: Uszkoreit, H. et al. (eds.): Proceedings of LINC-99: Linguistically Interpreted Corpora, EACL (Bergen, 12 June 1999) (1999) 35–41
9. Santos, Diana, Rocha, Paulo: AvalON: uma iniciativa de avaliação conjunta para o português. In: Actas do XVIII Encontro da Associação Portuguesa de Linguística (Porto, 2-4 de Outubro de 2002) (2003)
10. Santos, Diana, Costa, Luís, Rocha, Paulo: Cooperatively evaluating Portuguese morphology. In: this volume (2003)
11. Bick, Eckhard: The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Aarhus University Press (2000)
12. Santos, Diana, Gasperin, Caroline: Evaluation of parsed corpora: experiments in user-transparent and user-visible evaluation. In: Rodríguez, M.G., Araujo, C.P.S. (eds.): Proceedings of LREC 2002 (2002) 597–604
13. Afonso, Susana: Clara e sucintamente: Um estudo em corpus sobre a coordenação de advérbios em -mente. In: Actas do XVIII Encontro da Associação Portuguesa de Linguística (Porto, 2-4 de Outubro de 2002) (2003)
14. Afonso, Susana, Bick, Eckhard, Haber, Renato, Santos, Diana: Floresta sintá(c)tica: um treebank para o português. In: Gonçalves, Anabela, Correia, Clara Nunes (eds.): Actas do XVII Encontro da Associação Portuguesa de Linguística (Lisboa, 2–4 de Outubro de 2001) (2002) 533–545
15. Christ, Oliver: A modular and flexible architecture for an integrated corpus query system. In: Proceedings of COMPLEX'94: 3rd Conference on Computational Lexicography and Text Research (1994) 23–32
16. Evert, Stefan: CQP Query Language Tutorial. IMS Stuttgart, 13 Oct 2001
17. Evert, Stefan, Kermes, Hannah: Annotation, storage, and retrieval of mildly recursive structures. In: Proceedings of the Workshop on Shallow Processing of Large Corpora (SProLaC 2003) (2003)
18. König, Esther, Lezius, Wolfgang: A description language for syntactically annotated corpora. In: Proceedings of COLING 2000 (2000) 1056–1060
A Lexicon-Based Stemming Procedure

Gilberto Silva¹ and Claudia Oliveira²

¹ Datasus – Centro de Tecnologia da Informação do Ministério da Saúde, Rua México, 128, 7º andar, Rio de Janeiro, Brazil
[email protected]
² Departamento de Engenharia de Computação, Instituto Militar de Engenharia, Praça General Tibúrcio, 80, Rio de Janeiro, Brazil
[email protected]
Abstract. This paper describes a stemming technique that depends principally on a target language’s lexicon, organised as an automaton of word strings. The clear distinction between the lexicon and the procedure itself allows the stemmer to be customised for any language with little or even no changes to the program’s source code.
1 Introduction
One of the main functionalities of a Text Retrieval System (TRS) should be its ability to answer queries, possibly formulated by means of keywords, about the word content of documents in a collection of text documents. Exact word-by-word matching between the keywords and the text contents often excessively restricts the set of retrieved documents, which is the main reason for using a measure of similarity between words rather than strict equality. In linguistics, a stem is a form that unifies the elements in a set of morphologically similar words [3]; stemming is therefore the operation that determines the stem of a given word. A TRS equipped with a stemmer extracts stems, not words, from the indexed text documents as well as from the queries; results are based on stem comparisons. The main objective of this work is to describe a stemming technique that depends principally on an external target-language lexicon, in contrast to procedures that embody a morphological theory of a specific language. The paper is organised as follows: Sect. 2 presents a brief review of the mainstream stemming methods; in Sect. 3 we detail the proposed stemming procedure, describe the required organisation of the target language's lexicon as an automaton of word strings, and describe an experiment with a prototype Portuguese lexicon; and in Sect. 4 we draw some final remarks.
2 Stemming Methods
The simplest and most obvious of the stemming methods consists of storing and searching for word–stem pairs in a table. The essential feature of the table look-up
method is the data structure, which must be extremely efficient, such as a B-tree, a hash table, or an acyclic finite automaton. Its main drawback is the fact that the lexicons of natural languages are open sets. Even if all the actual words of the lexicon could be stored at a given moment, which seems to be a very unfeasible task, the table would soon become obsolete, given the dynamics of a real lexicon. The most widely used of the stemming techniques are affix stripping procedures, following a model introduced by Lovins [7] known as the iterative longest-match stemmers. The basic idea is that word endings, which are considered to be affixes in the target language, are iteratively substituted by other affixes according to predetermined rules, until the resulting word form does not contain a recognised affix. Following Lovins, other iterative longest-match stemmers were proposed in [12,4,9,8]. The most widely used of these is Porter's stemmer which, possibly due to its simplicity and good performance, has become a standard in TRS systems worldwide, notwithstanding its original English specificity. The stemming method proposed by Hafer and Weiss [6], the successor variety method, takes into account the number of possible letters that could follow a given prefix substring of a word, in the context of a given corpus. According to the method, a word string is analysed from left to right: the successor variety decreases as the size of the prefix increases, until it reaches the form of a word or word root, at which point the successor variety increases steeply. This prefix is considered to be the word stem. Four approaches are proposed for choosing the final stem from a set of possibilities: the cutoff method, the peak and plateau method, the complete word method, and the entropy method. For further details the reader should refer to [6] or [5]. Adamson and Boreham [1] present a method called the shared digram method. A digram is a substring of size two in a string, which can be generalised to an n-gram of arbitrary size. Strictly speaking, this does not constitute a stemming method, since the result is not a stem; nevertheless, its purpose is the evaluation of the degree of similarity between words, and therefore the method is presented here. The main idea is that the measure of similarity between two words is calculated as a function of the number of distinct n-grams that they have in common. For example, considering the words statistics and statistical, we observe that the former has 7 distinct digrams and the latter has 8. They share 6 digrams: at, ic, is, st, ta, ti. From this analysis, the similarity S_ij between two words is given by Dice's coefficient, defined as S_ij = 2·D_ij / (D_i + D_j), where D_i and D_j are the numbers of distinct digrams in each word and D_ij is the number of distinct digrams they share. In the example, the similarity between statistics and statistical is given by (2 × 6) ÷ (7 + 8) = 0.8, or 80%.
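A minimal sketch of this computation (our own illustration, not code from [1]):

# Dice coefficient over distinct character n-grams (digrams by default).
def dice_similarity(w1, w2, n=2):
    grams = lambda w: {w[i:i + n] for i in range(len(w) - n + 1)}
    d1, d2 = grams(w1), grams(w2)
    # 2 * |shared| / (|d1| + |d2|)
    return 2 * len(d1 & d2) / (len(d1) + len(d2))

print(dice_similarity("statistics", "statistical"))  # -> 0.8, as in the text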
3 The Proposed Lexicon-Based Stemming Procedure
The stemming method we propose in this work combines two of the approaches presented in Sect. 2: affix stripping and table look-up. The affix stripping rules, as well as the exceptions to these rules, are uniformly stored in a table. This table is effectively the representation of a lexicon, stored in a minimised deterministic finite automaton, which guarantees two essential requirements of the stemming algorithm. Firstly, the memory space used is manageable and, even in hardware systems of modest
capabilities, the lexicon can be kept in RAM, avoiding disk access. Secondly, access time is a linear function of the length of the stored string. Our priority is the separation of the computational procedure, which is a generic, language-independent stemmer, from the specific lexicon representations.
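As a rough sketch of such a structure, the following uses a plain character trie rather than the minimised automaton actually employed (a minimised automaton would additionally share common suffixes); the entries are invented for illustration.

# Sketch: lexicon as a character trie; lookup cost is linear in word length.
class TrieLexicon:
    def __init__(self):
        self.root = {}

    def add(self, word, stem):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = stem  # "$" marks end-of-word and stores the entry payload

    def stem(self, word):
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return None  # word not in the lexicon
        return node.get("$")

lex = TrieLexicon()
lex.add("gatos", "gato")  # invented entry for illustration
print(lex.stem("gatos"))  # -> "gato"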
3.1 The Structure of the Lexicon
Summarising the views of Aronoff and Anshen [2], morphology is the set of word-formation processes that determines the potential complex words of a given language. On the other hand, the lexicon is the inventory of existing words of a language. Therefore, morphology and the lexicon are interdependent and complementary with respect to their function of providing words with which the speaker may construct utterances. The role of stemming is to reduce a group of words to a central form, the stem, which may carry a significant portion of the words' meanings. Even though the procedure can be seen as an implementation of some language's morphology, there are other requirements that have to be met for it to be part of a TRS. According to [9], with regard to his affix stripping algorithm, "... the affixes are being removed simply to improve IR performance, and not as a linguistic exercise". Although the stemming method we propose can be seen as a combination of affix stripping and table look-up, there are two important distinctions which must be made clear. First, affix stripping algorithms implement a relatively small list of very general rules, empirically created by the authors of the algorithms. In contrast, our rules are the result of semi-automatic manipulation of word lists, which normally results in an extensive set of rules. Secondly, in affix stripping algorithms the rules are an integral part of the program. As an alternative, we chose to store the rules and the exceptions in a separate structure, where each element represents a lexical entry of the form