Spiros Sirmakessis (Ed.) Knowledge Mining
Studies in Fuzziness and Soft Computing, Volume 185

Editor-in-chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw, Poland
E-mail:
[email protected]
Spiros Sirmakessis (Ed.)
Knowledge Mining
Proceedings of the NEMIS 2004 Final Conference
Dr. Spiros Sirmakessis
Research Academic Computer Technology Institute
61 Riga Feraiou Str.
26221 Patras, Greece
Email:
[email protected]
ISSN print edition: 1434-9922
ISSN electronic edition: 1860-0808
ISBN-10 3-540-25070-0 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-25070-8 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable for prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media
springeronline.com

© Springer-Verlag Berlin Heidelberg 2005
Printed in The Netherlands

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting: by the authors and TechBooks using a Springer LaTeX macro package
Printed on acid-free paper
Preface
Text mining is an exciting application field and an area of scientific research that is currently under rapid development. It uses techniques from well-established scientific fields (e.g. data mining, machine learning, information retrieval, natural language processing, case-based reasoning, statistics and knowledge management) in an effort to help people gain insight into, understand and interpret large quantities of (usually) semi-structured and unstructured data.

Despite the advances made during the last few years, many issues remain unresolved. Proper co-ordination activities, dissemination of current trends and standardisation of procedures have been identified as key needs. Many questions are still unanswered, especially for potential users: what is the scope of Text Mining, who uses it and for what purpose, what constitutes the leading trends in the field of Text Mining, especially in relation to IT, and whether there still remain areas to be covered.

Knowledge Mining draws upon many of the key concepts of knowledge management, data mining and knowledge discovery, meta-analysis and data visualization. Within the context of scientific research, knowledge mining is principally concerned with the quantitative synthesis and visualization of research results and findings. The results of knowledge mining are increased scientific understanding along with improvements in research quality and value. Knowledge mining products can be used to highlight research opportunities, assist with the presentation of the "best" scientific evidence, facilitate research portfolio management, and facilitate policy setting and decision making.

The NEMIS project, funded by the IST framework, set out to create a network of excellence (NoE) bringing together experts in the field of Text Mining to explore the grey areas relating to the status, trends and possible future developments in the technology, practices and uses of Text Mining. The NEMIS Conference on "Knowledge Mining" maintains a balance between theoretical issues and descriptions of case studies to promote synergy between theory and practice. Topics of interest include, but are not limited to:
Document processing & visualization techniques
• Document Representation & Storage
• Metadata Production
• Document Classification/Clustering
• Content Analysis
• Visualization Techniques
Web mining
• Web Content, Structure & Usage Mining
• User Behaviour Modelling
• Machine Learning applied on the Web
• Personalized Views
• Semantic Web Mining
• Ontologies
TM & knowledge management: Theory & applications
• Customer Relationship Management
• Technology Watch
• Patent Analysis
• Statistical Analysis of Textual Data
• Comparative Analysis of TM tools
User aspects & relations to Official Statistics
• Structures & Applications for Searching and Organising Metadata
• Discovery of Updates in Statistical Databases and Publishing Systems
• Tools and Applications for Tracing and Enumerating Official Statistics in Electronic Mass Media

I would like to express my appreciation to all authors of submitted papers, to the members of the program committee and to all the people who have worked for this event. This conference could not have been held without the outstanding efforts of Eleni Rigou at the Conference Secretariat. Finally, recognition and acknowledgement are due to all members of the Internet and Multimedia Research Unit at the Research Academic Computer Technology Institute.

May 2005
Dr Spiros Sirmakessis Assistant Professor
Contents
Knowledge Mining: A Quantitative Synthesis of Research Results and Findings
Penelope Markellou, Maria Rigou, and Spiros Sirmakessis . . . 1

An Evidential Approach to Classification Combination for Text Categorisation
D.A. Bell, J.W. Guan, and Y.X. Bi . . . 13

Visualization Techniques for Non Symmetrical Relations
Simona Balbi and Michelangelo Misuraca . . . 23

Understanding Text Mining: A Pragmatic Approach
Sergio Bolasco, Alessio Canzonetti, Federico M. Capo, Francesca della Ratta-Rinaldi, and Bhupesh K. Singh . . . 31

Novel Approaches to Unsupervised Clustering Through k-Windows Algorithm
D.K. Tasoulis and M.N. Vrahatis . . . 51

Semiometric Approach, Qualitative Research and Text Mining Techniques for Modelling the Material Culture of Happiness
Furio Camillo, Melissa Tosi, and Tiziana Traldi . . . 79

Semantic Distances for Sets of Senses and Applications in Word Sense Disambiguation
Dimitrios Mavroeidis, George Tsatsaronis, and Michalis Vazirgiannis . . . 93

A Strategic Roadmap for Text Mining
Georgia Panagopoulou . . . 109

Text Mining Applied to Multilingual Corpora
Federico Neri and Remo Raffaelli . . . 123
Content Annotation for the Semantic Web
Thierry Poibeau . . . 133

An Open Platform for Collecting Domain Specific Web Pages and Extracting Information from Them
Vangelis Karkaletsis and Constantine D. Spyropoulos . . . 147

Extraction of the Useful Words from a Decisional Corpus. Contribution of Correspondence Analysis
Mónica Bécue-Bertaut, Martin Rajman, Ludovic Lebart, and Eric Gaussier . . . 159

Collective SME Approach to Technology Watch and Competitive Intelligence: The Role of Intermediate Centers
Jorge (Gorka) Izquierdo and Sergio Larreina . . . 181

New Challenges and Roles of Metadata in Text/Data Mining in Statistics
Dušan Šoltés . . . 191

Using Text Mining in Official Statistics
Alf Fyhrlund, Bert Fridlund, and Bo Sundgren . . . 201

Combining Text Mining and Information Retrieval Techniques for Enhanced Access to Statistical Data on the Web: A Preliminary Report
Martin Rajman and Martin Vesely . . . 213

Comparative Study of Text Mining Tools
Antoine Spinakis and Asanoula Chatzimakri . . . 223

Some Industrial Applications of Text Mining
Bernd Drewes . . . 233

Using Text Mining Tools for Event Data Analysis
Theoni Stathopoulou . . . 239

Terminology Extraction: An Analysis of Linguistic and Statistical Approaches
Maria Teresa Pazienza, Marco Pennacchiotti, and Fabio Massimo Zanzotto . . . 255

Analysis of Biotechnology Patents
Antoine Spinakis and Asanoula Chatzimakri . . . 281
Knowledge Mining: A Quantitative Synthesis of Research Results and Findings

Penelope Markellou, Maria Rigou, and Spiros Sirmakessis

Research Academic Computer Technology Institute, 61 Riga Feraiou Str., 26221 Patras, Greece
{markel, rigou, syrma}@cti.gr
http://www.ru5.cti.gr

Abstract. Knowledge mining has emerged as a rapidly growing interdisciplinary field that merges databases, statistics, machine learning and related areas in order to extract valuable information and knowledge from large volumes of data. In this paper we present the key findings of the results achieved during the NEMIS Conference on "Knowledge Mining".
1 Introduction

Knowledge Discovery merges together databases, statistics, machine learning and related areas in order to discover information and knowledge in large volumes of data. In the past two decades, organizations of all kinds have collected huge amounts of data in their databases. These organizations need to understand their data and/or to discover useful knowledge as patterns and/or models from their data [1].

In general, data can be seen as a string of bits, or numbers and symbols, or "objects". We use bits to measure information, and see it as data stripped of redundancy and reduced to the minimum necessary to make the binary decisions that essentially characterize the data (interpreted data). We can see knowledge as integrated information, including facts and their relations, which have been perceived, discovered, or learned as our "mental pictures". In other words, knowledge can be considered data at a high level of abstraction and generalization.

There is a difference in how people from the different areas contributing to this new field understand the terms "knowledge discovery" and "data mining". During this conference we examined the issues of knowledge discovery in different applications. Knowledge discovery in databases is the process of identifying valid, novel, potentially useful, and ultimately understandable patterns/models in data [1]. Data mining is a step in the knowledge discovery process consisting of particular data mining algorithms that, under some acceptable computational efficiency limitations, find patterns or models in
data. In other words, the goal of knowledge discovery and data mining is to find interesting patterns and/or models that exist in databases, but are hidden among the volumes of data. The process of knowledge discovery inherently consists of several steps as shown in Fig. 1 [1].
Fig. 1. The Knowledge Discovery Process (Identify & Define Problem → Obtain & Preprocess Data → Data Mining: Extract Knowledge → Interpret & Evaluate Results → Use Discovered Knowledge)
The first step is to understand the application domain and to formulate the problem. This step is clearly a prerequisite for extracting useful knowledge and for choosing appropriate data mining methods in the third step according to the application target and the nature of the data. The second step is to collect and preprocess the data, including the selection of the data sources, the removal of noise or outliers, the treatment of missing data, and the transformation (discretization if necessary) and reduction of data. This step usually takes most of the time needed for the whole KDD process. The third step is data mining, which extracts patterns and/or models hidden in the data. A model can be viewed as "a global representation of a structure that summarizes the systematic component underlying the data or that describes how the data may have arisen". In contrast, "a pattern is a local structure, perhaps relating to just a handful of variables and a few cases". The fourth step is to interpret the discovered knowledge, especially in terms of description and prediction, the two primary goals of discovery systems in practice. Experiments show that discovered patterns or models are not always of interest or of direct use, and the knowledge discovery process is necessarily iterative, with repeated judgment of the discovered knowledge. The final step is to put the discovered knowledge to practical use. In some cases, one can use discovered knowledge without embedding it in a computer system. Otherwise, the user may expect that discovered knowledge can be
put on computers and exploited by some programs. Putting the results into practical use is certainly the ultimate goal of knowledge discovery.

This paper is organised as follows: Sect. 2 presents a synthesis of the research results presented during the conference, Sect. 3 covers applications of Knowledge Mining, and case studies are the theme of Sect. 4.
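Before turning to that synthesis, the following minimal sketch illustrates the five-step KDD process described above on a toy dataset. It uses scikit-learn purely for convenience; the dataset, model choice and parameter values are illustrative assumptions rather than part of any system discussed at the conference.

```python
# A minimal, illustrative walk through the five KDD steps (a sketch, not a
# reference implementation).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Step 1: identify and define the problem (here: classify iris species).
X, y = load_iris(return_X_y=True)

# Step 2: obtain and preprocess the data (split, scale, handle noise, ...).
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Step 3: data mining - extract a model (a small decision tree) from the data.
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Step 4: interpret and evaluate the discovered knowledge.
print(classification_report(y_test, model.predict(X_test)))

# Step 5: put the discovered knowledge to practical use, e.g. embed the
# fitted model in an application that classifies new cases.
new_case = scaler.transform([[5.0, 3.5, 1.5, 0.2]])
print("predicted class:", model.predict(new_case)[0])
```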
2 Knowledge Mining: A Synthesis of Research Results

Despite its theoretical backbone, the conference focused on applications used for knowledge discovery. During the one-day conference several new approaches were presented, covering different areas of the knowledge mining sector.

David Bell presented in [2] an evidential approach to classification combination for text categorisation. The authors reviewed well-defined classification methods such as the Support Vector Machine [3], kNN (nearest neighbours) [4], the kNN model-based approach (kNNM) [5], and Rocchio methods [6]. In the method described in their work, they combine these classifiers. A previous study suggested that the combination of the best and the second best classifiers using evidential operations [7] can achieve better performance than other combinations. They assess some aspects of this from an evidential reasoning perspective and suggest a refinement of the approach. This is an extension of a novel method and technique for representing outputs from different classifiers (a focal element triplet) to a focal element quartet, and it uses an evidential method for combining multiple classifiers based on this new structure. The structure, and the associated methods and techniques developed in this research, are particularly useful for data analysis and decision making under uncertainty. They can be applied in many decision making exercises in which knowledge and information are insufficient and incomplete.

Many strategies of Text Retrieval are based on Latent Semantic Indexing and its variations, considering different weighting systems for words and documents. Correspondence Analysis and LSI [8] share the same basic algebraic tool, i.e. the Singular Value Decomposition and its generalisation, related to the use of a different way of measuring the importance of each element, both in determining and in representing similarities between documents and words. The aim of the paper was to propose a peculiar factorial approach for better visualizing the relations between textual data and documents, compared with classical Correspondence Analysis. Simona Balbi and Michelangelo Misuraca in [9] consider a term frequency/document frequency index scheme, mainly developed for Text Retrieval, in a textual data analysis context. An application on the Italian Le Monde Diplomatique corpus (about 2000 articles published from 1998 to 2003 in the Italian edition of LMD) shows the effectiveness of their approach. Further developments can be achieved by introducing a proper weighted Euclidean metric in the sub-space spanned by documents, for visualizing word associations. Moreover, in order to better
understand the relations between the documents and the language used, the development of more powerful graphical tools for textual data analysis needs to be studied, in the frame of Visual Text Mining.

The state of the art of the main TM applications was the theme of Bolasco et al. [10]. In order to accomplish this task, a two-step strategy was pursued: first, some of the main European and Italian companies offering TM solutions were contacted, in order to collect information on the characteristics of the applications; second, a detailed search on the web was made to collect further information about users or developers and applications. On the basis of the material collected, a synthetic grid was built to collocate, from more than 300 cases analysed, the 100 that they considered most relevant for the typology of function and sector of activity. The joint analysis of the different case studies has given an adequate picture of TM applications according to the possible types of results that can be obtained, the main specifications of the sectors of application and the type of functions. In the end, it was possible to classify the applications by matching the level of customisation (followed in the tool development) and the level of integration (between users and developers). This matching produced four different situations: standardisation, outsourcing, internalisation, and synergism.

Tasoulis and Vrahatis in [11] present novel approaches to unsupervised clustering through the k-windows algorithm [12]. Clustering algorithms are typically employed to identify groups (clusters) of similar objects. A critical issue for any clustering algorithm is the determination of the number of clusters present in a dataset. In this contribution they presented a clustering algorithm that, in addition to partitioning the data into clusters, approximates the number of clusters during its execution. Further modifications of this algorithm for different distributed environments and dynamic databases are available in [11].

In recent years there has been an increasing interest, both from the Information Retrieval community and the Data Mining community, in investigating possible advantages of using Word Sense Disambiguation (WSD) [13]. In [14, 15] the results presented were negative, though probably because in [15] the WSD process applied did not assign a single sense to each word, but tackled all the possible senses of all the words, while in [14] semantic relations, such as the hypernym/hyponym relation, were not taken into account. In contrast, in [16, 17] a rich representation for senses was utilized that exploited the semantic relations between senses, as provided by WordNet (http://www.cogsci.princeton.edu/~wn/). Thus, there are indications that the correct usage of senses can improve accuracy in Data Mining tasks. In general a WSD process can be either supervised or unsupervised (or a combination of the two). Supervised WSD considers a pre-tagged text corpus that is used as a training set. The sense of a new keyword can then be inferred based on the hypothesis generated from the training set. In [18] a proposal can be found for two methods for calculating
the semantic distance of a set of senses in a hierarchical thesaurus and for utilizing them to perform unsupervised WSD.

A roadmap is typically a time-based plan that defines the present state, the state we want to reach, and the way to achieve it. This includes the identification of exact goals and the development of different routes for achieving them. In addition, it provides guidance to focus on the critical issues that need to be addressed in order to meet these objectives. The roadmap presented in [19] aims at preparing the ground for future Text Mining RTD activities by investigating future research challenges and defining specific targets. For the development of this roadmap a "scenario-driven approach" has been used, meaning that several scenarios for potential future applications of Text Mining have been developed. These scenarios were used to reflect emerging user needs, combine them with key technologies and provide a snapshot of the future. The produced roadmap has also shown possible ways of realising these scenarios and identified the directions for future technology evolution.

A comparative analysis of the available text mining tools was presented in [20]. The basic stages of the overall comparison process are described, together with the specified evaluation criteria. In general, the functionalities offered by each of the reviewed software packages may fully cover the needs of a given text analysis project. The comparison study does not conclude that a single piece of software suitable for all types of text mining projects should be implemented. Each system serves specific objectives and has its own identity. However, some development standards, such as textual data management or export formats, might usefully be applied. In addition, the adoption of standard terminology among the existing text mining tools could be a step of further improvement.

Are linguistic properties and behaviours important for recognizing terms? Are statistical measures effective for extracting terms? Is it possible to capture a sort of termhood with computational linguistic techniques? Or are terms perhaps too sensitive to exogenous and pragmatic factors to be confined within computational linguistics? All these questions are still open. The study presented by Maria Teresa Pazienza, Marco Pennacchiotti, and Fabio Massimo Zanzotto in [21] tries to contribute to the search for an answer, in the belief that it can be found only through a careful experimental analysis of real case studies and a study of their correlation with theoretical insights.
3 Applications for Knowledge Mining: A Synthesis of R&D Results

Up to 80% of electronic data is textual, and the most valuable information is often encoded in pages which are neither structured nor classified. Documents are, and will be, written in various native languages, but these documents are relevant even to non-native speakers. Nowadays everyone experiences mounting frustration in the attempt to find the information of interest,
wading through thousands of pieces of data. The process of accessing all these raw data, heterogeneous in the language used, and transforming them into information is therefore inextricably linked to the concepts of textual analysis and synthesis, hinging greatly on the ability to master the problems of multilinguality. Through Multilingual Text Mining, users can get an overview of great volumes of textual data through a highly readable grid, which helps them discover meaningful similarities among documents and find all related information. The work presented by SYNTHEMA [22] describes the approach used for Multilingual Text Mining, showing the classification results on around 600 breaking news items written in English, Italian and French. Terminologies and Translation Memories make it possible to overcome linguistic barriers, allowing the automatic indexing and classification of documents, whatever their language. This new approach enables the research, analysis and classification of great volumes of heterogeneous documents, helping people to cut through the information labyrinth. Multilinguality being an important part of this globalised society, Multilingual Text Mining is a major step forward in keeping pace with the relevant developments in a challenging and rapidly changing world.

Future Concept Lab illustrated in [23] how the use of interactive digital material can be relevant to analysing qualitative and quantitative data in a participatory and creative manner. They focused on the additional value of presenting data in an interactive and flexible way by using a two-way insight matrix and a word-mapping statistical technique called Semiometrie. In order to exemplify their usage, they presented a research project ("The Material Culture of Happiness") based on the collection and the analysis of photo diaries coming from Spain, France, England, Germany, Italy, The Netherlands, Finland and Russia.

The platform design based on the methodology proposed for web information retrieval and extraction in the context of the R&D project CROSSMARC (http://www.iit.demokritos.gr/skel/crossmarc) was presented in [24]. The platform facilitates the use of tools for collecting domain-specific web pages as well as for extracting information from them. It also supports the configuration of such tools to new domains and languages. The platform provides a user-friendly interface through which the user can specify the domain-specific resources (ontology, lexica, corpora for the training and testing of the tools), train the collection and extraction tools using these resources, and test the tools with various configurations.

The contribution of SAS Institute on industrial applications of Text Mining was included in [25]. Three industrial applications of text mining were presented, requiring different methodologies. The first application used a classification approach in order to filter documents relevant for personal profiles from an underlying document collection. The second application combines cluster analysis with statistical trend analysis in order to detect emerging issues in manufacturing. In the third application a combination of static term indexing
and dynamic singular value computation is used to drive similarity search in a large document collection. All of these approaches require a knowledgeable human to be part of the process; the goal is not automatic knowledge understanding but the use of text mining technology to enhance the productivity of existing business processes.

Thierry Poibeau in [26] presented the way an information extraction system can be recycled to produce RDF schemas for the semantic web [27]. He demonstrated that this kind of system must respect operational constraints, such as the fact that the information produced must be highly relevant (high precision, possibly poor recall). The production of explicit structured data on the web will lead to better relevance of information retrieval engines.

In recent years there has been a tremendous increase in the number of actors in the statistical arena, in terms of producers, distributors, and users, due to the new options offered by web technology. These actors are not sufficiently informed about the technological progress made in the field of text mining and the ways in which they can benefit from it. Several applications are needed in the world of production and dissemination of official statistics. Examples of such applications might be advanced querying of document warehouses at websites, analysing, processing and coding the answers to open-ended questions in questionnaire data, sophisticated access to internal and external sources of statistical metainformation, or "pulling" statistical data and metadata from the web sites of sending institutions [28].
4 Knowledge Mining Theory in Practice: Demonstration of Text Mining Case Studies

Text mining techniques were used to improve the consultation of jurisprudence textual databases in the case study presented in [29]. The data used for the case study consist of a corpus of 430 legal sentences issued by the Spanish Tribunal Supremo (Supreme Court), relating to prostitution offences, from 1979 to 1996. The authors focused on correspondence analysis (CA) techniques, but also provide some insights on similar visualization techniques, such as self-organizing maps (Kohonen maps), and review the potential impact of various Natural Language pre-processing techniques. CA is described in more detail, as well as its use in all the steps of the analysis. A concrete example is provided to illustrate the value of the results obtained with CA techniques for enhanced access to the studied jurisprudence corpus.

Technology Watch (TW) and Competitive Intelligence (CI) are important tools for the development of R&D activities and the enhancement of competitiveness in enterprises. TW activities are able to detect opportunities and threats at an early stage and provide the information needed to decide on and carry out the appropriate strategies. The basis of TW is the process of search, recovery, storage and treatment of information. The development of Text Mining solutions opens a new scenario for the development of TW activities. Up to
now, the enterprises and organizations using Text Mining techniques in their TW and information management activities are a small minority. Only a few large industrial groups have integrated Text Mining solutions into their structure in order to build up their information management systems and develop TW and CI activities. The situation concerning smaller companies (especially SMEs) is obviously worse with respect to the application of Text Mining techniques. The work in [30] focuses on possible ways to introduce Text Mining solutions into SMEs, describing methodological and operative solutions that could allow them to profit from Text Mining advantages in their TW and CI activities without charging them the high costs of individual Text Mining solutions. The model presented is centered on the collective use of advanced Data Mining and Text Mining techniques in SMEs through industrial and R&D Intermediate Centers.

The role of metadata and metainformation in the area of statistics is the theme of [31]. In the first part, the paper presents some basic characteristics of contemporary statistical information systems from the point of view of the need to utilize metadata and data/text mining. As is well known, modern statistical systems are characterized by enormous amounts of various statistical data, which also requires specific methods and technologies for their processing. In the second part the mutual relations between metadata and metainformation are analysed and some conclusions and recommendations for further research and development in these problem areas are presented.

Biotechnology is a technological sector that has seen tremendous growth during the past decade. It can be considered a technology at its peak of development and at the centre of interest of scientists and companies. Consequently, those who are actively involved in the field of biotechnology require a reliable and scientific estimation of the existing situation and of the technological innovation. In order to study the research in this particular technological sector worldwide, and particularly in Europe, Spinakis and Chatzimakri in [20] analysed patents which were certified during the years 1995–2003. The analysis was done with the use of the STING software, which is specialised in the analysis of patents. The analysis of patents is based on the usage of simple statistics and multidimensional techniques, such as Correspondence Analysis, Factor Analysis and Cluster Analysis. For this particular research 2064 patents were used.

A search engine that enables enhanced access to domain-specific data available on the web was presented in [33]. This engine proposes a hybrid search interface combining query-based search with automated navigation through a tree-like hierarchical structure. An algorithm for automated navigation is proposed that requires NLP of the documents, including language identification, tokenization, part-of-speech tagging, lemmatization and entity extraction.

The European Social Survey collects a vast amount of event data, which are reported in the media of European countries. The basic aim of the event
data selection is the creation of a database which will constitute the source for the analysis of events and the extraction of information in relation to the impact of historical circumstances in the shaping of attitudes. Stathopoulou in [34] presented the methodological approaches for event data analysis and exploration through text mining techniques. In addition, a case study of event data analysis from the database of the European Social Survey is also presented. The analysis was performed through the use of the SPAD software and SAS Text Miner.
5 Conclusions

Knowledge mining is an enormous field in the area of information extraction from different sources. The conference managed to present new research results obtained by combining well-known techniques and algorithms. Applications (non-commercial for the time being) were demonstrated, with examples of their use provided. The use of text mining algorithms, tools and techniques in real-life problems was presented through the case studies. Knowledge mining is a rapidly growing interdisciplinary field. In the forthcoming years more applications of KDD will be the focus of researchers in academia and industry.
References

1. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R.: Advances in Knowledge Discovery and Data Mining, MIT Press, 1996.
2. David Bell and Y. Bi, "An Evidential Approach to Classification Decision Combination for Text Categorisation", in "Knowledge Mining", Springer Verlag, Series: Studies in Fuzziness and Soft Computing, S. Sirmakessis (Ed.), 2005.
3. Joachims, T. (1997). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. The Fourteenth International Conference on Machine Learning (ICML'97).
4. Yang, Y. (2001). A study on thresholding strategies for text categorization. Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'01), pp. 137–145.
5. Guo, G., Wang, H., Bell, D., Bi, Y. and Greer, K. (2003). kNN model-based approach in classification. Cooperative Information Systems (CoopIS) International Conference. Lecture Notes in Computer Science, pp. 986–996.
6. Ittner, D.J., Lewis, D.D. and Ahn, D.D. (1995). Text categorization of low quality images. In Symposium on Document Analysis and Information Retrieval, pp. 301–315.
7. Bi, Y., Bell, D., Wang, H., Guo, G. and Greer, K. Combining Classification Decisions for Text Categorization: An Experimental Study. 15th International Conference on Database and Expert Systems Applications (DEXA'04), Lecture Notes in Computer Science, Springer-Verlag, pp. 222–231, 2004.
8. Lebart, L., Salem, A., Berry, L.: Exploring Textual Data. Kluwer Academic Publishers, Dordrecht (1998).
9. Simona Balbi, Michelangelo Misuraca, "Visualization Techniques for Non Symmetrical Relationships", in "Knowledge Mining", Springer Verlag, Series: Studies in Fuzziness and Soft Computing, S. Sirmakessis (Ed.), 2005.
10. Sergio Bolasco, Alessio Canzonetti, Federico M. Capo, Francesca Della Ratta-Rinaldi, Bhupesh K. Singh, "Understanding Text Mining: a Pragmatic Approach", in "Knowledge Mining", Springer Verlag, Series: Studies in Fuzziness and Soft Computing, S. Sirmakessis (Ed.), 2005.
11. D.K. Tasoulis and M.N. Vrahatis, "Novel Approaches in Unsupervised Clustering", in "Knowledge Mining", Springer Verlag, Series: Studies in Fuzziness and Soft Computing, S. Sirmakessis (Ed.), 2005.
12. M.N. Vrahatis, B. Boutsinas, P. Alevizos, and G. Pavlides. The new k-windows algorithm for improving the k-means clustering algorithm. Journal of Complexity, 18:375–391, 2002.
13. Ide, N., Véronis, J.: Word Sense Disambiguation: The State of the Art. Journal of Computational Linguistics (1998) 24(1) 1–40.
14. Kehagias, A., Petridis, V., Kaburlasos, V.G., Fragkou, P.: A Comparison of Word- and Sense-based Text Categorization Using Several Classification Algorithms. Journal of Intelligent Information Systems (2003) 21(3) 227–247.
15. Scott, S., Matwin, S.: Feature Engineering for Text Classification. In: Proc. of ICML-99, 16th International Conference on Machine Learning (1999) 379–388.
16. Hotho, A., Staab, S., Stumme, G.: WordNet improves Text Document Clustering. In: Proc. of the SIGIR 2003 Semantic Web Workshop (2003).
17. Bloehdorn, S., Hotho, A.: Boosting for Text Classification with Semantic Features. In: Proc. of the SIGKDD 2004 MSW Workshop (2004).
18. D. Mavroeidis, G. Tsatsaronis, M. Vazirgiannis, "Disambiguation for similarity retrieval in document collections", in "Knowledge Mining", Springer Verlag, Series: Studies in Fuzziness and Soft Computing, S. Sirmakessis (Ed.), 2005.
19. Gina Panagopoulou, "A Strategic Roadmap for Text Mining", in "Knowledge Mining", Springer Verlag, Series: Studies in Fuzziness and Soft Computing, S. Sirmakessis (Ed.), 2005.
20. Antoine Spinakis, Asanoula Chatzimakri, "Analysis of Biotechnology Patents", in "Knowledge Mining", Springer Verlag, Series: Studies in Fuzziness and Soft Computing, S. Sirmakessis (Ed.), 2005.
21. Maria Teresa Pazienza, Marco Pennacchiotti, and Fabio Massimo Zanzotto, "Terminology extraction: an analysis of linguistic and statistical approaches", in "Knowledge Mining", Springer Verlag, Series: Studies in Fuzziness and Soft Computing, S. Sirmakessis (Ed.), 2005.
22. Federico Neri, Remo Raffaelli, "Text Mining applied to multi-lingual corpora", in "Knowledge Mining", Springer Verlag, Series: Studies in Fuzziness and Soft Computing, S. Sirmakessis (Ed.), 2005.
23. Furio Camillo, Melissa Tosi, Tiziana Traldi, "Semiometric approach, qualitative research and text mining techniques for modelling the material culture of happiness", in "Knowledge Mining", Springer Verlag, Series: Studies in Fuzziness and Soft Computing, S. Sirmakessis (Ed.), 2005.
24. Vangelis Karkaletsis, Constantine D. Spyropoulos, "The CROSSMARC platform for Web information retrieval and extraction", in "Knowledge Mining", Springer Verlag, Series: Studies in Fuzziness and Soft Computing, S. Sirmakessis (Ed.), 2005.
25. Bernd Drewes, "Some Industrial Applications of Text Mining", in "Knowledge Mining", Springer Verlag, Series: Studies in Fuzziness and Soft Computing, S. Sirmakessis (Ed.), 2005.
26. Thierry Poibeau, "Content annotation for the Semantic Web", in "Knowledge Mining", Springer Verlag, Series: Studies in Fuzziness and Soft Computing, S. Sirmakessis (Ed.), 2005.
27. W3C. 1999. Resource Description Framework (RDF) Model and Syntax, W3C Recommendation, 22 Feb. 1999.
28. Alf Fyhrlund, Bert Fridlund and Bo Sundgren, "Using Text Mining in Official Statistics", in "Knowledge Mining", Springer Verlag, Series: Studies in Fuzziness and Soft Computing, S. Sirmakessis (Ed.), 2005.
29. Mónica Bécue-Bertaut, Martin Rajman, Ludovic Lebart, Eric Gaussier, "Extraction of the useful words from a decisional corpus. Contribution of correspondence analysis", in "Knowledge Mining", Springer Verlag, Series: Studies in Fuzziness and Soft Computing, S. Sirmakessis (Ed.), 2005.
30. Jorge (Gorka) Izquierdo, Sergio Larreina, "Collective SME approach to Technology Watch and Competitive Intelligence: the role of Intermediate Centers", in "Knowledge Mining", Springer Verlag, Series: Studies in Fuzziness and Soft Computing, S. Sirmakessis (Ed.), 2005.
31. Dušan Šoltés, "New challenges and roles of metadata in text/data mining in statistics", in "Knowledge Mining", Springer Verlag, Series: Studies in Fuzziness and Soft Computing, S. Sirmakessis (Ed.), 2005.
32. Antoine Spinakis, Asanoula Chatzimakri, "Analysis of Biotechnology Patents", in "Knowledge Mining", Springer Verlag, Series: Studies in Fuzziness and Soft Computing, S. Sirmakessis (Ed.), 2005.
33. Martin Rajman, Martin Vesely, "Combining Text Mining and Information Retrieval Techniques for Enhanced Access to Statistical Data on the Web", in "Knowledge Mining", Springer Verlag, Series: Studies in Fuzziness and Soft Computing, S. Sirmakessis (Ed.), 2005.
34. Theoni Stathopoulou, "Text Mining Tools for Event Data Analysis", in "Knowledge Mining", Springer Verlag, Series: Studies in Fuzziness and Soft Computing, S. Sirmakessis (Ed.), 2005.
An Evidential Approach to Classification Combination for Text Categorisation

D.A. Bell, J.W. Guan, and Y.X. Bi

School of Computer Science, Queen's University, Belfast, UK
{da.bell, j.guan, y.bi}@qub.ac.uk
Abstract. In this paper we look at a way of combining two or more different classification methods for text categorization. The specific methods we have been experimenting with in our group include the Support Vector Machine, kNN (nearest neighbours), kNN model-based approach (kNNM), and Rocchio methods. Then we describe our method for combining the classifiers. A previous study suggested that the combination of the best and the second best classifiers using evidential operations [1] can achieve better performance than other combinations. We assess some aspects of this from an evidential reasoning perspective and suggest a refinement of the approach.
1 Introduction

Recently there has been a lot of development and application of different learning methods for text categorization. Experimental assessment of the different methods is the basis for choosing a classifier as a solution to a particular problem instance. No single classifier is universally best, and "horses for courses" is the order of the day [2]. It is desirable to develop an effective methodology for combining them by taking advantage of the strengths of individual classifiers and avoiding their weaknesses. The benefits of combining multiple classifiers based on different classification methods for text categorization (TC) have been studied in [3, 4, 5]. In [6], we presented a method for combining text classifiers based on Dempster's rule for the combination of evidence. We propose a structure for representing outputs from different classifiers as three subsets based on the confidence values of labels, called a focal element triplet. Such a triplet constitutes a piece of evidence. This serves the purpose of distinguishing important elements from trivial ones. In this paper we assess some aspects of this from an evidential reasoning perspective and suggest a refinement of the approach.
2 Text Categorisation Algorithms and their Outputs

As in our previous studies, we outline the 4 particular algorithms used in our studies.

The Rocchio method was originally developed for query expansion by means of relevance judgements in information retrieval. It has been applied to text categorization by Ittner et al. [8]. There are several versions of the algorithm and we have implemented the version used by Ittner et al.

kNN is an instance-based classification method, which has been effectively applied to text categorization in the past decade. In particular, it is one of the top-performing methods on the benchmark Reuters corpus [9]. Unlike most supervised learning algorithms, which have an explicit training phase before dealing with any test document, kNN makes use of the local contexts derived from training documents to come to a classification decision on a particular document.

kNNModel is an integration of the conventional kNN and Rocchio algorithms [10]. It improves the kNN method by not being too dependent on the choice of k and by generally reducing the storage requirements. Local models are treated as local centroids for the respective categories to avoid mis-clustering some data points when linearly clustering the space of data points.

SVM (Support Vector Machine) is a high-performance learning algorithm, which has been applied to text categorization by Joachims [11]. We have integrated a version of the SVM algorithm implemented by Chang and Lin [12] in our prototype system for text categorization. There are two advantages of this algorithm: the first is that it has the ability to cope with the multi-class classification problem; and the second is that the classified results can be expressed as posterior probabilities that are directly comparable between categories.

We now use a standard formulation of the categorisation problem. Let D = {d1, d2, . . . , d|D|} be a training set of documents, where di is represented by a weighted vector {wi1, . . . , wim}, with wij the weight of the j-th term, and let C = {c1, c2, . . . , c|C|} be a set of categories. The task of assigning predefined categories to documents can then be regarded as a mapping which maps a boolean value to each pair ⟨d, c⟩ ∈ D × C. If the value T is assigned to ⟨d, c⟩, it means that a decision is made to include document d under the category c, whereas an F value indicates that document d is not under the category c. The task of learning for text categorization is to construct an approximation to an unknown function ϕ : D × C → {T, F}, where ϕ is called a classifier. However, given a test document di, such a mapping cannot guarantee that an assignment of the categories to the document is either true or false; instead it is a set of numeric values, denoted by S = {s1, s2, . . . , s|C|}, which represent the relevance of the document to the list of categories in the form of similarity scores or probabilities, i.e. ϕ(di) = {s1, s2, . . . , s|C|}, where the greater the score of a category, the greater the possibility of the document being under the corresponding category. It is necessary to develop a decision rule
to determine a final category of the document on the basis of these scores or probabilities.
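As a small illustration of this output representation, the sketch below shows one classifier's scores for a single document and the simplest possible decision rule. The numeric scores are hypothetical; only the category names are taken from the example used later in the paper.

```python
# Hypothetical classifier output phi(d_i) = {s_1, ..., s_|C|} for one test
# document, expressed as a score per category (values invented for
# illustration), together with a naive maximum-score decision rule.
scores = {
    "comp.graphics": 0.62,
    "comp.windows.x": 0.25,
    "alt.atheism": 0.13,
}

def decide(scores):
    """Assign the single category with the highest score."""
    return max(scores, key=scores.get)

print(decide(scores))  # -> comp.graphics
```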
3 Handling Uncertainty

Uncertainty permeates our existence due to the presence of randomness or chance in the nature of things. Information and knowledge (e.g. rules) pertinent to a given text categorisation instance often originate from different sources and are often pervaded with uncertainty. The question arises: is there any way we could formalise the reasoning processes, or otherwise make more visible for practical application, how evidence (uncertain knowledge and information) pertinent to a situation is obtained from multiple sources and combined? Exploitation of the different inputs usually requires combination operations such as Dempster's rule, or the orthogonal sum [13], to solve the data/information/knowledge fusion problem. Making choices between conclusions, for example, involves finding the "best supported" conclusion based on all the available evidence. Most traditional approaches to evidential reasoning are based on numerical methods of representing evidential supports, but there have also been several studies on non-numerical and logical methods. For example, there are methods based on a logic for uncertain information representation and uncertain reasoning that can be linked to the Dempster-Shafer (D-S) theory of evidence, and such a logic can make use of quantitative information if available. We now present our standard introduction to the theory.

The D-S theory of evidence has been recognized as an effective method for coping with such uncertainty or imprecision embedded in the evidence used in the reasoning process. It is suited to a range of decision making activities. The D-S theory is often viewed as a generalization of Bayesian probability theory, providing a coherent representation for ignorance (lack of evidence) and discarding the principle of insufficient reason. It formulates a reasoning process in terms of pieces of evidence and hypotheses and subjects these to a strict formal process to infer conclusions from the given uncertain evidence, avoiding human subjective intervention to some extent. In the D-S theory, which we also refer to as evidence theory, evidence is described in terms of evidential functions. Several functions commonly used in the theory are mass functions, belief functions, commonality functions, doubt functions, and plausibility functions. Any one of these conveys the same information as any of the others.

Definition 1. Let Θ be a finite nonempty set, and call it the frame of discernment. Let [0, 1] denote the interval of real numbers from zero to one, inclusive: [0, 1] = {x | 0 ≤ x ≤ 1}. A function m : 2^Θ → [0, 1] is called a mass function if it satisfies:
(1) $m(\emptyset) = 0$,
(2) $\sum_{X \subseteq \Theta} m(X) = 1$.

A mass function is a basic probability assignment to all subsets X of Θ. A subset A of a frame Θ is called a focal element of a mass function m over Θ if m(A) > 0. Note that a focal element is a subset rather than an element of Θ. The union C of all the focal elements of a mass function is called its core: $C = \bigcup_{X : m(X) > 0} X$.

A function bel : 2^Θ → [0, 1] is called a belief function if it satisfies:
(1) bel(∅) = 0,
(2) bel(Θ) = 1,
(3) for any collection A1, A2, . . . , An (n ≥ 1) of subsets of Θ,

$$\mathrm{bel}(A_1 \cup A_2 \cup \cdots \cup A_n) \;\geq\; \sum_{I \subseteq \{1,2,\ldots,n\},\ I \neq \emptyset} (-1)^{|I|+1}\, \mathrm{bel}\Big(\bigcap_{i \in I} A_i\Big).$$
This expression can be contrasted with conventional probability, where the inequality is replaced by an equality.

The fundamental operation of evidential reasoning, namely the orthogonal sum of evidential functions, is known as Dempster-Shafer's rule for combining evidence. Let m1 and m2 be mass functions on the same frame Θ. Suppose $\sum_{X \cap Y = \emptyset} m_1(X)\, m_2(Y) < 1$, and denote $N = \sum_{X \cap Y \neq \emptyset} m_1(X)\, m_2(Y)$. Then the function m : 2^Θ → [0, 1] defined by

(1) $m(\emptyset) = 0$, and
(2) $m(A) = \frac{1}{N} \sum_{X \cap Y = A} m_1(X)\, m_2(Y)$ for all subsets $A \neq \emptyset$ of Θ

is a mass function. The mass function m is called the orthogonal sum of m1 and m2, and is denoted m1 ⊕ m2. If $N = \sum_{X \cap Y \neq \emptyset} m_1(X)\, m_2(Y) = 0$, then we say that the orthogonal sum m1 ⊕ m2 does not exist, and that m1 and m2 are totally contradictory. Generally, K = 1/N is called the normalization constant of the orthogonal sum of m1 and m2.

We now present our standard definitions of evidence obtained from text classifiers and the mass and belief functions for this domain. Then we show how these pieces of evidence can be combined in order to reach a final decision.

Definition 2. Let C be a frame of discernment, where each hypothesis ci ∈ C is a proposition of the form "document d is of category ci", and let ϕ(d) be a piece of evidence that indicates the strength s(ci) of our confidence that the document comes from each respective category ci ∈ C. Then a mass function is defined as a mapping m : 2^C → [0, 1], i.e. a basic probability assignment (bpa) to ci ∈ C for 1 ≤ i ≤ |C|, as follows:

$$m(\{c_i\}) = \frac{s(c_i)}{\sum_{j=1}^{|C|} s(c_j)}, \qquad 1 \leq i \leq |C| .$$
This expresses the degrees of belief in the respective propositions corresponding to each category to which a given document could belong. With this formula, the expression of the output information ϕ(d) is rewritten as ϕ(d) = {m({c1}), m({c2}), . . . , m({c|C|})}. Outputs from different classifiers can then be combined using the orthogonal sum. We have developed a new structure, called a focal element triplet, which partitions ϕ(d) into three subsets. A number of empirical evaluations have been carried out to examine its effectiveness. More theoretical work on its validity and combinability can be found in [7].

Definition 3. Let C be a frame of discernment and ϕ(d) = {m({c1}), m({c2}), . . . , m({c|C|})}, where |ϕ(d)| ≥ 2. A focal element triplet is defined as an expression of the form Y = ⟨A1, A2, A3⟩, where A1, A2 ⊆ C are singletons, and A3 is the whole set C. These elements are given by the formulae below:

A1 = {ci}, ci = max {m({c1}), m({c2}), . . . , m({c|C|})}
A2 = {cj}, cj = max {{m({c1}), m({c2}), . . . , m({c|C|})} − m({ci})}
A3 = C

The associated mass function is as follows:

m(A1) = m({ci})
m(A2) = m({cj})
m(A3) = 1 − m({ci}) − m({cj})

In [6, 7], we make the assumption that the classes to be assigned to a given instance will only be among the top choice, the second-best choice, or the whole of the frame, in descending order. It is then possible that the second-best choice will be ranked as the top choice when we combine multiple classifiers. This assumption forms the rationale behind dividing ϕ(d) into a triplet, and it is the main issue we consider in this paper.
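To make Definitions 2 and 3 and the orthogonal sum concrete, here is a self-contained sketch in Python. It is not the authors' implementation: focal elements are represented as frozensets of category labels, the SVM masses are the ones quoted for document 37928 in the next section, and the kNN scores are invented for illustration (the paper does not give them numerically).

```python
from itertools import product

FRAME = frozenset({"c1", "c2", "c3", "c4", "c5", "c6"})

def triplet(scores, frame=FRAME):
    """Definitions 2 and 3: normalise raw scores to singleton masses, keep
    the two best-supported singletons, assign the remaining mass to the frame."""
    total = sum(scores.values())
    masses = {c: s / total for c, s in scores.items()}
    (c1, m1), (c2, m2) = sorted(masses.items(), key=lambda kv: -kv[1])[:2]
    return {frozenset({c1}): m1, frozenset({c2}): m2, frame: 1.0 - m1 - m2}

def combine(ma, mb):
    """Dempster's rule (orthogonal sum) of two mass functions."""
    unnorm = {}
    for (x, mx), (y, my) in product(ma.items(), mb.items()):
        inter = x & y
        if inter:                        # ignore conflicting (empty) pairs
            unnorm[inter] = unnorm.get(inter, 0.0) + mx * my
    n = sum(unnorm.values())             # N: total non-conflicting mass
    if n == 0:
        raise ValueError("totally contradictory evidence")
    return {a: v / n for a, v in unnorm.items()}

# SVM masses quoted in the text for document 37928; kNN scores hypothetical.
svm = {frozenset({"c1"}): 0.724, frozenset({"c2"}): 0.184, FRAME: 0.092}
knn = triplet({"c2": 0.55, "c4": 0.30, "c1": 0.10, "c3": 0.05})

combined = combine(svm, knn)
decision = max((a for a in combined if len(a) == 1), key=combined.get)
print(decision)   # frozenset({'c2'})
```

With these (partly invented) inputs the combined decision is c2, which is consistent with the worked example in the next section.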
4 The Triplet Combination Method

Suppose we have multiple learning algorithms and a set of training data. Each algorithm can generate one or more classifiers based on the training data. Outputs of different classifiers on the same test documents can be combined using the orthogonal sum to make the final classification decisions. To make the process clearer, we present an example from a previous study. Suppose we are given two triplets ⟨A1, A2, C⟩ and ⟨B1, B2, C⟩, where Ai ⊆ C, Bi ⊆ C, and two associated triplet mass functions m1, m2. Consider two pieces of evidence, obtained from two classifiers, kNN (k-nearest neighbours) and SVM (Support Vector Machine), respectively, represented in XML below in Fig. 1:
Result 1 (SVM): ⟨{c1}, {c2}, {c1, c2, c3, c4, c5, c6}⟩
Result 1 (kNN): ⟨{c2}, {c4}, {c1, c2, c3, c4, c5, c6}⟩

Fig. 1. Outputs produced by kNN and SVM (c1: comp.windows.x; c2: comp.graphics; c3: comp.sys.ibm.pc.hardware; c4: comp.sys.mac.hardware; c5: comp.os.ms-windows.misc; c6: alt.atheism)
In this example, C = {c1, c2, c3, c4, c5, c6} is the frame of discernment, and we use triplets ⟨A1, A2, C⟩ and ⟨B1, B2, C⟩ to represent the two results, i.e. ⟨A1, A2, C⟩ = ⟨{c1}, {c2}, {c1, c2, c3, c4, c5, c6}⟩ and ⟨B1, B2, C⟩ = ⟨{c2}, {c4}, {c1, c2, c3, c4, c5, c6}⟩, respectively. The corresponding mass functions for document 37928 are shown in Fig. 1. For example, the mass function given by SVM is m({c1}) = 0.724, m({c2}) = 0.184, and the ignorance m({c1, c2, c3, c4, c5, c6}) = 0.092. We can obtain a set of aggregated results. Since A1, A2, B1, B2 are singletons, the belief function is the same as the new mass function m. Therefore we have a set of strengths of belief over 3 possible categories as a combined result: {bel(A1), bel(A2), bel(B2)}. By choosing the category with the maximum degree of belief as the final decision, we have D(37928) = A2 = {c2}. Thus the final decision made by the combined classifier is category c2, the decision made by the kNN classifier. By repeatedly computing pairwise orthogonal sums as in Fig. 1, we can combine all of the triplet mass functions.

Now an obvious conjecture is that we could improve on this method by using a threshold for the allocation to ignorance. Suppose we use a threshold of 0.1 for ignorance, i.e. the mass allocated to the whole set (frame of discernment) is 0.1 or less. Then the order of the final choices might be different, although this is unlikely in the example above. We can gain insights into this by making a simple analysis for 3 classes. There are a limited number of permutations of classes A, B, C to consider, and by the nature of the problem space some of these cannot occur. For example, the two orders would not start with the same class, as the most strongly supported class in both lists would then be the same and it would clearly be the overall "winner".

Consider the combination of two pieces of evidence m1 and m2 in the case where we have a quartet rather than a triplet. So we keep the best 3 categorisations, where m1(A) = a, m1(B) = b and m1(C) = c, and m2(A) = e,
m2 (C) = d and m2 (B) = f. Suppose, for illustration, that the second set of results brings the third-rated categorisation to the top. Using ⊕ we get the result in Fig. 2.

              m1:  A a     B b     C c     r
  m2:  C d         –       –       dc      dr
       A e         ea      –       –       er
       B f         –       fb      –       fr
       s           sa      sb      sc      sr

Fig. 2. The orthogonal sum of two mass functions in the “best 3 categorisation” (where r = 1 − a − b − c and s = 1 − d − e − f); cells marked “–” correspond to empty intersections
Here support for A is: ae + e(1 − a − b − c) + a(1 − d − e − f);
support for B is: fb + f(1 − a − b − c) + b(1 − d − e − f);
support for C is: dc + d(1 − a − b − c) + c(1 − d − e − f).
We can then draw some interesting conclusions for particular cases using some simple algebra. For example, we can re-write the conditions for A to be the best as: a(e + s) > fb − (e − f)r + bs
AND a(e + s) > dc + (d − e)r + cs
Remember that r and s are the masses assigned to ignorance in m1 and m2 respectively. Then we can say such things as: because a(e + s) > fb, the first condition always holds when r > s and (e − f) > b. This would happen, for example, if we had a = 0.6, b = 0.2, c = 0.1 and d = 0.45, e = 0.35, f = 0.14. Now, for C to be better supported than A, we need to have
ae + e(1 − a − b − c) + a(1 − d − e − f) < dc + d(1 − a − b − c) + c(1 − d − e − f)
i.e. e(1 − b − c) + a(1 − d − e − f) < d(1 − a − b) + c(1 − d − e − f)
i.e. e(1 − b) + a(1 − e − f) < d(1 − b) + c(1 − d − f)
i.e. a < (d − e)(1 − b)/(1 − e − f) + c(1 − d − f)/(1 − e − f).
Now consider the case when only the best two of each categorisation method are used (i.e. using a triplet as in [5]). The result is in Fig. 3. Here support for A is: e(1 − b) + a(1 − d − e); support for C is: d(1 − a − b).
              m1:  A a     B b     r
  m2:  C d         –       –       dr
       A e         ea      –       er
       s           sa      sb      sr

Fig. 3. The orthogonal sum of two mass functions in the best 2 categorisation (where s = 1 − d − e and r = 1 − a − b); cells marked “–” correspond to empty intersections
A is the better choice when a > (d − e)(1 − b)/(1 − e); e.g. if e = 0.1 and b = 0.1 or more, A is better when a > d − 0.1. The interesting thing here is that, when f is small, we can write: a > (d − e)(1 − b)/(1 − e) means A is better for a triplet; a < (d − e)(1 − b)/(1 − e) + c(1 − d)/(1 − e) means C is better for a quartet. So, considering the two masses, the triplet gives A as better when a > (d − e)(1 − b)/(1 − e). However, when a quartet rather than a triplet is considered, this changes to C being better when a < (d − e)(1 − b)/(1 − e) + c(1 − d)/(1 − e).
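The comparison above can be checked numerically. The short Python sketch below simply evaluates the support formulas of Figs. 2 and 3 for the example values quoted earlier (a = 0.6, b = 0.2, c = 0.1 and d = 0.45, e = 0.35, f = 0.14); it is an illustration of the algebra, not part of the original study.

a, b, c = 0.6, 0.2, 0.1
d, e, f = 0.45, 0.35, 0.14

# Quartet ("best 3 categorisations" plus ignorance), cells of Fig. 2:
r, s = 1 - a - b - c, 1 - d - e - f
support_A = a*e + e*r + a*s
support_B = f*b + f*r + b*s
support_C = d*c + d*r + c*s
print("quartet supports:", support_A, support_B, support_C)

# Triplet ("best 2 categorisations" plus ignorance), cells of Fig. 3:
r3, s3 = 1 - a - b, 1 - d - e
support_A3 = a*e + e*r3 + a*s3
support_C3 = d*r3
print("triplet supports:", support_A3, support_C3)
print("A beats C in the triplet:", a > (d - e)*(1 - b)/(1 - e))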
5 Discussion of the Use of Dempster’s Rule Here
A number of properties of the D-S rule, including what some would say are prima facie weaknesses, have been identified, exhaustively analysed and dealt with in the literature over the years. We mention a few of these here. On the plus side, for belief functions, the orthogonal sum gives a result which is independent of the order in which the combinations take place (it is commutative and associative). Also, a combination of belief functions gives another belief function. On the negative side, the belief functions to be combined have to be based on distinct pieces of evidence. There are strict rules under which the orthogonal sum can be used. For example, in the case of TC it could be argued that the pieces of evidence are not entirely independent and multiple-agent methods should be used instead. But these issues are beyond the scope of the present study.
6 Conclusion
We suggest the extension of a novel method and technique for representing outputs from different classifiers – a focal element triplet – to a focal element quartet, and use an evidential method for combining multiple classifiers based on this new structure. The structure, and the associated methods and techniques developed in this research, are particularly useful for data analysis and decision making under uncertainty. They can be applied in many decision-making exercises in which knowledge and information are insufficient and incomplete.
References
1. Bi, Y., Bell, D., Wang, H., Guo, G. and Greer, K. (2004). Combining Classification Decisions for Text Categorization: An Experimental Study. 15th International Conference on Database and Expert Systems Applications (DEXA'04), Lecture Notes in Computer Science, Springer-Verlag, pp. 222–231.
2. Sebastiani, F. (2002). Machine Learning in Automated Text Categorization. ACM Computing Surveys, Vol. 34(1).
3. Larkey, L.S. and Croft, W.B. (1996). Combining classifiers in text categorization. In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval, pp. 289–297.
4. Li, Y.H. and Jain, A.K. (1998). Classification of Text Documents. The Computer Journal, Vol. 41(8), pp. 537–546.
5. Yang, Y., Ault, T. and Pierce, T. (2000). Combining multiple learning strategies for effective cross validation. The Seventeenth International Conference on Machine Learning (ICML'00), pp. 1167–1182.
6. Bi, Y., Bell, D., Wang, H., Guo, G. and Greer, K. (2004). Combining Multiple Classifiers Using Dempster's Rule of Combination for Text Categorization. Proceedings of the Modelling Decisions for Artificial Intelligence Conference, Lecture Notes in Artificial Intelligence, Springer-Verlag, pp. 127–138.
7. Bell, D., Guan, J. and Bi, Y. On Combining Classifier Mass Functions for Text Categorisation. IEEE Transactions on Knowledge and Data Engineering (to appear).
8. Ittner, D.J., Lewis, D.D. and Ahn, D.D. (1995). Text categorization of low quality images. In Symposium on Document Analysis and Information Retrieval, pp. 301–315.
9. Yang, Y. (2001). A study on thresholding strategies for text categorization. Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'01), pp. 137–145.
10. Guo, G., Wang, H., Bell, D., Bi, Y. and Greer, K. (2003). kNN model-based approach in classification. Cooperative Information Systems (CoopIS) International Conference, Lecture Notes in Computer Science, pp. 986–996.
11. Joachims, T. (1997). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. The Fourteenth International Conference on Machine Learning (ICML'97).
12. Chang, C.C. and Lin, C.J. (2001). LIBSVM: a library for support vector machines (http://www.csie.ntu.edu.tw/∼cjlin/libsvm).
13. Guan, J. and Bell, D.A. (1991). Evidence Theory and its Applications. North-Holland.
14. Mitchell, T. (1997). Machine Learning. McGraw-Hill.
15. Shafer, G. (1976). A Mathematical Theory of Evidence. Princeton University Press, Princeton, New Jersey.
Visualization Techniques for Non Symmetrical Relations
Simona Balbi and Michelangelo Misuraca
Dipartimento di Matematica e Statistica, Università “Federico II” di Napoli, 80126 Napoli, Italy
{sb, mimisura}@unina.it
dms.unina.it/adt/index.html
Abstract. Many strategies of Text Retrieval are based on Latent Semantic Indexing and its variations, which consider different weighting systems for words and documents. Correspondence Analysis and L.S.I. share the same basic algebraic tool, i.e. the Singular Value Decomposition and its generalisation, which concerns the use of a different way of measuring the importance of each element, both in determining and representing similarities between documents and words. The aim of the paper is to propose a peculiar factorial approach for better visualizing the relations between textual data and documents, compared with classical Correspondence Analysis. Here we consider a term frequency/document frequency index scheme, mainly developed for Text Retrieval, in a textual data analysis context. An application on the Italian Le Monde Diplomatique corpus (about 2000 articles published from 1998 to 2003 in the Italian edition of LMD) will show the effectiveness of the proposal.
1 Introduction
The increasing availability of e-documents makes it necessary to develop tools for automatically extracting and analyzing the most informative data. In the information mining step of a Text Mining strategy, it is often useful to apply multidimensional data analysis techniques. In this paper we focus our attention on Correspondence Analysis, in order to investigate the language used in a large corpus, in terms of the association structure in a lexical table. The aim is to summarize and graphically represent the most meaningful information contained in the collection. In this work we propose a peculiar factorial approach in order to better visualize the relations between textual data and documents, compared with the results of classical Correspondence Analysis (CA) on lexical tables [6]. CA is based on the Chi-square metric, which gives the same attention to the behavior of frequent and infrequent elements. A negative consequence, mainly in analyzing very sparse tables such as lexical ones, is that it can lead the reader to a misinterpretation of the factorial maps. In order to avoid this problem, Balbi [1] proposes
the use of Lauro and D’Ambra’s Non Symmetrical Correspondence Analysis [5] for analyzing aggregated lexical tables, i.e. tables in which documents are aggregated with respect to some prior information. This method makes it possible to compute distances between documents with a usual Euclidean metric, making the reading of the maps more natural (distances between words are still Chi-square ones) and enhancing the different roles played by the two ways of the aggregated lexical table (the vocabulary depending on the aggregation criterion of the documents). This extreme solution does not seem to be completely satisfying, mainly if we are interested in studying the similarities between documents, as it gives too high an importance to common words. This problem has been deeply investigated in the text categorisation frame, and more generally in Text Retrieval strategies, where the term frequency is often dampened by a function, because more occurrences of a word reflect a higher importance, but not as much relative importance as the undampened count would suggest [4]. In this paper we propose a different metric, which takes into account the wide literature on word and document weighting systems, mainly developed for Text Retrieval, in order to consider word and document frequencies (as for Chi-square), but in a softened way. The aim is to consider similarities between groups of documents in terms of their lexical richness, without taking into account their contents. An application on the Italian Le Monde Diplomatique corpus (ILMD6), which contains about 2000 articles published from 1998 to 2003 in the Italian edition of the newspaper Le Monde Diplomatique, will show the effectiveness of our proposal.
2 Basic Concepts on the Problem of Word Importance in a Text
Many strategies of Text Retrieval are based on Latent Semantic Indexing [3] and its variations, mainly based on different weighting systems for words and documents. Correspondence Analysis and Latent Semantic Indexing share the same basic algebraic tool, i.e. the Singular Value Decomposition, and its generalisations [2]. The generalisation concerns the use of a different way of measuring the importance of each element, both in determining and representing similarities between documents and word use. In this work we propose the introduction of a term frequency/document frequency index scheme (tf/df) in the information mining step. The tf/df family of vector-based information retrieval schemes [7] is very popular because of its simplicity and robustness. Some considerations on the peculiarities of working on texts are the conceptual bases of the approach:
• as more frequent terms in a document are more indicative of the topic, it is important to consider fik = frequency of term i in document k;
• a normalisation of fik can be proper, by considering the number of occurrences of the most used term in each document, introducing tfik:
tfik = fik / max fk .    (1)
where max fk is the number of occurrences of the most frequently used term in the k-th document;
• as terms that appear in many different documents are less indicative of the overall topic, it is important to measure a term’s discrimination power, by means of the index idfi. Naming dfi the document frequency of term i (# documents containing term i), the “inverse document frequency” of term i is given by idfi = log2(n/dfi), with n the number of all documents. The logarithm has been suggested to dampen the effect related to term frequency.
A typical combined term importance indicator is given by tf-idf weighting:
wik = tfik idfi = (fik / max fk) log2(n/dfi) .    (2)
The effect of using wik is that a term i, occurring frequently in a document k but rarely in the rest of the collection, has a high weight. Many other ways of determining term weights have been proposed, but experimentally, tf-idf has been found to work properly.
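As an illustration of (1) and (2), the weighting can be sketched as follows; this is a minimal Python example on a toy corpus, with hypothetical function and variable names.

import math
from collections import Counter

def tf_idf_weights(docs):
    # w_ik = (f_ik / max f_k) * log2(n / df_i), following (1) and (2)
    n = len(docs)
    counts = [Counter(d) for d in docs]
    df = Counter()
    for c in counts:
        df.update(c.keys())                       # document frequency df_i
    weights = []
    for c in counts:
        max_f = max(c.values())                   # occurrences of the most used term
        weights.append({t: (f / max_f) * math.log2(n / df[t]) for t, f in c.items()})
    return weights

docs = [["text", "mining", "text"], ["mining", "data"], ["data", "analysis", "data"]]
print(tf_idf_weights(docs))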
3 Our Data Structure
Let us consider a collection of n documents, and let T be a matrix whose general element is given by the frequency of the i-th word in the k-th document (i = 1, . . . , p; k = 1, . . . , n). Furthermore, let Q be an indicator matrix containing some information for each document (e.g. year of publishing, topic, author, etc.). Through the matrix product of T and Q we obtain an aggregated lexical table K which cross-classifies the p words with the q categories considered for the documents. The row-marginal distribution ki is the distribution of the words in the whole collection. The marginal distribution of the j columns (j = 1, . . . , q) is given by the total number of words in each document category. From a geometrical viewpoint, our purpose is to project the cloud Nq, representing the q categories, in a lower dimensional subspace Rm∗, with m∗ < m = [min(p, q) − 1], by assuming a unitary weighting system and a peculiar weighted Euclidean metric. Because of the different roles played by rows and columns, we assign the same importance to all categories, but we measure the distance between categories by taking into account the different weights of the p words, in terms of a term frequency index. Let us consider the relative frequencies matrix F. Given the i-th word frequency fij, we consider the tfij as:
tfij = fij / max fj .    (3)
Fig. 1. By multiplying the lexical table p Tn (words x documents) and n Qq (documents x categories) we obtain the aggregated lexical table p Kq (words x categories)
where max fj is the number of occurrences of the most used word in the j-th category. By considering the number of documents in each category as weights, we compute for each word the average tf as:
atfi = (1/n) Σj (fij / max fj) nj .    (4)
Let Ω ≡ [atf1 . . . atfp]T be the vector of the p average tf values; we consider as metric:
DΩ ≡ diag(Ω) .    (5)
From a mathematical point of view, the method leads off with the eigenanalysis of the matrix:
A ≡ FT (DΩ)−1 F .    (6)
i.e. with the generalized singular value decomposition (gSVD) of F:
F = U Λ VT .    (7)
UT (DΩ)−1 U = VT V = I .    (8)
where Λ is the diagonal matrix whose elements are the square roots of the eigenvalues λα of A, while (DΩ)−1/2 U and V are respectively its left and right eigenvectors. The factorial coordinates on the α-th axis of the j categories in Rm∗ are:
ϕαj = (λα)1/2 vαj .    (9)
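A rough numerical sketch of the pipeline defined by (3)–(9) is given below (Python/NumPy, with hypothetical random data standing in for T and Q; the treatment of marginals and of the trivial axis is deliberately simplified, so this is an illustration of the algebra rather than a faithful reimplementation of the method).

import numpy as np

rng = np.random.default_rng(0)
T = rng.integers(1, 6, size=(50, 30))          # term-by-document counts (hypothetical)
Q = np.eye(3)[rng.integers(0, 3, size=30)]     # document-by-category indicator matrix

K = T @ Q                                      # aggregated lexical table (p x q)
F = K / K.sum()                                # relative frequencies matrix

n = Q.shape[0]                                 # number of documents
n_j = Q.sum(axis=0)                            # documents per category
tf_agg = K / K.max(axis=0, keepdims=True)      # tf_ij = f_ij / max f_j, per category
atf = (tf_agg * n_j).sum(axis=1) / n           # Omega: average tf of each word

A = F.T @ np.diag(1.0 / atf) @ F               # q x q matrix of (6)
eigvals, V = np.linalg.eigh(A)                 # A is symmetric
order = np.argsort(eigvals)[::-1]
lam, V = eigvals[order], V[:, order]
phi = V * np.sqrt(np.clip(lam, 0.0, None))     # coordinates of the q categories, as in (9)
print(phi[:, :2])                              # first two factorial axes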
4 Exploring the corpus ILMD6
The corpus ILMD6 contains the articles published from 1998 to 2003 in the Italian edition of the newspaper Le Monde Diplomatique (LMD). Each issue of LMD is a translation of the original monthly edition, together with some book reviews drawn up by the Italian editorial staff. The language is quite homogeneous because the translations from French into Italian are made by the same few persons. However, words are often translated in different ways by different translators.
Table 1. Topics classification

#  Topic                       #  Topic                       #  Topic
1  Ambiente e salute           12 Donne                       23 Medio Oriente
2  Ambiente e sviluppo         13 Ex URSS e Europa Or.        24 Migrazioni/Razzismo
3  Armi/Strategie              14 Geopolitica                 25 Minoranze
4  Balcani/Guerre & Pace       15 Giustizia/Diritto Inter.    26 Nord/Sud Liberismo
5  Chiese e religioni          16 Globalizzazione             27 Povertà/Esclusione
6  Comunicazione/Internet      17 Conflitti etnici            28 Sicurezza/Contr.sociale
7  Crisi e finanza mondiale    18 Islam e Musulmani           29 Società
8  Cultura                     19 Israele e Palestina         30 Storia e memoria
9  Destra/Estrema Destra       20 Liberismo/OMC               31 Terrorismo
10 Disoccupazione              21 Libertà/Diritti umani       32 Unione Europea
11 Dittature                   22 Media e giornalismo
The whole six-year collection has been downloaded from the newspaper website and it has been automatically converted into text format with a script written in the Java language. For each article, only the body has been considered (i.e. titles and subheadings have been ignored). A set of 1914 articles has been manually selected and categorised into 32 topics on the basis of the main subject. The articles have been normalised in order to reduce the possibility of data splitting, for example by converting all the capital letters to lower case, conforming the transliteration of words coming from other alphabets (mainly proper nouns), or using the same notations for acronyms and dates. In this way, we have obtained a corpus of about 3 million occurrences. After carrying out a quite in-depth lexicalization in order to avoid trivial cases of ambiguity, a vocabulary of more than 77000 textual forms (not only graphical forms but also forms resulting from a compounding process, such as multiwords and polywords), with a hapax level of 39.9% and partially marked with Part of Speech tags, has been built. We decided to perform a stemming procedure only on verbs, because an automatic lemmatization on large corpora may frequently cause a loss of useful information, even if the work has been focused on topics (Table 1). Here we show the results obtained from the analysis of a 455000-occurrence training set, collected by taking into account only the 311 articles published in 2001. In the framework of NSCA, it is possible to separately visualize the subspace spanned by the forms and the subspace spanned by the documents. In Fig. 2 the relationships between document categories, in terms of dependence, have been represented. On the right side of the factorial map there are the
Fig. 2. NSCA representation of document topics: first and second principal axes
categories related to the conflicts in the Balkans and the Middle East (e.g. Islam, Ethnic Conflicts, Terrorism); on the left side there are the categories related to international economic and law issues (e.g. Globalization, Liberalism, Human Rights).
Fig. 3. Factorial representation of document topics: first and second principal axes
In order to visualize the similarity between the categories with respect to the lexical richness of their specific vocabularies, all topics have been reported on the factorial map in Fig. 3. The topics closer to the axes origin use a narrow-ranging vocabulary. Furthermore, if two topics are near they have the same
lexical richness, in terms of the words used. The dependence structure between categories is quite similar to the previous one, but it differs because of the peculiar metric.
5 Conclusions and Perspectives
In this paper the opportunity of graphically representing the similarity between documents, in terms of lexical richness, has been shown by using a peculiar factorial approach. A tf/df-based metric is considered in the subspace spanned by the terms for measuring the distances between documents. We think that further developments can be achieved by introducing a proper weighted Euclidean metric in the subspace spanned by the documents, for visualizing word associations. Moreover, in order to better understand the relations between the documents and the language used, the development of more powerful graphical tools for textual data analysis will be studied in depth, in the frame of Visual Text Mining.
References
1. Balbi, S.: Non symmetrical correspondence analysis of textual data and confidence regions for graphical forms. In: Bolasco, S., et al. (eds.): Actes des 3es Journées internationales d'Analyse statistique des Données Textuelles. Vol. 2. CISU, Roma (1995) 5–12
2. Balbi, S., Di Meglio, E.: Contributions of Textual Data Analysis to Text Retrieval. In: Banks, D., et al. (eds.): Classification, Clustering and Data Mining Applications. Proceedings of the 9th Conference of the International Federation of Classification Societies. Springer-Verlag, Heidelberg-Berlin (2004) 511–520
3. Deerwester, S., et al.: Indexing by latent semantic analysis. In: Journal of the American Society for Information Science. Vol. 6. (1990) 391–407
4. Grassia, M.G., Misuraca, M., Scepi, G.: Relazioni non simmetriche tra corpora. In: Purnelle, G., et al. (eds.): Le poids des mots. Actes des 7es Journées internationales d'Analyse statistique des Données Textuelles. Vol. 1. UCL Presses (2004) 524–532
5. Lauro, N.C., D'Ambra, L.: L'analyse non symétrique des correspondances. In: Diday, E., et al. (eds.): Data Analysis and Informatics. North-Holland, Amsterdam (1984) 433–446
6. Lebart, L., Salem, A., Berry, L.: Exploring Textual Data. Kluwer Academic Publishers, Dordrecht (1998)
7. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. In: Information Processing & Management. Vol. 5 (1988) 513–523
Understanding Text Mining: A Pragmatic Approach
Sergio Bolasco, Alessio Canzonetti, Federico M. Capo, Francesca della Ratta-Rinaldi, and Bhupesh K. Singh1
Università “La Sapienza” di Roma, Via del Castro Laurenziano 9, 00161, Roma, Italy
{sergio.bolasco, alessio.canzonetti, francesca.dellaratta}@uniroma1.it
[email protected] [email protected]
geostasto.eco.uniroma1.it/
Abstract. In order to delineate the state of the art of the main TM applications, a two-step strategy has been pursued: first of all, some of the main European and Italian companies offering TM solutions were contacted, in order to collect information on the characteristics of the applications; secondly, a detailed search on the web was made to collect further information about users or developers and applications. On the basis of the material collected, a synthetic grid was built to position, out of the more than 300 cases analysed, the 100 that we considered most relevant for the typology of function and sector of activity. The joint analysis of the different case studies has given an adequate picture of TM applications according to the possible types of results that can be obtained, the main specifications of the sectors of application and the type of functions. Finally, it is possible to classify the applications by matching the level of customisation (followed in the tools' development) and the level of integration (between users and developers). This matching produces four different situations: standardisation, outsourcing, internalisation, synergism.
1 The Approach to the Study
Making correct decisions often requires analysing large volumes of textual information. Text Mining is a budding new field that endeavours to garner meaningful information from natural language text. Text Mining is the process of applying automatic methods to analyse and structure textual data in order to create useable knowledge from previously unstructured information. Text
This work comes from a common effort. S. Bolasco wrote Sect. 1, A. Canzonetti Sects. 2 and 3, F. Capo Sects. 4.1 and 4.3, F. della Ratta-Rinaldi Sect. 4.2, B. Singh Sect. 4.4 and S. Bolasco & F. della Ratta-Rinaldi Sect. 5.
Mining is inherently interdisciplinary, borrowing heavily from neighbouring fields such as data mining and computational linguistics [11, 18, 25]. In this paper we focused on the state of the art of the main TM applications: we considered corporate users/clients, companies, scientific communities and others who may have used TM techniques to achieve their goals. The starting point was to analyse the work carried out by the companies that have been developing software solutions for TM for some time now. Some of the main European and Italian companies offering TM services were contacted, in order to collect information on the characteristics of the applications, the type of technology used, the problems and solutions, and possible future scenarios. Having examined the features of the main Italian companies [3], the research was extended to other important international market players (SAS, Spss, Temis, Inxight, etc.). An information bank of customers that used TM instruments for Business Intelligence or research was also created. The analysis of cases allowed us to identify three alternative points of view by which to interpret the TM applications: the functions that they satisfy, the sectors of activity and the type of results that could be obtained. By linking the functions with the sectors of application, a grid was constructed where the 100 most relevant case studies are placed (see Table 1). The applications can be divided into four main typologies in relation to: Knowledge Management (KM) and Human Resources (HR); Marketing, ranging from Customer Relationship Management (CRM) to Market Analysis (MA); Technology, ranging from Technology Watch (TW) to Patent Analysis (PA); and, lastly, Natural Language Processing (NLP). Within these four macro-groups, the eleven categories of functions reported below were identified:
• KM & HR: Support and decision making and Competitive Intelligence, Extraction Transformation Loading (ETL: transformation of free text into structured text and database filling), Human Resources Management (employee motivation and CV analysis);
• CRM & MA: Customer care and CRM, Market Analysis, Customer Opinion Analysis on virtual communities (mail and newsgroup);
• TW: Patents, Scientific abstracts and Financial news;
• NLP: Questioning in Natural Language and search engines, Multilingual applications, Voice Recognition.
We have tried to present a series of application cases that is as complete as possible by conducting a search on the web2: the joint analysis of the different “case studies” has given an adequate picture of the state of the art of TM, analysing applications according to:
A profile was compiled for each case, summarising systematically the initial problem, the solution, the results obtained and the group of sources that were consulted to draft the profile. All case studies are available on the Nemis website, in the second part of the document titled Nemis WG3 Final Report 2004; see web site [1].
Table 1. Some main case studies by sector and application typology
(a) the possible types of results to be obtained; (b) the main specifications of the sectors of application; (c) the type of functions. The analysis of functions is crucial because it allows us to understand and highlight the main objectives of the applications.
2 The Output of Text Mining
The applications analysed present a considerable variety from the point of view of objectives, the type of texts analysed and the strategy of processing and analysis chosen. As a rule, it can be said that a text mining strategy [Rajman and Vesley, 2004] is preceded by two essential phases:
• the pre-processing phase, where text retrieval, formatting and filing are done;
• the lexical processing, which involves the identification and lemmatisation of words.
Following these two phases, the actual Text Mining processing is extremely diversified, as it is strictly linked to the objectives to be achieved, which are:
1. automatic analysis of documents and their categorisation/classification for the successive information retrieval;
2. search for relevant entities for information extraction;
3. formulation of queries in natural language, interpreted by NLP processes based on algorithms of artificial intelligence;
4. processing of multilingual texts for the retrieval of information independent of the original language of the documents.
2.1 Automatic Categorisation/Classification of Documents
The automatic analysis of documents is aimed at getting different types of results:
(a) the classification of documents within a predefined grid of categories;
(b) the clusterisation of the texts according to conceptual similarity or vocabulary;
(c) the extraction of semantic information from the text;
(d) the text summarisation.
(a) The case of document classification in a predefined “grid” of categories is probably the most frequent. This technique is used for example for the management of document bases, as in the case of big publishers, of information of a judicial nature, and for CRM applications that foresee automatic message routing [8, 27]. In order to carry out the classification, these programmes use a knowledge base – specific with respect to the type of documents being analysed – which makes it possible to recognise the entities and the key concepts permitting the classification.
Usually the operation of placing a set of documents in a predefined grid was carried out by companies manually before the introduction of TM systems, an operation that was extremely demanding from the point of view of the time involved and the human resources employed. This is perhaps one of the cases in which the advantages of the introduction of the automatic application are most evident: the saving of time and resources made possible by the automatic application, and the level of reliability of the result.
(b) On the other hand, in the case in which a grid to classify the documents is not available, clusterisation techniques are used, which separate/group the documents into groups according to the similarity of their contents. Clusterisation is the process most frequently used in cases in which the content of the documents undergoing analysis is subject to high variability and is often unknown to the user (as in the case of the documents extracted from a search engine): the subdivision into groups makes it possible to have an idea of the conceptual domains that the documents belong to, since it is usually possible to consult the list of words characterising each cluster (for example, by the use of the TFIDF index [23, 24]). The document clusterisation procedures are followed in some cases by the use of search engines that facilitate the retrieval of filed information, automatically associating the documents containing the concept being searched for. Clusterisation can be used not only for information retrieval, but also for identifying trends and topics by reading the texts. Thus, clusterisation makes it possible to achieve an organized overview of the topics contained in the documents.
(c) The analysis can be aimed at the extraction of relevant semantic contents of the text being examined, as in the case of Customer Opinion Analysis (COA), used in marketing applications in which a set of messages is analysed to obtain information concerning customers' opinions.
(d) Finally, Text Summarisation makes it possible to create summaries and/or abstracts of documents automatically. The Text Summarisation procedures [15] carry out the linguistic/statistical analysis of the document under examination to identify the topics dealt with and to eliminate the insignificant parts for the purposes of synthesis.
2.2 Search for Relevant Entities
The applications dedicated to the search for relevant information do not foresee the classification of documents but the formulation of an answer to a specific query. Information extraction [20] is frequent in the applications of Competitive Intelligence, Technology Watch and Market Analysis, which have in common the objective of extracting strategic information from a vast amount of documents. In most cases statistical techniques of data reduction are used [14]. The information extracted generally consists of lists of competitors and the products offered, lists of potential clients, identification of partnerships between
companies, investment news or experiments in new markets, and information on new technologies or patents.
2.3 Formulation of Queries in Natural Language
Natural Language Processing forms the basis of most TM processes. Among the most significant applications that use this type of linguistic technology are those that permit the management of queries in natural language, used above all for CRM or eGovernment. They are applications that facilitate the contact with, and the retrieval of information on, the Internet (or an intranet) for users who are not particularly familiar with query languages. The result of this process is the extraction of information with the criterion of maximum precision and minimum effort on the part of the user.
2.4 Processing of Multilingual Texts
A sector in continuous expansion, which will undoubtedly represent one of the future TM developments, is that of the management and interpretation of multilingual corpora. The ability to interact simultaneously with texts drafted in different languages (starting with special dictionaries in which the “translators” have been contextually tested) is a potentiality used above all in the field of search engines that make it possible to extract documents of interest using multilingual platforms. A specific case study has been carried out by the company Synthema [17] on the extraction of information from multilingual corpora in the context of the NEMIS project.
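As a minimal illustration of the clusterisation output described in Sect. 2.1 (documents grouped by vocabulary similarity, each cluster described by its characteristic words), the following Python sketch uses generic tf-idf vectorisation and k-means; the toy documents and parameters are hypothetical and do not correspond to any of the case studies discussed here.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "call centre customer complaint about billing",
    "billing error reported by customer service",
    "new patent on text mining technology",
    "patent application for language technology",
]                                               # hypothetical mini-corpus
vec = TfidfVectorizer()
X = vec.fit_transform(docs)                     # tf-idf document-term matrix
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for k, centre in enumerate(km.cluster_centers_):
    top = centre.argsort()[::-1][:3]            # characteristic words of the cluster
    print(k, [terms[i] for i in top])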
3 The Sectors of Text Mining Application
The main TM applications are most often used in the following sectors:
• Publishing and media;
• Banks, insurance and financial markets;
• Telecommunications, energy and other services industries;
• Information technology sector and Internet;
• Pharmaceutical and research companies and healthcare;
• Political and judicial institutions, political analysts, public administration.
The sectors analysed are characterised by a fair variety in the applications being experimented with; however, it is possible to identify some sectorial specificities in the use of TM, linked to the type of production and to the knowledge management objectives leading them to use TM. The publishing sector, for example, is marked by a prevalence of Extraction Transformation Loading applications for cataloguing, production and the optimisation of information retrieval. In the banking and insurance
sectors, on the other hand, CRM applications are prevalent, aimed at improving the management of customer communication by automatic systems of message re-routing and with QNL applications supporting search engines that can be asked questions in natural language. In the institutional field, ETL applications are prevalent for the filing and management of legal and normative documents, and CRM and QNL applications are used to increase citizens' participation and the dissemination of information. In the medical and pharmaceutical sectors, applications of Competitive Intelligence and Technology Watch are widespread for the analysis, classification and extraction of information from articles, scientific abstracts and patents. The main applications in Information Technology and the Internet concern natural language queries, above all in search engines within websites, multilingual corpora processing for information retrieval purposes independent of the language used, and document management. A sector in which several types of applications are widely used is that of the telecommunications and service companies: in these companies the most diverse objectives find an answer in TM applications, from market analysis to human resources management, from spelling correction to customer opinion surveys.
4 Type of Functions and Objectives
4.1 Text Mining Applications in Knowledge Management and Human Resources
Competitive Intelligence (CI)
The need to organise and modify their strategies, according to the demands and opportunities that the market presents, requires companies to collect information about themselves, the market and their competitors, to manage enormous amounts of data, and to analyse them in order to make plans. The aim of Competitive Intelligence is to select only relevant information by automatic reading of these data (http://www.scip.org/; [26]). Once the material has been collected, it is classified into categories to develop a database which can be analysed to get answers on issues that are specific and crucial for company strategies. The typical queries concern the products, the sectors of investment of the competitors, the partnerships existing in markets, the relevant financial indicators, and the names of the employees of a company with a certain profile of competences. In some cases the introduction of TM replaces already existing systems, as in the case of Total [2, 3] where, before the introduction of TM, there was a division that was entirely dedicated to the continuous monitoring of information (financial, geopolitical, technical and economic) and to answering the queries coming from other sectors of the company.
In these cases the return-on-investment by the use of TM technologies was self evident when compared to results previously achieved by manual operators. In some cases, if a scheme of categories is not defined a priori, clusterisation procedures are used to classify the set of documents considered relevant with regard to a certain topic, in clusters of documents with similar contents. The analysis of the key concepts present in the single clusters gives an overall vision of the subjects dealt with in the single texts. A good example of this is given by the IBM [28] corporation, realized with Online Analyst: in order to answer the needs of the sales department, all the documents regarding call centres were extracted from the company’s document base and classified into 30 groups, the features of which are visualised with the list of key concepts characterising them. The software used permitted the creation of some summary tables, which in the example in question gives the names and the number of times the competitors are mentioned, the names of other companies involved in the market, the principal partnerships among companies and the list of information technology services identified in the documents with the relative occurrences. The cluster analysis procedures were also used to define a classification system into which new documents could be added at any time. This is the case, for example, of the pharmaceutical company GlaxoSmithKline [4, 10] which, using the SAS Text Miner on a sample of PubMed articles, identified some macro-categories into which to classify the references to their own and their competitors’ pharmaceutical products. This system of categories was later used for a new set of data, classified by means of different deterministic methods (neural networks and regression models), reaching extremely high reliability levels. Another important case of CI was the assessment of hospitals and the work of hospital doctors carried out by researchers of the University of Louisville (Kentucky, USA) on the relationship between diagnosis and medical prescriptions [5, 7]. By using Text Miner, the researchers analysed the pharmaceutical orders together with the information on the medicine’s code number, the name of the medicine, the diagnosis and the names of the doctors who made out the prescription. Cluster analysis processing made it possible to identify different groups in which to recognise common medical practice or in anomalous cases, a wrong prescription or particularly innovative treatment. The natural result of this study was the creation of predefined lists to verify the relation between diagnosis and prescriptions. Another study, on the other hand, took into consideration the nine columns of the UB-92 module – the instrument supplied by Medicare for reimbursement applications – dedicated to the indication of secondary pathologies (most of all complications following pharmacological or therapeutic treatment). The modalities of the calculation of the degree of risk are however not always uniform. Furthermore, not all hospitals tend to make the calculation modalities public. In addition to this problem, the assigning of the ICD-9 codes to patients takes place, as a rule, on the basis of only the
indications defined by the medical staff completing the case history. Since the medical staff is not obliged to fill in the records describing all the complications or secondary pathologies in detail, and as the applications for reimbursement made to the national health services are based on the consideration of just the information describing the main diagnosis and the consequent treatment used, there are significant problems of accuracy in the analyses carried out by Medpar3 . The analysis performed on the data supplied by the Kentucky Hospital Association, referring to 28,000 cases of cardiovascular pathologies operations, have singled out some inefficiencies in the evaluation of certain risk factors. The use of text mining has therefore made it possible to demonstrate – for the State of Kentucky – some of the problems deriving from the adoption of a non-uniform procedure for the estimate of the degree of excellence in hospitals. Extraction Transformation Loading (ETL) Extraction Transformation Loading are aimed at filing non-structured textual material into categories and structured fields. The search engines are usually associated with ETL that guarantee the retrieval of information, generally by systems foreseeing conceptual browsing and questioning in natural language. The applications are found in the publishing sector, the juridical and political document field and medical-healthcare. The case studies presented refer to big European, and American publishing groups, placed together by the need to file their documents and to facilitate the retrieval of documents by advanced search engines permitting conceptual search. The use of complex semantic networks (thesaurus, ontologies), in fact, makes it possible to extract Internet and/or Intranet documents not only by means of keywords (which the traditional full text indexing search engines are limited to doing), but also concepts, that is, subjects or entities for which it is possible to define synonyms or relations [19]. For the filing, usually complex systems of taxonomies are defined to interact with automatic tools. In the legal documents sector the document filing and information management operations deal with the particular features of language, in which the identification and tagging of relevant elements for juridical purposes is necessary and the normalisation of normative references (e.g. to make the word “art.” equivalent to that of “article”) [5]. In the case of legal documents, the principal aim of the filing procedures is the optimisation of those of search and information retrieval. In the healthcare sector, the experience of the NHS medical centre of Modena [6], is indicative as it was aimed at the sharing of knowledge in complex systems in which people with different professional experience operate. An 3
Medpar (Medicare Provider Analysis and Review; http://www.cms.hhs.gov/) is the American national health service.
integrated knowledge system was realized, with the Cogito [7, 8] software developed by Expert System, by which non-structured documents of different types and formats are harmonised and put into one single database, which can be accessed by doctors or call centre operators. Human Resources Management (HR) TM techniques are also used to manage human resources strategically, mainly with applications reading and storing CVs for the selection of new personnel, as well as aiming at analysing staff’s opinions, monitoring the level of employee satisfaction. The solution streamlined by CV Distiller (Koltech) for Cr´edit Lyonnais made it possible to automatically store all the CVs being sent to the company, despite the source (forms on Internet, mail, paper documents), identifying all the relevant information for successive searches of personnel. In the future this tool will permit the analysis of CVs drafted in different languages simultaneously and it will be possible to make the information uniform, and then bring it back to a standard that is useful for further searches for competences (for example, the corresponding qualifications in countries with different education systems [13]). In the context of human resources management, the TM techniques are often utilized to monitor the state of health of a company (level of motivation of its employees) by means of the systematic analysis of informal documents. A good example of this is the case of ConocoPhilips [29], a fast-moving American company, which developed an internal system – the VSM (Virtual Signs Monitor) – able to find the intangible but crucial aspects of company life , the degree of experience and knowledge and the “productive” abilities. The approach chosen by Conoco was that of “measuring” the company mood by means of the indicators suggested by Sumantra Ghoshal’s theory [2] of “The Individualized Corporation”, which contrasts a new model based on completely different pillars, like stretch, discipline, trust and reciprocal support with the traditional managerial model founded on concepts of constraint, contract, control and compliance. This managerial model, according to Ghoshal’s formulation encourages the cooperation and collaboration between the elements of an organisation, improving their results. The collaboration with Temis enabled Conoco to refine its system for the monitoring of textual sources like e-mails, internal surveys of employees’ opinions, declarations of the management, internal and external chat lines, all representing important means for sounding the evolution of company culture. The morpho-syntactic and semantic analysis made it possible to relate the occurrences of certain expressions present in the textual sources with one (or more) of the indicators suggested by Ghoshal and representative of the two contrasting managerial models (“Organization Man” and “Individualized Corporation”).
4.2 Text Mining Applications in Customer Relationship Management and Market Analysis Customer Relationship Management (CRM) In CRM domain the most widespread applications are related to the management of the contents of clients’ messages. This kind of analysis often aims at automatically re-routing specific requests to the appropriate service or at supplying immediate answers to the most frequently asked questions. With reference to the Italian market, the sectors in which customer care seems to be most developed are banks, insurance companies and institutions, in which the civic networks4 have recently been developed. The need to give quick answers to potential and existing customers is particularly felt in the insurance sector, above all following the recent spread of companies that deliver services exclusively by phone. Among the different cases identified, one of the most interesting in Italy is that of the Linear Assicurazioni [9], which has adopted an automatic support system for the call centre operators. One of the first European civic networks was Iperbole, the website of the municipality of Bologna, founded at the same time as the most famous digital city of Amsterdam. Iperbole is a real electronic notice board giving citizens information supplied by different sources (local authorities, firms, social bodies, citizens associations), enabling all the members of the urban community to participate in public debates on local topics or to communicate (by electronic post) with other members of the community and with the promoters of the network themselves. In order to encourage participation in the network and the transparency of services, the municipality of Bologna streamlined the system Municipality Voyager (Omega Generation), a sorter of queries posed by the citizens and sent to the local council [10]. With MV, the user make their queries, entrusting themselves completely to the automatic system which analyses it and suggests one or more competent offices to send it to. The user can choose anyone that in their opinion are most appropriate to send their queries to, and they receive a protocol number that will accompany the course of the query. At the end of the procedure, the level of satisfaction is ascertained along with other information of a statistical type collected automatically. Market Analysis (MA) Market Analysis, instead, uses TM mainly to analyse competitors and/or monitor customers’ opinions to identify new potential customers, as well as 4
The phenomenon of civic networks has become particularly important since the early ’90s, when some municipalities – in an effort to strengthen their public and institutional role – experimented new methods of participation for the citizens [6].
to determine the companies’ image through the analysis of press reviews and other relevant sources. For many companies tele-marketing and e-mail activity represents one of the main sources for acquiring new customers. The Italian company Celi developed InfoDiver [11], a tool used by various companies which, by visiting the list of target sites, analyses the contents and automatically produces a list containing the names (with address, telephone number, email) of all the companies answering the required criteria. The TM instrument makes it possible to present also more complex market scenarios. This is the case, for example, of the consortium Telcal (Telematica Calabria) , set up by the Region of Calabria and some telecommunications companies with the aim of promoting innovation of the whole Region with the construction of a capillary telematic network over the territory [12, 13, 27]. The Consortium started the development of an application called Market Intelligence System – MIS, an “intelligent” software – developed with the contribution of Temis and based on automatic text reading – permitting the small and medium companies of the Region to discover, according to their own peculiarities and without any prior indication, in which parts of the world market to promote their products. Starting with the screening of a careful selection of websites and thousands of articles coming from about 3,000 sources of the international press, chosen for specific topics of interest and periodically downloaded into a database, the system makes it possible to filter the information and to obtain answers to precise queries by the users, strategically supporting their marketing activity. For example, searching for information to understand how to promote the tourist demand of some operators even in the winter months, 1,500 documents were identified and then grouped together into 25 clusters. One of these clusters proved to be particularly interesting as it revealed a tourist movement, in the winter season, of elderly people from Scandinavia to the hotels of Florida, organised by certain tour operator. Having obtained this information, an example of ad hoc strategy could be to contact the operator in question to encourage the offer of packages from Scandinavia to Calabria, stressing some of the competitive advantages of the Region (a climate that is more suitable for middle aged people with respect to Florida, rich sea and land fauna, the presence of numerous gastronomical specialities). 4.3 Text Mining Applications in Technology Watch (TW) The technological monitoring, which analyses the characteristics of existing technologies, as well as identifying emerging technologies, is characterised by two elements: the capacity to identify in a non-ordinary way what already exists and that is consolidated and the capacity to identify what is already available at an embryonal state, identifying through its potentiality, application fields and relationships with the existing technology.
The case studies show that, as in the example of the German company European Molecular Biology Laboratory, specific statistical techniques are applied to represent – by means of factorial techniques – the concepts and scientific topics around which are organised the documents being analysed. The result of this application is a search engine that is useful to identify and create search keys for documents belonging to similar subject contexts [14, 22]. The documents (for example, the abstracts extracted from PubMed or Medline) are analysed and automatically classified and the data coming from the statistical procedures of data reduction become the base for the creation of the classification rules. The end users of the system receive updating on the new publications and the following topics of research, the result of a selection of 2,000 new abstracts a day from databases like PubMed/Medline. Another interesting case is that of the scientific park of Trieste (IT) AREA Science Park [15, 17], engaged in the valuation of research and the transfer of innovation to production. Synthema developed a search and patent classification engine, able to download – from public and private databases – the patents and scientific publications of potential interest to the user. The bibliographical references of each document, belonging to the same thematic domain, are first of all normalised and then imported into the system. Then a multilingual tool (Italian, English, French and German) allows the operator to automatically extract the key information from the texts, to index and memorise them in the database. The users have access to the search service by means of a simple Internet browser, that enables them to carry out the search in the database both by key words and by a special mask that reproduces the structure-type of the document. The output of the search consists either in the extraction of single documents to consult or in their classification on a conceptual or thematic basis, by means of which the user can later refine their search according to their needs. 4.4 Text Mining Applications in Natural Language Processing (NLP) and Multilingual Aspects Questioning in Natural Language The most important case of application of the linguistic competences developed in the TM context is the construction of websites that support systems of questioning in natural language. For example, the conceptual meta-search engine adopted for the website of the Italian Government [16, 17, 18], developed by Expert System, which has the ability to recognise and interpret natural language, thus simplifying the search process for information on the part of the users. The need to make sites cater as much as possible for the needs of customers who are not necessarily expert in computers or web search is common also to those companies that have an important part of their business on the web. A good example of this is the French company La Redoute, an on-line
commerce company that saw a noticeable increase in its orders following the introduction, in 2001, of a search engine that supports queries in natural language, with the adoption of the software iCatalog (Sinequa) [19]. With the iIntuiton technology, also produced by Sinequa, search engines were used with the possibility of language questioning for the French site AlloCin´e [20] dedicated to cinema and for Leroy Merlin, one of the most important companies in the field of DIY and building materials [21]. Multilingual Applications In NLP, Text Mining applications are also quite frequent and they are characterised by multilinguism. Besides the examples that have already been quoted in this section, the identification and retrieving system of web pages in different languages of the American company Inktomi [22] and a multilingual search engine of the company Verity represent further applications [23]. Inktomi, which manages a file of websites to which search engines of the world refer to, experimented Text Mining to identify and analyse web pages published in different languages, by means of the Inxight LinguistX R tool, an engine of natural language processing (produced by the Platform American company Inxight). Verity is a Californian company that deals with the design and construction of company portals and knowledge management systems able to automatically manage information by means of advanced search systems. Through the phases of text parsing (normalisation, segmentation, lemmatisation and decompounding) the Verity search engine permits more efficient retrieving operations, as a result of the ability to manage documents drafted in 10 languages. This has enabled the Californian company to get onto the international market, from which it was precluded beforehand. Voice Recognition Some companies have also had an interesting experience in the field of the processing and voice recognition of dictated recordings. In the medical sector, to quote the two most relevant cases, the Ospedale Aziendale of Merano [24] and the Ospedale Unico Versilia-Lido Di Camaiore [25] used a voice recognition system (Medical Voice Suite, produced by Synthema), which is also multilingual and useful for the management and filling in of medical reports. Before the introduction of this technology, the medical reports were filled in by the doctors, but were only available some days following the actual reporting. The solution adopted makes it possible to speed up the production of medical reports which, by means of voice recording, can now be dictated, corrected, printed and signed by the doctors during the interval between one operation and another, both directly in the operating theatre and from their offices.
The secretaries correct the reports, listening to the audio synchronised with the text and making the necessary changes. The doctor can dictate the text fluently, in Italian (and in German in the case of Merano), using both the medical dictionary and the specialist documents (reports, experts' reports, clinical records, etc.). Furthermore, as new documents are dictated the voice recognition improves and the errors decrease. The use of the personal dictionary, to which up to 64,000 new words can be added, contributes to the optimization of the system's performance. Documents are saved and filed automatically once the doctor types in the patient's details. The system only needs an initial training phase, to learn each member of staff's different "way of speaking" (acoustics, intonation and timbre). The regular use of the software has enabled the doctors to overcome their fear and initial prejudices towards something "new", helped also by the high levels of performance achieved by the system. By using this system the objective of speeding up report production has been reached (reports are completed on the same day as the operation), without the need for additional personnel.
5 A Model for Selection Strategy

Thinking more generally about the key elements that typify the case studies taken into account, every one of them seems to be characterised by two main aspects, which strictly depend on the type of user application:

1. the level of customisation/personalisation followed in the development of the tools (using ontologies, rules and dictionaries), which may be necessary for some specific domains but not for others;
2. the level of integration between user and developer in achieving the task, which depends either on how important it is to co-operate to reach the objectives or on whether the user simply wants to use the tool as a "black box".

The level of integration ranges from pre-packaged, ready-to-use solutions to tailor-made and extremely personalised ones. Combining the low and high levels of the two variables, integration and customisation, yields four different situations (see Table 2), as described below:

A. Standardisation (low integration – low customisation). In these cases the users meet their needs with the help of a ready-to-use solution, without developing any kind of resources, either internally or externally. This solution is the cheapest compared to the other situations, but the quality of the results may not be as good. An example of a low-integration, low-customisation application is the implementation of a search engine with semantic browsing functionalities on
Table 2. Solution Selection Strategy Model

                              Customisation
                        Low                        High
  Integration   low     A. Standardisation         B. Outsourcing
                high    C. Internalisation         D. Synergism and Partnership
Le Monde's web site [26]. This engine presents the pertinent documents by automatic classification, based on the extraction of correctly formed nominal groups from the text. Based on the iIntuition technology (by Sinequa), the engine operates by default in different modalities (lexical, linguistic, semantic, mathematical, etc.), but no specific linguistic resources have been developed for it. Another example is the use of cluster analysis in the Hewlett-Packard case study [27, 28]. After the merger with Compaq in 2002, Hewlett-Packard faced the problem of creating a single hierarchy in which to classify the two companies' different lines of products. As a source HP used more than 700 GB of documents concerning its products, at different levels of granularity. With the use of a TM tool and the pre-defined hierarchical TM strategy (data acquisition, data pre-processing, data reduction and data analysis) suggested by the software used (SAS Text Miner™), HP created a model of 25 macro-classes in which to classify each product (the model is updated monthly).

B. Outsourcing (low integration – high customisation). This is the case in which the developer is totally responsible for the project, developing the tools externally and later implementing them within the user's software interface. Examples, among those cited above, are the automatic message routing service implemented on the website of the Municipality of Bologna, and the multilingual services for the identification and retrieval of web pages in different languages [29], implemented by Inktomi.

C. Internalisation (high integration – low customisation). This is quite common when the user prefers – because of the type of application or the sector of activity – to develop some resources internally (for example the complex dictionaries used for knowledge extraction in the pharmaceutical field, as in the case of the NHS medical centre of Modena).

D. Synergism and partnership (high integration – high customisation). This represents the best solution in terms of achievable results. For high quality and efficiency, an implementation based on full synergism and partnership between developer and user can be a really expensive option in terms of the time and money invested. For this reason, this solution is generally suitable for companies that have enough financial and human resources at their disposal to undertake this task, as in the case of IBM presented in the section on Competitive Intelligence.
While our pragmatic approach is able to picture the development of TM applications as close to reality as possible, it does present some limitations, mostly concerning the type and quantity of information available. Firstly, there are problems of heterogeneity in the material collected, which is often promotional in tone (especially on the companies' web sites) or evasive and unhelpful for research purposes, because of the confidentiality of certain applications and the privacy demanded by the customers themselves. Secondly, there are problems of incongruity in the availability of the information supplied by the TM developers, which seems to depend, to some extent, on the type and nature of the companies. In fact, the smaller, less-known companies offering TM solutions are more willing to put a greater amount of information on their websites, thus promoting themselves and the TM tools developed as their core business. On the contrary, the big companies are less eager to publicise their TM experience, probably because TM is one of the services (but not the most important one) provided by their data mining or statistical tools. In this second case, the international fame and importance of the developers have made it possible to collect the necessary information from other sources, especially those focused on Business Intelligence and Information Technology. Despite these limits, the joint analysis of the different case studies gives an adequate picture of the state of the art of TM applications, allowing us to affirm that in the coming years linguistic resources and multilingualism will be crucial in this domain and that "on line" service institutions will increase tremendously. Another problem is how to establish the return on investment from the use of TM technologies. This is evident only when TM applications replace human operators, thus saving on costs; furthermore, for a long time there will not be a program that can fully interpret text, let alone text and numbers. Probably, for the future successful development of TM, as a consequence of its present increasing dissemination, it will be necessary for all the players involved to make an additional effort to provide a clear picture of TM tools and their benefits to businesses at large. This need concerns specifically the certification of TM processing in the companies' TM tools, in order to give customers the chance to evaluate the statistical reliability of these tools with the help of commonly known quality indicators (e.g. the index of homogeneity in cluster analysis). TM tools could become the sensory organs of a business in the future, and that future is not far off.
References

1. Balbi, S., Bolasco, S., Verde, R. (2002), "Text mining on elementary forms in complex lexical structures", in JADT 2002. Actes des 6es journées internationales d'analyse statistique de données textuelles, Saint-Malo, IRISA, pp. 89–100.
2. Bartlett, C.A., Ghoshal, S. (1998), The Individualized Corporation: A Fundamentally New Approach to Management, Heinemann, London.
3. Bolasco, S., Baiocchi, F., Canzonetti, A., della Ratta-Rinaldi, F., Feldman, A. (2004), "Applications, sectors and strategies of Text Mining: a first overall picture", in S. Sirmakessis (ed.) Text Mining and its Applications, Springer Verlag, Heidelberg, pp. 37–52.
4. Bolasco, S., Bisceglia, B., Baiocchi, F. (2004), Estrazione automatica di informazione dai testi, Mondo Digitale, Vol. 3, No. 1, pp. 27–43.
5. Bolioli, A., Dini, L., Mercatali, P., Romano, F. (2002), "For the Automated Mark-up of Italian Legislative Texts in XML", Fifteenth Annual International Conference on Legal Knowledge and Information Systems Conference Proceedings, December 2002, Institute of Advanced Legal Studies, London.
6. Castells, M. (2002), Galassia Internet, Feltrinelli, Milano. [orig. 2001, Internet Galaxy, Oxford University Press]
7. Cerrito, P.B., Badia, A. (2003), The Application of Text Mining Software to Examine Coded Information, SIAM International Conference on Data Mining, Philadelphia.
8. Collica, R.S. (2003), Mining textual data for CRM applications, DM Review, February 2003.
9. Dini, L., Mazzini, G. (2002), "Opinion classification through information extraction", in A. Zanasi, C.A. Brebbia, N.F.F. Ebecken and P. Melli (eds), Data Mining III, WIT Press, pp. 299–310.
10. Dulli, S., Rizzi, A., Patarnello, F., Frizzo, V. (2004), "Scientific and pharmaceutical documentation management: a Text Mining approach in GlaxoSmithKline", Proceedings of Text Mining for Business Intelligence – Nemis Annual Conference, Università di Roma "La Sapienza", Rome, pp. 34–36.
11. Feldman, R., Dagan, I. (1995), "Knowledge Discovery in Textual Databases", Proceedings of KDD-95, pp. 112–117.
12. Feldman, R., Dagan, I., Hirsh, H. (1998), Mining text using keyword distributions, Journal of Intelligent Systems, 10, pp. 281–300.
13. Gire, F., Kolodziejczyk, S. (2004), "Automatic analysis of curriculum vitae, a case study: the CV Distiller software", Proceedings of Text Mining for Business Intelligence – Nemis Annual Conference, Università di Roma "La Sapienza", Rome, pp. 27–32.
14. Lebart, L., Salem, A., Berry, L. (1998), Exploring Textual Data, Kluwer Academic Publishers, Dordrecht-Boston-London.
15. Mani, I., Maybury, M.T. (eds.) (1999), Advances in automatic text summarization, The MIT Press, Cambridge, Massachusetts.
16. Nasukawa, T., Nagano, T. (2001), Text analysis and knowledge mining system, IBM Systems Journal, Vol. 40, No. 4.
17. Neri, F., Raffaelli, R. (2002), "Text Mining, recuperare l'informazione nascosta", Proceedings of Conference TIPI – Tecnologie Informatiche nella Promozione della lingua Italiana, Fondazione Ugo Bordoni, Ministero delle Comunicazioni, June 2002, Rome.
18. Neri, F., Raffaelli, R. (2004), Text Mining applied to multilingual corpora (in this volume).
19. Pazienza, M.T., Vindigni, M. (2003), "Agents Based Ontological Mediation in IE Systems", in Pazienza, M.T. (ed.), Information Extraction in the Web Era, Lecture Notes in Artificial Intelligence 2700, Springer Verlag, Berlin-Heidelberg, pp. 92–128.
20. Poibeau, T. (2003), Extraction Automatique d'Information: du texte brut au web sémantique, Hermès – Lavoisier, Paris.
21. Rajman, M., Vesely, M. (2004), "From text to knowledge: document processing and visualization. A text mining approach", in S. Sirmakessis (ed.) Text Mining and its Applications, Springer Verlag, Heidelberg, pp. 7–24.
22. Reincke, U. (2003), "Profiling and classification of scientific documents with SAS Text Miner™", Workshop des GI-Arbeitskreises "Knowledge Discovery" (AK KD), University of Oldenburg, October 2003, Karlsruhe.
23. Salton, G. (1989), Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer, Addison-Wesley.
24. Sebastiani, F. (2002), Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No. 1, pp. 1–47.
25. Sullivan, D. (2001), Document Warehousing and Text Mining: Techniques for Improving Business Operations, Marketing and Sales, Wiley, New York.
26. Zanasi, A. (2001), "Text Mining: the new Competitive Intelligence Frontier. Real cases in industrial, banking and telecom/SMEs world", VSST2001 Conference Proceedings, Barcelona.
27. Zanasi, A. (2002a), "L'analisi dei testi nel CRM analitico", ComputerWorldOnline, http://www.cwi.it/showPage.php?template=rubriche&id=11776.
28. Zanasi, A. (2002b), "Text Mining: Competitive and Customer intelligence in real business cases", IntEmpres2002 Conference Proceedings, Instituto de Información Científica y Tecnológica (IDICT), La Habana, October 2002.
29. Zanasi, A. (2004), "Temis Insight Discoverer and Online Miner in a US CRM case and on Italian government data", Proceedings of Text Mining for Business Intelligence – Nemis Annual Conference, Università di Roma "La Sapienza", Rome, pp. 21–26.
Web Site References

1. http://nemis.cti.gr/Public%20Deliverables/Forms/AllItems.htm
2. http://www.temis-group.com/ (see "Clients" page: "Insight Discoverer™ Extractor for effective Competitive Intelligence. Total Business Case")
3. http://solutions.journaldunet.com/0312/031215_total_temis.shtml (Deblock F., "Total appuie sa veille sur un extracteur de texte", Article from "Le Journal du Net", December 2003)
4. http://geostasto.eco.uniroma1.it/nemis/Patarnello&oth.PDF
5. http://www.hpcwire.com/dsstar/04/0106/107206.html ("SAS Text Miner Digs Into Unstructured Text, Cuts Costs", in DStar, Vol. 8, No. 1, January 6, 2004)
6. http://www.expertsystem.it/customers/eng_ausl.htm
7. http://www.expertsystem.it/pdf/brochure/Cogito.pdf
8. http://www.expertsystem.it/pdf/white%20paper/ita_Cogito.pdf
9. http://www.expertsystem.it/customers/eng_linear.htm
10. http://www.comuni.it/ndacomuni/articolo.php?idart=18
11. http://www.text-mining.it/market/market_case.htm
12. http://www.pubit.it/sunti/euc0408ab.html
13. http://www.temis-group.com/ (see "Clients" page)
14. http://www.daviddlewis.com/events/otc2003/SAS_TM_Paper.pdf
15. http://www.synthema.it/documenti/A%20new%20way%20to%20explore%20patents%20databases.pdf
16. http://www.expertsystem.it/customers/ita_presidenzaCdM.htm
17. http://www.expertsystem.it/releases/img_339.gif ("Tecnologia Linguistica per servizi realmente utili", Article from "Informatica ed Enti Locali", November/December 2002)
18. http://www.expertsystem.it/releases/img_351.jpg ("Palazzo Chigi si rifà il look on line", Article from ".com", November 2002)
19. http://www.sinequa.com/html/article-63.html
20. http://www.sinequa.com/html-uk/allocine-en.html
21. http://solutions.journaldunet.com/0106/010608_merlin.shtml (Crochet Damais A., "Avec Sinequa, Leroy Merlin opte pour un moteur de recherche en langage naturel", Article from "Le Journal du Net", June 2001)
22. http://www.inxight.com/customers/success_stories.php#inktomi
23. http://www.inxight.com/pdfs/verity.pdf
24. http://www.synthema.it/documenti/OrtopediaMerano%20v3.5.pdf
25. http://www.synthema.it/documenti/RadiologiaVersilia%20V1.3.pdf
26. http://www.adae.gouv.fr/IMG/pdf/sinequa_dossier_adae.pdf (De Loupy C., "L'apport de connaissances linguistiques en recherche documentaire")
27. http://www.sas.com/success/hp.html
28. http://www.computerworld.com/databasetopics/data/story/0,10801,80228,00.html
29. http://www.motoridiricerca.it/inktomi.htm
Novel Approaches to Unsupervised Clustering Through k-Windows Algorithm

D.K. Tasoulis and M.N. Vrahatis

Computational Intelligence Laboratory, Department of Mathematics, University of Patras Artificial Intelligence Research Center (UPAIRC), University of Patras, GR–26110 Patras, Greece.

Summary. The extraction of meaningful information from large collections of data is a fundamental issue in science. To this end, clustering algorithms are typically employed to identify groups (clusters) of similar objects. A critical issue for any clustering algorithm is the determination of the number of clusters present in a dataset. In this contribution we present a clustering algorithm that, in addition to partitioning the data into clusters, approximates the number of clusters during its execution. We further present modifications of this algorithm for different distributed environments and dynamic databases. Finally, we present a modification of the algorithm that exploits the fractal dimension of the data to partition the dataset.
1 Introduction

Clustering is a fundamental field of exploratory data analysis that aims at discovering hidden structure in datasets. More specifically, clustering partitions a set of objects into groups (clusters) such that objects within the same group bear a closer similarity to each other than to objects in different groups. Clustering techniques have a very broad application domain including data mining [22], statistical data analysis [2], compression and vector quantization [40], global optimization [8, 54], web personalization [41] and text mining [19, 45]. The first comprehensive foundations of these methods were published in 1939 [55], but the earliest references date back to Aristotle and Theophrastus in the fourth century B.C. and to Linnaeus in the 18th century [30]. Following [1], to define the clustering problem more formally, we first assume that S is a set of n points in the d–dimensional space R^d equipped with a metric. A k-clustering of S, for an integer k ≤ n, is defined as a partition Σ of S into k subsets S1, . . . , Sk, each one representing a different cluster. The size of a cluster Si is defined as the maximum distance, under the metric, between a fixed point ci, called the center of the cluster, and any other point of
Corresponding author: M.N. Vrahatis, email: [email protected]
Si. Similarly, the size of a k-clustering Σ is defined as the maximum cluster size among all the clusters in Σ [1]. The k-center problem is defined as the computation of a k-clustering of the smallest possible size. The k-center problem can also be formulated as covering S by congruent disks of the smallest possible size. Sometimes the centers of the clusters are required to be a subset of S. This requirement defines the discrete k-center problem. In some applications the number of points in each cluster is also important. Thus, if we define, for an integer L > 0, the L-capacitated k-clustering of S to be a partition Σ of S into k clusters with no cluster containing more than L points, then the L-capacitated k-center problem is defined as the computation of the L-capacitated k-clustering having the smallest possible size [1]. Clustering is a difficult scientific problem, since even the simplest clustering problems are known to be NP-Hard [1]. The Euclidean k-center problem in the plane is NP-Hard [33]. In fact, it is NP-Hard to approximate the two-dimensional k-center problem even under the L∞-metric [23]. Irrespective of the method used, a fundamental issue in cluster analysis is the determination of the number of clusters present in a dataset. This issue remains an open problem in cluster analysis. For instance, well–known and widely used iterative techniques, such as the k-means algorithm [24], require the user to designate a priori the number of clusters present in the data. To this end, we present the unsupervised k–windows clustering algorithm. This algorithm, by employing windowing techniques, attempts to discover not only the clusters but also their number, in a single execution. Assuming that the dataset lies in d dimensions, the algorithm initializes a number of d–dimensional windows (boxes) over the dataset. Subsequently, it iteratively moves and enlarges these windows in order to cover the existing clusters. The approximation of the number of clusters is based on the idea of considering a large number of initial windows. The windowing technique of the k-windows algorithm allows a large number of initial windows to be examined without a significant overhead in time complexity. Once movement and enlargement of all windows terminate, all overlapping windows are considered for merging. The merge operation determines whether two windows belong to the same cluster by examining the proportion of points in the overlapping area to the total number of points in each window. Thus, the algorithm is capable of providing an approximation to the actual number of clusters. Database technology has enabled organizations to collect data at a constantly increasing rate. The development of algorithms that can extract knowledge in the form of clustering rules from such distributed databases has become a necessity. Distributed clustering algorithms attempt to merge computation with communication and explore all facets of the distributed clustering problem. The k-windows algorithm can be extended to a distributed environment. Considering a non-stationary environment where update operations on the database are allowed, maintaining a clustering result at a low computational cost
becomes important. Utilizing a recently proposed dynamic data structure, we present an extension of the k-windows algorithm suitable for cluster maintenance. An important property that describes the complexity of a dataset is its fractal dimension. Incorporating estimates of the fractal dimension in the workings of the k-windows algorithm allows it to extract qualitative information about the underlying clusters. The rest of the paper is organized as follows. The details of the unsupervised k-windows algorithm are described in Section 2. Next, in Section 3, two distributed versions of the algorithm are presented. Section 4 presents an extension of the algorithm to non-stationary environments. A modification of the algorithm that uses the fractal dimension is presented in Section 5. In Section 6 computational experiments are presented that demonstrate the applicability of the algorithm on various datasets. The paper ends with concluding remarks in Section 7.
2 The unsupervised k-windows clustering algorithm

The k-windows clustering algorithm aims at capturing all the patterns that belong to one cluster within a d–dimensional window [56]. To this end, it employs two fundamental procedures: movement and enlargement. The movement procedure aims at positioning each window as close as possible to the center of a cluster. During this procedure each window is centered at the mean of the patterns that are included in it. The movement procedure is iteratively executed as long as the distance between the new and the previous center exceeds the user–defined variability threshold, θv. On the other hand, the enlargement process tries to augment the window to include as many patterns from the current cluster as possible. Thus, the range of each window, for each coordinate separately, is enlarged by a proportion θe/l, where θe is user–defined and l stands for the number of previous valid enlargements. Valid enlargements are those that cause a proportional increase in the number of patterns included in the window exceeding the user–defined coverage threshold, θc. Further, before each enlargement is examined for validity, the movement procedure is invoked. If an enlargement for a coordinate c ≥ 2 is considered valid, then all coordinates c′ such that c′ < c undergo enlargement, assuming as initial position the current position of the window. Otherwise, the enlargement and movement steps are rejected and the position and size of the d–range are reverted to their values prior to the enlargement. In Fig. 1 the two processes are illustrated. As previously mentioned, a critical issue in cluster analysis is the determination of the number of clusters that best describe a dataset. The unsupervised k-windows algorithm has the ability to provide an approximation to this number. The key idea is to initialize a large number of windows. After movement and enlargement of all windows terminates, all overlapping
Fig. 1. (a) Sequential movements M2, M3, M4 of initial window M1. (b) Sequential enlargements E1, E2 of window M4.
windows are considered for merging. During this operation, for each pair of overlapping windows, the number of patterns that lie in their intersection is computed. Next, the proportion of this number to the total number of patterns included in each window is calculated. If this proportion exceeds a user-defined threshold, θs, the two windows are considered to be identical and the one containing the smaller number of points is disregarded. Otherwise, if the mean of the two proportions exceeds a second user-defined threshold, θm, the windows are considered to have captured portions of the same cluster and are merged. An example of this operation is exhibited in Fig. 2; the extent of overlap of windows W1 and W2 exceeds the θs threshold, and W1 is deleted. On the other hand, windows W3 and W4 are both considered to belong to the same cluster. Finally, windows W5 and W6 are considered to capture two different clusters. An example of the overall workings of the algorithm is presented in Fig. 3: in Fig. 3(a) a dataset that consists of three clusters is shown, along with six initial windows; in Fig. 3(b), after the merging operation, the algorithm has correctly identified the three clusters. The computationally demanding step of the k-windows clustering algorithm is the determination of the points that lie in a specific window. This is the well studied orthogonal range search problem [38]. Formally this problem can be defined as follows:
Fig. 2. (a) W1 and W2 satisfy the similarity condition and W1 is deleted. (b) W3 and W4 satisfy the merge operation and are considered to belong to the same cluster. (c) W5 and W6 have a small overlap and capture two different clusters.
Fig. 3. An example of the application of the k-windows algorithm.
Input: (a) V = {p1, . . . , pn}, a set of n points in R^d. (b) A d-range query Q = [a1, b1] × [a2, b2] × · · · × [ad, bd] specified by (a1, . . . , ad) and (b1, . . . , bd), with aj ≤ bj.
Output: Report all points of V that lie within the d-range Q.
Numerous Computational Geometry techniques have been proposed to address this problem. All these techniques involve a preprocessing stage at which they construct a data structure storing the patterns. This data structure allows them to answer range queries fast. In Table 1 the computational complexity of various such approaches is summarized. In detail, for applications of very high dimensionality, data structures like the Multidimensional Binary Tree [38] and those of Bentley and Maurer [10] seem more suitable. On the other hand, for low dimensional data with a large number of points the approach of Alevizos [3] appears more attractive.
Method                       | Preprocessing Time | Space                        | Query Time
Multidim. Binary Tree [38]   | Θ(dn log n)        | Θ(dn)                        | O(s + d n^(1−1/d))
Range Tree [38]              | O(n log^(d−1) n)   | O(n log^(d−1) n)             | O(s + log^d n)
Willard and Lueker [38]      | O(n log^(d−1) n)   | O(n log^(d−1) n)             | O(s + log^(d−1) n)
Chazelle [15]                | O(n log^(d−1) n)   | O(n log^(d−1) n / log log n) | O(s + log^(d−1) n)
Chazelle and Guibas [16]     | O(n log^(d+1) n)   | O(n log n)                   | O(s + log^(d−2) n)
Alevizos [3]                 | O(n log^(d−1) n)   | O(n log^(d−1) n)             | O(s + log^(d−2) n)
Bentley and Maurer [10]      | O(n^(2d−1))        | O(n^(2d−1))                  | O(s + d log n)
Bentley and Maurer [10]      | O(n^(1+ε))         | O(n^(1+ε))                   | O(s + log n)
Bentley and Maurer [10]      | O(n log n)         | O(n)                         | O(n^ε)

Table 1. Methods for orthogonal range search with the corresponding time and space complexity (n is the number of points, d is their dimension and s is the size of the result of the query).
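To make the operation concrete, the following is a minimal, brute-force sketch of an orthogonal range query. It simply checks every point against the query box, so it corresponds to the naive O(dn) approach and serves only as a correctness reference; it does not implement any of the preprocessing-based structures of Table 1.

```python
# Brute-force orthogonal range search: report all points of V that lie in the
# d-range Q = [a1,b1] x ... x [ad,bd].
import numpy as np

def range_query(V, a, b):
    """V: (n, d) array of points; a, b: lower and upper corners of the query box."""
    V, a, b = np.asarray(V), np.asarray(a), np.asarray(b)
    mask = np.all((V >= a) & (V <= b), axis=1)
    return V[mask]

V = np.random.default_rng(0).uniform(0, 10, size=(1000, 3))
print(len(range_query(V, [2, 2, 2], [5, 5, 5])))   # number of points in the box
```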
56
D.K. Tasoulis and M.N. Vrahatis
Based on the above discussion we propose the following high level description of the algorithm:

Unsupervised k-windows clustering algorithm (a, θe, θm, θs, θc, θv, k)
  execute W = DetermineInitialWindows(k, a)
  for each d–range wj in W do
    repeat
      execute movement(θv, wj)
      execute enlargement(θe, θc, θv, wj)
    until the center and size of wj remain unchanged
  execute merging(θm, θs, W)
  Output clusters cl1, cl2, . . . such that: cl_i = {x : x ∈ wj, label(wj) = l_i}
function DetermineInitialWindows(k, a)
  initialize k d–ranges wm1, . . . , wmk, each of size a
  select k random points from the dataset and center the d–ranges at these points
  return the set W of the k d–ranges
function movement(θv, a d–range w)
  repeat
    find the patterns that lie within the d–range w
    calculate the mean m of these patterns
    set the center of w equal to m
  until the distance between m and the previous center of w is less than θv

function enlargement(θe, θc, θv, a d–range w)
  repeat
    foreach coordinate di do
      repeat
        enlarge w across di by θe%
        execute movement(θv, w)
      until the increase in the number of patterns across di is less than θc%
  until the increase in the number of patterns is less than θc% across every di
function merging(θm, θs, a set W of d–ranges)
  for each d–range wj in W not marked do
    mark wj with label wj
    if ∃ wi ≠ wj in W that overlaps with wj
      compute the number of points n that lie in the overlap of the two windows
      if n/|wi| ≥ θs and |wi| < |wj|
        disregard wi
      else if 0.5 (n/|wj| + n/|wi|) ≥ θm
        mark all wi-labeled d–ranges in W with label wj
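To make the above procedures concrete, the following is a minimal NumPy sketch of the three functions. It is an illustrative re-implementation rather than the authors' code: the window representation (center and half-size per coordinate), the fixed enlargement proportion (instead of θe/l), the collapsing of the similarity and merge tests into a single labelling step, and the chosen parameter values (including θs, which is not reported in the experiments below) are all simplifying assumptions.

```python
import numpy as np

def points_in(X, center, radius):
    """Indices of the patterns lying inside the window (center +/- radius)."""
    return np.where(np.all(np.abs(X - center) <= radius, axis=1))[0]

def movement(X, center, radius, theta_v=0.02):
    while True:
        idx = points_in(X, center, radius)
        if len(idx) == 0:
            return center
        new_center = X[idx].mean(axis=0)
        if np.linalg.norm(new_center - center) < theta_v:
            return new_center
        center = new_center

def enlargement(X, center, radius, theta_e=0.8, theta_c=0.2, theta_v=0.02):
    radius = radius.copy()
    improved = True
    while improved:
        improved = False
        for d in range(len(radius)):
            old_n = len(points_in(X, center, radius))
            trial = radius.copy()
            trial[d] *= 1.0 + theta_e                 # enlarge coordinate d
            trial_center = movement(X, center, trial, theta_v)
            new_n = len(points_in(X, trial_center, trial))
            if old_n > 0 and (new_n - old_n) / old_n >= theta_c:
                center, radius = trial_center, trial  # valid enlargement: keep it
                improved = True
    return center, radius

def merging(X, windows, theta_s=0.9, theta_m=0.1):
    """Label windows; overlapping windows that pass either test share a label."""
    labels = list(range(len(windows)))
    sets = [set(points_in(X, c, r)) for c, r in windows]
    for j in range(len(windows)):
        for i in range(j):
            n = len(sets[i] & sets[j])
            if n == 0:
                continue
            similar = n / len(sets[i]) >= theta_s or n / len(sets[j]) >= theta_s
            merged = 0.5 * (n / len(sets[i]) + n / len(sets[j])) >= theta_m
            if similar or merged:
                labels[j] = labels[i]
    return labels

# toy run: two Gaussian blobs, four initial windows of half-size 1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(8, 1, (200, 2))])
windows = []
for c in X[rng.choice(len(X), 4, replace=False)]:
    r = np.full(2, 1.0)
    windows.append(enlargement(X, movement(X, c, r), r))
print(merging(X, windows))   # windows covering the same blob share a label
```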
3 Distributing the clustering process

The ability to collect, store and retrieve data has been constantly increasing throughout the past decades. This fact has rendered the development of algorithms that can extract knowledge, in the form of clustering rules, from various databases simultaneously a necessity. This trend has been embraced by distributed clustering algorithms, which attempt to merge computation with communication and explore all facets of the distributed clustering problem. Although several approaches have been introduced for parallel and distributed Data Mining [13, 25, 28], parallel and distributed clustering algorithms have not been extensively studied. In [58] a parallel version of DBSCAN [43] and in [18] a parallel version of k-means [24] were introduced. Both algorithms start with the complete data set residing in one central server and then distribute the data among the different clients. For instance, in the case of parallel DBSCAN, data is organized at the server site within an R*-tree [9]. The preprocessed data are then distributed among the clients, which communicate with each other via messages. Typically, in a distributed computing environment the dataset is spread over a number of different sites. Thus, let us assume that the entire dataset X is distributed among m sites, each one storing Xi for i = 1, . . . , m, so that

X = X1 ∪ X2 ∪ · · · ∪ Xm.
Furthermore, let us assume that there is a central site, C, that will hold the final clustering results. At this point, different assumptions can be considered for the nature of communication among the sites. Primarily, we can consider that the sites are connected through a high speed network and that data disclosure is allowed. On the other hand, a different assumption would enforce minimal communication among the sites. This could be due to privacy issues, or to very slow and expensive network connections. In the following paragraphs two versions of the k-windows algorithm will be presented for the two opposing assumptions. Each version takes into consideration the underlying restrictions of the environment and tries to provide efficient and effective clustering results.

3.1 Distributed clustering for minimal communication environments

Assuming an environment that enforces minimal communication, it is possible to modify the k-windows algorithm to distribute locally the whole clustering
procedure. In more detail, at each site i the k-windows algorithm is executed over the Xi dataset. This step results in a set of d–ranges (windows) Wi for each site. To obtain the final clustering result over the whole dataset X, all the final windows from each site are collected at the central node C. The central node is responsible for the final merging of the windows and the construction of the final results. As has already been mentioned in Section 2, all overlapping windows are considered for merging. The merge operation is based on the number of patterns that lie in the intersection of the windows. This version of the algorithm assumes that the determination of the number of patterns in each intersection between two windows may be impossible. For example, a site might not want to disclose this kind of information about its data. Alternatively, the exchange of data might take place over a very slow network that restricts the continuous exchange of information. Under this constraint, the proposed implementation always considers two overlapping windows to belong to the same cluster, irrespective of the number of overlapping points. The θm and θs parameters become irrelevant. A high level description of the proposed algorithmic scheme follows:

Minimal communication distributed k-windows
  for each site i, with i = 1, . . . , m
    execute the k-windows algorithm over Xi
    send Wi to the central node C
  At the central node C:
    for each site i
      get the resulting set of d–ranges Wi
      set W ← W ∪ Wi
    {comment: d–range merging}
    for each d–range wj in W not marked do
      mark wj with label wj
      if ∃ wi ≠ wj that overlaps with wj
        then mark wi with label wj
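A minimal sketch of this central-node merging step follows. Since point counts are unavailable, the clusters are simply the connected components of the "overlaps" relation on the collected windows; the window representation (center and half-size per coordinate) and the union-find bookkeeping are illustrative assumptions.

```python
# Merging at the central node: any two overlapping windows are placed in the
# same cluster, i.e. clusters are connected components of the overlap graph.
import numpy as np

def overlaps(w1, w2):
    (c1, r1), (c2, r2) = w1, w2
    return np.all(np.abs(c1 - c2) <= r1 + r2)   # axis-aligned boxes intersect

def merge_windows(windows):
    labels = list(range(len(windows)))
    def find(i):                                 # union-find with path halving
        while labels[i] != i:
            labels[i] = labels[labels[i]]
            i = labels[i]
        return i
    for j in range(len(windows)):
        for i in range(j):
            if overlaps(windows[i], windows[j]):
                labels[find(j)] = find(i)        # union the two components
    return [find(i) for i in range(len(windows))]

# windows collected from all sites (toy example)
W = [(np.array([0., 0.]), np.array([1., 1.])),
     (np.array([1.5, 0.]), np.array([1., 1.])),  # overlaps the first window
     (np.array([8., 8.]), np.array([1., 1.]))]
print(merge_windows(W))   # the first two windows share a label, the third does not
```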
3.2 Distributed clustering over a fast communication network

Assuming that the sites involved in the distributed environment are connected through a fast network infrastructure, the algorithm can be modified to distribute the computational cost without imposing any restriction on its efficiency. More specifically, it is possible to distribute the computational effort of the k-windows algorithm by parallelizing only the range queries. In detail, assume again that m computer nodes are available, each one holding a portion Vi of the dataset, where i = 1, . . . , m. Firstly, at each node i a multidimensional binary tree [38] Ti is constructed, which stores the points of the set Vi. Then the parallel search for a range query Q is performed as follows:
Parallel range search procedure
  set A ← ∅
  for each node i do in parallel
    set Ai ← ∅
    find the points from the local database that are included in Q
    insert the recovered points in Ai
    send Ai to the server node
  set A ← A ∪ {A1, . . . , Am}

At a preprocessing step the algorithm constructs a multidimensional binary tree at each node, holding the data known only to that node. Then a server node is used to execute the k-windows algorithm. From that point onwards, the algorithm continues to work as in the original version. When a range search is to be executed, the server executes the range query over all the nodes and computes the union of the results. To analyze the algorithm's complexity, we assume that the multidimensional binary tree is used as the data structure [38]. Then, the algorithmic complexity of the preprocessing step for n points in d dimensions is reduced to Θ((dn log n)/m) from the Θ(dn log n) of the single node version. Furthermore, the storage requirements at each node come to Θ(dn/m), while for the single node they remain Θ(dn). Since the orthogonal range search algorithm has a complexity of O(d n^(1−1/d) + s) [38], the parallel orthogonal range search algorithm has a complexity of O(d (n/m)^(1−1/d) + s + C(d, m)), where s is the total number of points included in the range search and C(d, m) is a function that represents the time required for the communication between the master and the nodes. It should be noted that the only information that needs to be transmitted from each slave is the number of points found and their mean value as a d-dimensional vector. So the total communication comes to a broadcast message from the server about the range, and m messages of an integer and a d-dimensional vector, one from each slave. Taking these parameters into consideration, C(d, m) can be computed for a specific network interface and a specified number of nodes. For the parallel algorithm to achieve an execution time speedup the following relation must hold:

O( (d (n/m)^(1−1/d) + s + C(d, m)) / (d n^(1−1/d) + s) ) ≤ 1,

which comes to [46]:

O(C(d, m)) ≤ O( d ( n^(1−1/d) − (n/m)^(1−1/d) ) ).

As long as the above inequality holds, the parallel version of the algorithm is faster than the single node version. In all other cases the network infrastructure presents a bottleneck to the system. In that case, the advantage of the parallel version is limited to its storage space requirements.
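A minimal single-machine sketch of this master/worker range search is shown below, using Python's multiprocessing pool in place of PVM workers. The chunking of the data, the query format and the worker function are illustrative assumptions: each worker answers the same box query on its own portion of the data, and the "server" takes the union of the partial answers.

```python
import numpy as np
from multiprocessing import Pool

def local_range_query(args):
    """Return the local points that fall inside the axis-aligned query box."""
    chunk, low, high = args
    mask = np.all((chunk >= low) & (chunk <= high), axis=1)
    return chunk[mask]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.uniform(0, 10, size=(20000, 3))
    m = 4                                        # number of worker nodes ("sites")
    chunks = np.array_split(X, m)                # each worker's local portion Vi
    low, high = np.array([2., 2., 2.]), np.array([4., 4., 4.])

    with Pool(m) as pool:                        # broadcast the query, gather the Ai
        parts = pool.map(local_range_query, [(c, low, high) for c in chunks])
    A = np.vstack(parts)                         # union of the partial answers
    print(len(A), "points inside the query box")
```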
4 Clustering on dynamic databases

Most clustering algorithms rely on the assumption that the input data constitute a random sample drawn from a stationary distribution. As data is collected over time, the underlying process that generates it can change. In a non–stationary environment new data are inserted and existing data are deleted. Cluster maintenance deals with the issues of when to update the clustering result and how to achieve this at a low computational cost. In the literature there are few maintenance algorithms, most of which are developed for growing databases. The application domain of these algorithms includes database re–organization [59], web usage user profiling [34], as well as document clustering [57]. From the broader field of data mining, a technique for maintaining association rules in databases that undergo insertions and deletions has been developed in [17]. A generalization algorithm for incremental summarization has been proposed in [21]. An incremental document clustering algorithm that attempts to maintain clusters of small diameter as new points are inserted in the database has been proposed in [14]. Another on-line algorithm for document clustering, the star algorithm, has been proposed in [6]. A desirable feature of the latter algorithm is that it imposes no constraints on the number of clusters. An incremental extension to the GDBSCAN algorithm [43] has been proposed in [20]. Using a similar technique, an incremental version of the OPTICS algorithm [5] has been proposed in [27]. The speedup achieved by this incremental algorithm [27] is significantly lower than that of [20]. This is attributed to the higher complexity of OPTICS, but the authors of [27] claim that the incremental version of OPTICS is suitable for a broader range of applications. In the following paragraphs we present an extension of the unsupervised k-windows clustering algorithm [51, 53] that can efficiently mine clustering rules from databases that undergo insertion and deletion operations over time. The proposed extension incorporates the Bkd-tree structure [39]. The Bkd-tree can efficiently index objects under a significant load of updates, and also provides a mechanism that determines the timing of the updates.

4.1 The Bkd-tree

Considering databases that undergo a significant load of updates, the problem of indexing the data arises. In detail, an efficient index should be characterized by high space utilization and small processing time of queries under a continuous updating process. Moreover, the processing of the updates must be fast. To this end, we employ the Bkd-tree structure proposed in [39], which maintains its high space utilization and excellent query and update performance regardless of the number of updates performed. The Bkd-tree is based on a well-known extension of the kd-tree (called the K-D-B-tree [42]) and on the so-called logarithmic method for making a static structure dynamic. Extensive experimental studies [39] have shown that
the Bkd-tree is able to achieve almost 100% space utilization and also the fast query processing of a static K-D-B-tree. However, unlike the K-D-B-tree, these properties are maintained under a massive load of updates. Instead of maintaining one tree and dynamically re-balancing it after each insertion, the Bkd-tree structure maintains a set of log_2(n/M) static K-D-B-trees, and updates are performed by rebuilding a carefully chosen set of structures at regular intervals (M stands for the capacity of the memory buffer, in terms of number of points). To answer a range query using the Bkd-tree, all the log_2(n/M) trees have to be queried. Despite this fact, the worst–case behavior of the query time is still of the order O(d n^(1−1/d) + s) (s is the number of retrieved points). Using an optimal O(n log_M n) bulk loading algorithm, an insertion is performed in O(log_M(n) log_2(n/M)). A deletion operation is executed by simply querying each of the trees to find the tree Ti containing the point and deleting it from Ti. Since there are at most log_2(n/M) trees, the number of operations performed by a deletion is O(log(n) log_2(n/M)) [39]. Insertions are handled completely differently. Most insertions ((M − 1) out of M consecutive ones) take place on the T0 tree structure. Whenever T0 reaches the maximum number of points it can store (M points), the smallest j is found such that Tj is an empty kd-tree. Then all points from T0 and from the trees Ti with 0 < i < j are extracted and bulk loaded into the Tj structure. In other words, points are inserted in the T0 structure and periodically reorganized towards larger kd-trees by merging small kd-trees into one large kd-tree. The larger the kd-tree, the less frequently it needs to be reorganized. Extensive experimentation [39] has shown that the range query performance of the Bkd-tree is on par with that of existing data structures. Thus, without sacrificing range query performance, the Bkd-tree makes significant improvements in insertion performance and space utilization; insertions are up to 100 times faster than K-D-B-tree insertions and space utilization is close to a perfect 100%, even under a massive load of insertions.

4.2 Unsupervised k-windows on dynamic databases

The proposed dynamic version of the unsupervised k–windows algorithm is based on the utilization of the Bkd–tree data organization structure. The Bkd–tree primarily enables the fast processing of range queries, and secondly provides a criterion for the timing of the update operations on the clustering result. The following schema outlines the dynamic algorithm:

(a) Assume an execution of the algorithm on the initial database has been performed, yielding a set of windows that describe the clustering result.
(b) At specified periods execute the following steps:
    (1) Treatment of insertion operations.
    (2) Treatment of deletion operations.
(c) After each of the above steps is completed, update the set of windows.
This schema describes a dynamic algorithm that is able to adapt a clustering model to the changes in the database. In the following paragraphs, after analyzing the workings of the algorithm for each possible update operation, a high level description of the overall procedure is presented.

Treatment of insertions:
Insertions are the first type of update operation in a dynamic environment. Throughout this paragraph it is assumed that the static unsupervised k-windows algorithm has been applied on the initial database, producing a set of windows that describe the clustering result. As insertion operations take place, the T0 structure of the Bkd–tree reaches the maximum number of points it can store (M). At this point a number of windows are initialized over these points. Subsequently, the movement and enlargement procedures of the unsupervised k–windows algorithm are applied on these windows just as in the static case. When the movement and enlargement of the new windows terminate, they are considered for similarity and merging with all the existing windows. Thus the algorithm is able to retain only the most representative windows, thereby restraining the clustering result to a relatively small size. An example of this procedure is demonstrated in Fig. 4. The filled circles represent the initial points while the empty circles represent the inserted points. The W1 and W2 windows are assumed to have been finalized from the initial run of the algorithm. On the other hand, windows W3 and W4 are initialized over the inserted points (empty circles). After the completion of the movement and enlargement operations for W3 and W4, they are considered for similarity and merging. This step yields that windows W1 and W3 belong to the same cluster, since they satisfy the merge operation, while window W2 is ignored as it satisfies the similarity operation with window W4.
Fig. 4. The application of the k-windows algorithm over the inserted points.
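The moment at which this insertion-handling step runs is dictated by the Bkd-tree's buffer, as described in Section 4.1. The following is a minimal sketch of that logarithmic update scheme: points accumulate in a buffer T0 of capacity M, and on overflow the buffer and all smaller static structures are rebuilt into the first empty slot. The plain sorted list standing in for a bulk-loaded kd-tree, the parameter values and the print statement marking where the dynamic k-windows step would be triggered are all illustrative assumptions.

```python
class LogarithmicIndex:
    """Toy version of the Bkd-tree update scheme (buffer T0 plus static trees)."""

    def __init__(self, M=4):
        self.M = M            # capacity of the in-memory buffer T0
        self.buffer = []      # recently inserted points (the T0 structure)
        self.trees = []       # trees[j] stands for T_{j+1}, holding M * 2**j points

    def insert(self, point):
        self.buffer.append(point)
        if len(self.buffer) == self.M:
            self._reorganize()

    def _reorganize(self):
        # find the smallest j such that the slot T_{j+1} is empty
        j = 0
        while j < len(self.trees) and self.trees[j] is not None:
            j += 1
        if j == len(self.trees):
            self.trees.append(None)
        # extract the buffer and all smaller structures, bulk load them into T_{j+1}
        points = self.buffer + [p for t in self.trees[:j] if t is not None for p in t]
        self.buffer = []
        for i in range(j):
            self.trees[i] = None
        self.trees[j] = sorted(points)   # stand-in for kd-tree bulk loading
        # this is the moment the dynamic k-windows insertion step would be triggered
        print(f"rebuilt T{j + 1} with {len(points)} points")

idx = LogarithmicIndex(M=4)
for p in range(13):
    idx.insert(p)
# the buffer overflows three times: T1 (4 points), T2 (8 points), then T1 again
```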
Treatment of deletions:
The deletion update operations are addressed by maintaining a second Bkd-tree structure. Each time a point is deleted, it is removed from the main data
structure and is inserted in the second Bkd-tree. A number of windows are initialized over the points of the second Bkd-tree when it reaches its maximum size. These windows are subjected to the movement and enlargement procedures of the k-windows algorithm, which operate only on the second data structure. When these operations terminate, the windows are considered for similarity with the windows that have already been processed. If a processed window is found to be similar to a window that contains deleted points, the former window is ignored. If the ignored window contained a large number of points, new windows are initialized over these points and they are processed as new windows. After this procedure terminates, the Bkd-tree that stores the deleted points is emptied. An example of the deletion process is illustrated in Fig. 5. The filled circles represent the points that remain in the database, while the empty circles represent the deleted points. Window W1 is assumed to have been finalized from the initial run of the algorithm. Windows W2 and W3 are initialized over the deleted points (empty circles). After movement and enlargement, they are considered for similarity with the initial window, W1. Window W1 satisfies the similarity condition with W2 and W3 and is thus ignored. Since window W1 contained a large number of points, four windows are initialized over these points (Fig. 5(b)). The movement and enlargement operations on these yield windows W4, W5, W6 and W7. These windows are considered for merging and similarity. Windows W4 and W7 satisfy the similarity operation and thus window W7 is ignored. Windows W5 and W6 satisfy the merge operation and are thus considered to enclose points belonging to the same cluster.
Fig. 5. (a) The application of the k-windows algorithm over the deleted points. (b) The application of the k-windows algorithm over the non-deleted points contained in initial window W1.
By performing the clustering operation on the deleted points, the dynamic algorithm aims at identifying the windows of the initial result that need to be re-organized. Thus, the speedup that can be achieved depends not only on the size of the updates, but also on the change they impose on the clustering result.
Proposed algorithm:
Based on the procedures previously described, we propose the following high level algorithmic scheme:

Dynamic unsupervised k-windows
  Set {the input parameters of the k-windows algorithm}.
  Initialize an empty set W of d–ranges.
  Each time the T0 tree of the Bkd-tree structure is full:
    Initialize a set I of k d–ranges over the T0 tree.
    Perform movements and enlargements of the d–ranges in I.
    Update W to contain the resulting d–ranges.
    Perform merging and similarity operations on the d–ranges in W.
  If a large enough number of deletions has been performed:
    Initialize a set D of k d–ranges over the deleted points.
    Apply k-windows on the d–ranges in D.
    If any windows in D satisfy the similarity condition with windows in W:
      then delete those windows from W, and
      if the deleted windows from W contained any non-deleted points,
      apply k-windows over them.
  Report the groups of d–ranges that comprise the final clusters.

The execution of the above schema is crucially affected by the size, M, of the T0 component of the Bkd-tree structure. The value of this parameter determines the timing of the update operations on the database [39], which in turn trigger the update of the clustering result. Therefore its value must be set according to the available computational power, the desired update intervals of the clustering result, as well as the size of the application at hand.
5 Unsupervised clustering using fractal dimension

A plain examination of the objects that surround us makes it evident that most of them are very complex and erratic in nature [36, 44]. Mandelbrot [32], by introducing the concept of the "fractal", was the first to try to address the need for a model with the ability to describe such erratic behavior. A set is called fractal if its Hausdorff-Besicovitch dimension is strictly greater than its topological dimension. A characteristic of a fractal set is its fractal dimension, which measures its complexity. The box counting method [29] is an established approach to compute the fractal dimension of a set. In detail, for a set of n points in R^d and a partition of the space into grid cells of length lb, the fractal dimension Db is given by:

Db = − lim_{lb → 0} ( log10 nb(lb) / log10 lb ),
where nb(lb) represents the number of cells occupied by at least one point. Db corresponds to the slope of the plot of log10 nb(lb) versus log10 lb. The fractal dimension has been utilized for clustering purposes in the past. A grid based clustering algorithm that uses the fractal dimension to cluster datasets has been proposed by Barbará and Chen [7]. The algorithm uses a heuristic at the initialization stage to form the initial clusters and then incrementally adds points to a cluster as long as the fractal dimension remains constant. Another approach for two dimensions has been proposed by Prasad et al. [37]. Both algorithms require the user to provide an a priori estimation of the number of clusters present in the dataset. Next, we present a modification of the unsupervised k-windows clustering algorithm that guides the procedures of movement, enlargement and merging using the fractal dimension of the points included in the window [52]. In detail, the movement and enlargement of a window is considered valid only if the associated change of the fractal dimension is not significant. It is also possible to guide the merging procedure using the fractal dimension, by allowing two windows to merge only if their estimated fractal dimensions are almost equal. Thus, the merging of windows that capture regions of a cluster with different fractal dimension is discouraged. Such clusters appear in datasets where the density of points in the neighborhood of the cluster center is significantly higher than that of areas located further away from the center. Thus, the algorithm discovers the cluster center more efficiently and, moreover, it identifies regions with qualitative differences within a single cluster. Consider for example the case exhibited in Fig. 6. The enlargement and movement procedures restrain window W3 from enclosing the right part of the cluster since the fractal dimension of this region is much higher (see Fig. 6(b)). Similarly, window W4 is restrained from capturing the left part of the cluster. The proposed modification of the algorithm also recognizes that, although the windows have many points in common (see Fig. 6(a)), the difference in the value of the fractal dimension between them is sufficiently large for them to be considered as two distinct regions of the same cluster.
Fig. 6. Clusters with regions of different density. The proposed algorithm is able to discover the different sections of the same clusters.
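A minimal sketch of the box-counting estimate of Db used in such validity checks is given below: the occupied grid cells nb(lb) are counted for several cell lengths lb and the slope of log10 nb(lb) against log10 lb is fitted by least squares. The choice of cell sizes, the grid origin and the two test sets are illustrative assumptions.

```python
import numpy as np

def box_counting_dimension(X, sizes):
    """Estimate D_b for the points X (n x d) using the given grid cell lengths l_b."""
    X = np.asarray(X, dtype=float)
    origin = X.min(axis=0)
    counts = []
    for lb in sizes:
        cells = np.floor((X - origin) / lb).astype(int)   # grid cell index of each point
        counts.append(len({tuple(c) for c in cells}))      # occupied cells n_b(l_b)
    # D_b is minus the slope of log10 n_b(l_b) versus log10 l_b
    slope = np.polyfit(np.log10(sizes), np.log10(counts), 1)[0]
    return -slope

rng = np.random.default_rng(0)
line = np.column_stack([np.linspace(0, 1, 20000)] * 2)   # points on a line in the plane
square = rng.uniform(0, 1, size=(20000, 2))              # points filling the unit square
sizes = np.array([0.2, 0.1, 0.05, 0.025, 0.0125])
print(box_counting_dimension(line, sizes))    # close to 1
print(box_counting_dimension(square, sizes))  # close to 2
```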
6 Presentation of experiments

To evaluate the results of the unsupervised k-windows clustering algorithm we employ artificial datasets as well as real world ones. The first two datasets, Dset1 and Dset2, are 2-dimensional, containing 1600 and 10000 points respectively. In Dset1 the points are organized in 4 different clusters of different sizes. On the other hand, Dset2 contains 100 clusters of the same size (1000 points each). The centers of the clusters for this dataset are aligned over a grid in [10, 200]^2, and the corresponding points are drawn from a normal distribution with standard deviation along each dimension a random number between 1 and 2. The algorithm was applied on these two datasets with 12 and 256 initial windows respectively. The values of the parameters {θe, θm, θc, θv} were set to {0.8, 0.1, 0.2, 0.02} in both cases. k-windows was able to identify all the clusters correctly in both datasets. The datasets, as well as the results obtained, are illustrated in Fig. 7.
Fig. 7. (a) Results of the k-windows algorithm for Dset1. (b) Results of the k-windows algorithm for Dset2.
The next dataset, Dset3, is 3–dimensional and is generated by uniformly sampling 20 cluster centers in the [10, 200]^3 range. Around each cluster center 100 points are sampled from a normal distribution with standard deviation along each dimension a random number between 1 and 3. Dset4 is generated in a similar manner, but it lies in 50 dimensions. In both cases the algorithm initialized 128 windows, while all other parameters were assigned the same values as in the first two cases. In both cases the algorithm correctly identified the 20 clusters. This result is illustrated in Fig. 8. The final two artificial datasets, Dset5 and Dset6, consist of 319 and 3651 points respectively. Both of them contain 4 non-convex, irregularly shaped clusters with uniformly scattered points. The application of the k-windows algorithm on them, with the same parameter values and 32 and 256 initial windows, is exhibited in Fig. 9. From this figure it is obvious that the algorithm is also able to discover clusters of irregular shapes, as long as enough windows are initialized over the datasets.
Fig. 8. (a) The result of the k-windows algorithm for Dset3. (b) Results of the k-windows algorithm on a 3-dimensional projection of Dset4.
Fig. 9. (a) Results of the k-windows algorithm for Dset5. (b) Results of the k-windows algorithm for Dset6.
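A minimal sketch of how Gaussian artificial datasets of this kind can be generated is shown below: cluster centers are sampled in [10, 200]^d and points are drawn from normal distributions around them, with per-dimension standard deviations taken from a given range. The function name, seed and exact counts are illustrative assumptions, not the generators actually used for the datasets above.

```python
import numpy as np

def make_dataset(n_clusters, points_per_cluster, d, std_range=(1.0, 2.0), seed=0):
    """Sample cluster centers in [10, 200]^d and Gaussian points around them."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(10, 200, size=(n_clusters, d))
    X, y = [], []
    for k, c in enumerate(centers):
        stds = rng.uniform(*std_range, size=d)    # per-dimension standard deviation
        X.append(rng.normal(c, stds, size=(points_per_cluster, d)))
        y.append(np.full(points_per_cluster, k))
    return np.vstack(X), np.concatenate(y)

# e.g. a Dset3-like dataset: 20 clusters of 100 points in 3 dimensions, std in [1, 3]
X, y = make_dataset(20, 100, 3, std_range=(1.0, 3.0))
print(X.shape)   # (2000, 3)
```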
The real world dataset Dset7 was part of the KDD 1999 Cup data set [26]. This dataset was generated by the 1998 DARPA Intrusion Detection Evaluation Program that was prepared and managed by MIT Lincoln Labs. The objective was to survey and evaluate research in intrusion detection. A standard set of data to be audited was provided, which includes a wide variety of intrusions simulated in a military network environment. The 1999 KDD intrusion detection contest uses a version of this dataset. For the purposes of this paper the first 100000 records of the KDD 1999 Cup training dataset were used. This part of the data set contains 77888 patterns of normal connections and 22112 patterns of denial of service (DoS) attacks. Of the 42 features, the 37 numeric ones were selected. When the algorithm is applied over this dataset with 16 initial windows, it results in seven clusters, one of which contains 22087 DoS patterns. The other six clusters contain normal patterns exclusively, with the exception of one cluster that also contains 24 DoS patterns. These results point out that the discovered clusters are meaningful, and thus the clustering result can be considered accurate. To test the efficiency of the distributed k-windows clustering algorithm for minimal communication environments, described in Subsection 3.1, we resort to experiments that produce results that can be readily visualized. Thus, a two
dimensional dataset Dsetd consisting of 10241 points was constructed. This dataset contains 5 clusters of different sizes, and a number of outlier points, some of which connect two clusters. At a next step the dataset was randomly permuted and was distributed over 4, 8 and 16 sites. The k-windows clustering algorithm was applied to this dataset with 256, 64, 32 and 16 initial windows for the 1, 4, 8 and 16 sites respectively. The dataset, along with the clustering result when the whole dataset resides in one single site and when it is distributed over 4 sites, is illustrated in Fig. 10. In Fig. 11 the results of the algorithm for 8 and 16 sites, respectively, are exhibited. As is obvious from the figures, the results are correct in all three cases. It should be noted that for the cases of 8 and 16 sites a different extra cluster is identified by the algorithm, but it is not considered important since in both cases it holds a small number of points and does not affect the correct identification of the 5 main clusters. The next experimental results involve the measurement of the speedup that can be achieved through the distributed k-windows algorithm over a fast communication network, described in Subsection 3.2. To this end, we employ the PVM parallel programming interface.
Fig. 10. (a) Results of the k-windows algorithm for Dsetd. (b) Results of the k-windows algorithm for Dsetd for 4 sites and 64 initial windows per site.
Fig. 11. (a) Results of the k-windows algorithm for Dsetd for 8 sites and 32 initial windows per site. (b) Results of the k-windows algorithm for Dsetd for 16 sites and 16 initial windows per site.
To this end, we employ the PVM parallel programming interface. PVM was selected among its competitors because any algorithmic implementation is quite simple, since it does not require any special knowledge apart from the usage of its functions and setting up the PVM process on all personal computers. Thus, the k-windows clustering algorithm was developed under the Linux operating system using the C++ programming language and its PVM extensions. The hardware used for our purposes consisted of 16 Pentium III 900MHz personal computers with 32MB of RAM and 4GB of hard disk space each. A Pentium 4 1.8GHz personal computer with 256MB of RAM and 20GB of hard disk space was used as the server for the algorithm. The nodes were connected through a Fast Ethernet 100MBit/s network switch. Furthermore, we constructed an artificial dataset Dsetp using a mixture of Gaussian random distributions. The dataset contained 21000 points with 50 numerical attributes. As exhibited in Fig. 12, for this dataset the algorithm achieves an almost 9 times smaller running time when using 16 CPUs. On the other hand, at every node only 1/16 of the total storage space is required. From Fig. 12, we also observe an abrupt slow-down in speedup when moving from 8 to 16 nodes. This behavior is due to the larger number of messages that must be exchanged during the operation of the algorithm, which results in increased network utilization.
Fig. 12. Speedup for different numbers of CPUs.
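For completeness, a small helper of ours (not part of the chapter) shows how measured wall-clock times can be turned into speedup and efficiency values of the kind plotted in Fig. 12; the timings below are hypothetical, not the authors' measurements.

def speedup_table(times):
    """times: dict mapping number of nodes -> wall-clock time in seconds."""
    t1 = times[1]
    return {p: {"speedup": t1 / tp, "efficiency": t1 / (p * tp)}
            for p, tp in sorted(times.items())}

# Hypothetical timings, for illustration only:
print(speedup_table({1: 160.0, 2: 84.0, 4: 45.0, 8: 24.0, 16: 18.0}))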
To evaluate the performance of the proposed dynamic k-windows algorithm we will use Dset1. As previously mentioned, this dataset contains 1600 points organized in four clusters of different sizes. The dataset was split into four parts, each containing 400 points. The parts of the dataset were gradually presented to the algorithm. Each time a part was presented, the algorithm initialized a set of 32 windows over the new points. These windows were processed through the algorithmic procedure described above, and the clustering results for each step are presented in Fig. 13. In detail, in Fig. 13(a) and Fig. 13(b), seven clusters are identified, an outcome that appears reasonable by visual inspection. In Fig. 13(c) the algorithm detects
five clusters by correctly identifying the top right and bottom left clusters. The bottom right cluster is still divided into two clusters. Finally, in Fig. 13(d) all the clusters are correctly identified by seven windows.
Fig. 13. Applying k-windows to four consecutive instances of Dset1.
For comparative purposes with the work of [27, 43], we also calculated the speedup achieved by the dynamic version of the algorithm. To this end, we constructed a 10-dimensional dataset, Dsetonline, by uniformly sampling 100 cluster centers in the [10, 200]^10 range. Around each cluster center 1000 points were sampled from a normal distribution whose standard deviation along each dimension is a random number in the interval [1, 3]. To measure the speedup we computed the CPU time that the static algorithm requires when it is re-executed over the updated database with respect to the CPU time consumed by the dynamic version. The results are exhibited in Figs. 14 and 15. For the insertion case (Fig. 14) the dynamic version manages to achieve a speedup factor of 906.96 when 100 insertion operations occur in a database of original size 90000. For a larger number of insertion operations (1000), the speedup obtained, although smaller (92.414), appears to be analogous to the ratio of the number of updates to the total size of the database. For the case of deletions (Fig. 15) the speedup factors obtained are larger. For example, when the size of the database is 900100 the speedup reaches 2445.23 and 148.934 for 100 and 1000 random deletions, respectively.
Fig. 14. Speedup achieved by the dynamic algorithm for insertion operations for Dsetonline.
Fig. 15. Speedup achieved by the dynamic algorithm for deletion operations for Dsetonline.
It is important to note that in the case of deletions the speedup does not increase monotonically with the difference between the size of the database and the number of updates, because the time complexity of the algorithm also depends on the impact the deletions have on the clustering result. The final benchmark problem considered, Dseteq, is a two-dimensional dataset of the longitudes and latitudes of the earthquakes with a magnitude greater than 4 on the Richter scale that occurred in Greece in the period 1983 to 2003. The dataset was obtained from the Institute of Geodynamics of the National Observatory of Athens [12].
Fig. 16. (a) Results of the k-windows algorithm for Dseteq. (b) Results of the k-windows algorithm using fractal dimension for Dseteq.
This dataset is employed not to obtain further insight with respect to the earthquake phenomenon,
but rather to study the applicability of the proposed algorithm to a real world dataset. Fig. 16(a) illustrates the results of the unsupervised k-windows algorithm for Dseteq. Fig. 16(b) illustrates the results, on the same dataset, of the k-windows algorithm that uses the fractal dimension. The modified algorithm separates regions characterized by a different fractal dimension that were assigned to a single cluster by the original algorithm. In Fig. 16(b) characteristic examples of clusters that were separated by the modified algorithm are enclosed in black squares. This is a preliminary investigation of the application of clustering algorithms to earthquake data. An exhaustive investigation requires the inclusion of additional parameters like magnitude, depth, and time.
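The fractal dimension used by the modified algorithm can be estimated with a box-counting procedure in the spirit of [29]. The sketch below is a hedged illustration of the idea on a generic 2-D point set (such as a window of earthquake epicentres), not the authors' implementation.

import numpy as np

def box_counting_dimension(points, n_scales=8):
    pts = np.asarray(points, dtype=float)
    pts = (pts - pts.min(axis=0)) / (np.ptp(pts, axis=0) + 1e-12)  # rescale to [0, 1]^d
    sizes = 2.0 ** -np.arange(1, n_scales + 1)                     # box edge lengths
    counts = []
    for s in sizes:
        boxes = np.floor(pts / s).astype(int)
        counts.append(len({tuple(b) for b in boxes}))              # number of occupied boxes
    # the slope of log N(s) versus log(1/s) estimates the fractal dimension
    slope, _ = np.polyfit(np.log(1.0 / sizes), np.log(counts), 1)
    return slope

rng = np.random.default_rng(0)
print(box_counting_dimension(rng.random((5000, 2))))  # close to 2 for a filled square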
7 Concluding remarks

The unsupervised k-windows algorithm is an iterative clustering technique that attempts to address efficiently the problem of determining the clusters present in a given dataset, as well as their number. Our experience indicates that the algorithm's performance appears to be robust. With the incorporation of computational geometry techniques the algorithm achieves a comparatively low time complexity. The algorithm has been successfully applied in numerous applications including bioinformatics [47, 48], medical diagnosis [31, 49], time series prediction [35] and web personalization [41]. Given that the development of efficient distributed clustering algorithms has attracted considerable attention in the past few years, the k-windows algorithm has been designed to be easily extended to distributed computing environments, taking into consideration privacy issues and very slow network connections [50]. For the same kind of distributed computing environments, but under the assumption of a high speed underlying network, a parallel version of the algorithm was investigated [4]. This version is able to achieve considerable speedup in execution time and at the same time attain a linear decrease in the storage space requirements with respect to the number of computer nodes used. For databases that undergo update operations, a technique was presented that is capable of tracking changes in the cluster model [51]. This technique incorporates a dynamic tree data structure (Bkd-tree) that maintains high space utilization, and excellent query and update performance, regardless of the number of updates performed. The experimental results suggest that the algorithm is able to identify the changes in the datasets considered by only updating its cluster model. Finally, a modification of the k-windows algorithm was presented that uses the fractal dimension of the underlying clusters in order to partition the dataset [52]. This approach enables the identification of regions with different fractal dimension even within a single cluster. The design and development of
algorithms that can detect clusters within clusters is particularly attractive in numerous applications where further qualitative information is valuable. Examples include time–series analysis, image analysis, medical applications, and signal processing.
References

1. P.K. Agarwal and C.M. Procopiuc. Exact and approximation algorithms for clustering (extended abstract). In Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 658–667, San Francisco, California, U.S.A., 1998.
2. M.S. Aldenderfer and R.K. Blashfield. Cluster Analysis, volume 44 of Quantitative Applications in the Social Sciences. SAGE Publications, London, 1984.
3. P. Alevizos. An algorithm for orthogonal range search in d ≥ 3 dimensions. In Proceedings of the 14th European Workshop on Computational Geometry. Barcelona, 1998.
4. P. Alevizos, D.K. Tasoulis, and M.N. Vrahatis. Parallelizing the unsupervised k-windows clustering algorithm. In R. Wyrzykowski, editor, Lecture Notes in Computer Science, volume 3019, pages 225–232. Springer-Verlag, 2004.
5. M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. In ACM SIGMOD Int. Conf. on Management of Data, pages 49–60, 1999.
6. J. Aslam, K. Pelekhov, and D. Rus. A practical clustering algorithm for static and dynamic information organization. In ACM-SIAM Symposium on Discrete Algorithms, pages 51–60, 1999.
7. D. Barbará and P. Chen. Using the fractal dimension to cluster datasets. In KDD, pages 260–264. ACM Press, 2000.
8. R.W. Becker and G.V. Lago. A global optimization algorithm. In Proceedings of the 8th Allerton Conference on Circuits and Systems Theory, pages 3–12, 1970.
9. N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R∗-tree: An efficient and robust access method for points and rectangles. In ACM SIGMOD Int. Conf. on Management of Data, pages 322–331, 1990.
10. J.L. Bentley and H.A. Maurer. Efficient worst-case data structures for range searching. Acta Informatica, 13:155–168, 1980.
11. F. Can. Incremental clustering for dynamic information processing. ACM Trans. Inf. Syst., 11(2):143–164, 1993.
12. Earthquake Catalogue. http://www.gein.noa.gr/services/cat.html, Institute of Geodynamics, National Observatory of Athens.
13. P.K. Chan and S.J. Stolfo. Sharing learned models among remote database partitions by local meta-learning. In Knowledge Discovery and Data Mining, pages 2–7, 1996.
14. M. Charikar, C. Chekuri, T. Feder, and R. Motwani. Incremental clustering and dynamic information retrieval. SIAM Journal on Computing, 33(6):1417–1440, 2004.
15. B. Chazelle. Filtering search: A new approach to query-answering. SIAM Journal on Computing, 15(3):703–724, 1986.
16. B. Chazelle and L.J. Guibas. Fractional cascading: II applications. Algorithmica, 1:163–191, 1986.
17. D.W.L. Cheung, S.D. Lee, and B. Kao. A general incremental technique for maintaining discovered association rules. In Database Systems for Advanced Applications, pages 185–194, 1997.
18. I.S. Dhillon and D.S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence, pages 245–260, 2000.
19. I.S. Dhillon and D.S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1/2):143–175, 2001.
20. M. Ester, H.-P. Kriegel, J. Sander, M. Wimmer, and X. Xu. Incremental clustering for mining in a data warehousing environment. In 24th Int. Conf. on Very Large Data Bases, pages 323–333. Morgan Kaufmann Publishers Inc., 1998.
21. M. Ester and R. Wittmann. Incremental generalization for mining in a data warehousing environment. In Proceedings of the 6th Int. Conf. on Extending Database Technology, pages 135–149. Springer-Verlag, 1998.
22. U.M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. Advances in Knowledge Discovery and Data Mining. MIT Press, 1996.
23. T. Feder and D.H. Greene. Optimal algorithm for approximate clustering. In 20th Annual ACM Sympos. Theory Comput., pages 434–444, 1988.
24. J.A. Hartigan and M.A. Wong. A k-means clustering algorithm. Applied Statistics, 28:100–108, 1979.
25. H. Kargupta, W. Huang, K. Sivakumar, and E.L. Johnson. Distributed clustering using collective principal component analysis. Knowledge and Information Systems, 3(4):422–448, 2001.
26. KDD Cup data. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, 1999.
27. H.-P. Kriegel, P. Kröger, and I. Gotlibovich. Incremental OPTICS: Efficient computation of updates in a hierarchical cluster ordering. In 5th Int. Conf. on Data Warehousing and Knowledge Discovery, 2003.
28. W. Lam and A.M. Segre. Distributed data mining of probabilistic knowledge. In Proceedings of the 17th Int. Conf. on Distributed Computing Systems, Washington, pages 178–185. IEEE Computer Society Press, 1997.
29. L.S. Liebovitch and T. Toth. A fast algorithm to determine fractal dimensions by box counting. Physics Letters, 141A(8), 1989.
30. C. Linnaeus. Clavis Classium in Systemate Phytologorum in Bibliotheca Botanica. Amsterdam, The Netherlands: Biblioteca Botanica, 1736.
31. G.D. Magoulas, V.P. Plagianakos, D.K. Tasoulis, and M.N. Vrahatis. Tumor detection in colonoscopy using the unsupervised k-windows clustering algorithm and neural networks. In Fourth European Symposium on "Biomedical Engineering", 2004.
32. B.B. Mandelbrot. The Fractal Geometry of Nature. Freeman, New York, 1983.
33. N. Megiddo and K.J. Supowit. On the complexity of some common geometric problems. SIAM Journal on Computing, 13:182–196, 1984.
34. O. Nasraoui and C. Rojas. From static to dynamic web usage mining: Towards scalable profiling and personalization with evolutionary computation. In Workshop on Information Technology, Rabat, Morocco, 2003.
35. N.G. Pavlidis, D.K. Tasoulis, and M.N. Vrahatis. Financial forecasting through unsupervised clustering and evolutionary trained neural networks. In Congress on Evolutionary Computation, pages 2314–2321, Canberra, Australia, 2003.
36. A.P. Pentland. Fractal-based description of natural scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):661–674, 1984.
37. M.G.P. Prasad, S. Dube, and K. Sridharan. An efficient fractals-based algorithm for clustering. In IEEE Region 10 Conference on Convergent Technologies For The Asia-Pacific, 2003.
38. F. Preparata and M. Shamos. Computational Geometry. Springer-Verlag, New York, Berlin, 1985.
39. O. Procopiuc, P.K. Agarwal, L. Arge, and J.S. Vitter. Bkd-tree: A dynamic scalable kd-tree. In T. Hadzilacos, Y. Manolopoulos, and J.F. Roddick, editors, Advances in Spatial and Temporal Databases, SSTD, volume 2750 of Lecture Notes in Computer Science, pages 46–65. Springer, 2003.
40. V. Ramasubramanian and K. Paliwal. Fast k-dimensional tree algorithms for nearest neighbor search with application to vector quantization encoding. IEEE Transactions on Signal Processing, 40(3):518–531, 1992.
41. M. Rigou, S. Sirmakessis, and A. Tsakalidis. A computational geometry approach to web personalization. In IEEE Int. Conf. on E-Commerce Technology (CEC'04), pages 377–380, San Diego, California, 2004.
42. J.T. Robinson. The K-D-B-tree: A search structure for large multidimensional dynamic indexes. In ACM SIGMOD Int. Conf. on Management of Data, pages 10–18, 1981.
43. J. Sander, M. Ester, H.-P. Kriegel, and X. Xu. Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery, 2(2):169–194, 1998.
44. N. Sarkar and B.B. Chaudhuri. An efficient approach to estimate fractal dimension of textural images. Pattern Recognition, 25(9):1035–1041, 1992.
45. S. Sirmakessis, editor. Text Mining and its Applications, volume 138 of Studies in Fuzziness and Soft Computing. Springer, 2004.
46. D.K. Tasoulis, P. Alevizos, B. Boutsinas, and M.N. Vrahatis. Parallel unsupervised k-windows: an efficient parallel clustering algorithm. In V. Malyshkin, editor, Lecture Notes in Computer Science, volume 2763, pages 336–344. Springer-Verlag, 2003.
47. D.K. Tasoulis, V.P. Plagianakos, and M.N. Vrahatis. Unsupervised cluster analysis in bioinformatics. In Fourth European Symposium on "Biomedical Engineering", 2004.
48. D.K. Tasoulis, V.P. Plagianakos, and M.N. Vrahatis. Unsupervised clustering of bioinformatics data. In European Symposium on Intelligent Technologies, Hybrid Systems and their implementation on Smart Adaptive Systems, Eunite, pages 47–53, 2004.
49. D.K. Tasoulis, L. Vladutu, V.P. Plagianakos, A. Bezerianos, and M.N. Vrahatis. On-line neural network training for automatic ischemia episode detection. In Leszek Rutkowski, Jörg H. Siekmann, Ryszard Tadeusiewicz, and Lotfi A. Zadeh, editors, Lecture Notes in Computer Science, volume 2070, pages 1062–1068. Springer-Verlag, 2003.
50. D.K. Tasoulis and M.N. Vrahatis. Unsupervised distributed clustering. In IASTED Int. Conf. on Parallel and Distributed Computing and Networks, pages 347–351. Innsbruck, Austria, 2004.
51. D.K. Tasoulis and M.N. Vrahatis. Unsupervised clustering on dynamic databases. Pattern Recognition Letters, 2005, in press.
52. D.K. Tasoulis and M.N. Vrahatis. Unsupervised clustering using fractal dimension. International Journal of Bifurcation and Chaos, 2005, in press.
53. D.K. Tasoulis and M.N. Vrahatis. Generalizing the k-windows clustering algorithm for metric spaces. Mathematical and Computer Modelling, 2005, in press.
54. A. Törn and A. Žilinskas. Global Optimization. Springer-Verlag, Berlin, 1989.
55. C. Tryon. Cluster Analysis. Ann Arbor, MI: Edward Brothers, 1939.
56. M.N. Vrahatis, B. Boutsinas, P. Alevizos, and G. Pavlides. The new k-windows algorithm for improving the k-means clustering algorithm. Journal of Complexity, 18:375–391, 2002.
57. P. Willett. Recent trends in hierarchic document clustering: a critical review. Inf. Process. Manage., 24(5):577–597, 1988.
58. X. Xu, J. Jäger, and H.-P. Kriegel. A fast parallel clustering algorithm for large spatial databases. Data Mining and Knowledge Discovery, 3:263–290, 1999.
59. C. Zou, B. Salzberg, and R. Ladin. Back to the future: Dynamic hierarchical clustering. In Int. Conf. on Data Engineering, pages 578–587. IEEE Computer Society, 1998.
Semiometric Approach, Qualitative Research and Text Mining Techniques for Modelling the Material Culture of Happiness

Furio Camillo, Melissa Tosi, and Tiziana Traldi

(Dipartimento di Scienze Statistiche, Università di Bologna) (Dipartimento di Scienze Statistiche, Università di Bologna) (Future Concept Lab, Milano/London)

Abstract. Drawing from a recent ethnographic research study on Happiness carried out in 8 European countries in 2003/4, Future Concept Lab will illustrate how the use of interactive digital material can be relevant to analysing qualitative and quantitative data in a participatory and creative manner. Our presentation will focus on the additional value of presenting data in an interactive and flexible way, by using a two-way insight matrix and a word mapping statistical technique called Semiometrie. In order to exemplify their usage, we will draw on a recent research study, "The Material Culture of Happiness", based on the collection and the analysis of photo diaries coming from Spain, France, England, Germany, Italy, The Netherlands, Finland and Russia.
1 Introduction¹

Possibly in relation to debates on the failure of capitalism to make people happy through wealth, much attention has recently focused on the idea that studies of Happiness can really provide new stimuli to public policy on how to direct future efforts. Future Concept Lab has been following these debates and decided to undertake a long-term research journey into the territory of Happiness. On the occasion of this NEMIS Conference, FCL will present the research "The Material Culture of Happiness", claiming that this new field of social research can make a positive impact not only on policy making but also on the world of consumption and on the private sector in general. Evidence will illustrate some key Happiness trends, suggesting how they can be employed in the making of a "happy marketing", a new perspective and a new sensitivity towards the consumer world.
¹ From "The Material Culture of Happiness" by Future Concept Lab – World Future Society 2004, in "Thinking Creatively in turbulent times", 2004, Ed. Howard F. Didsbury Jr, World Future Society – Bethesda, Maryland – U.S.A.
Any field of knowledge, from religion, psychology and economics to popular culture, has produced a vast literature on Happiness and on how to find the "magic formula" that increases individual and public well-being. Yet Happiness seems, for most of us, something impossible to grasp, something that is just transitory, intangible and certainly hardly possible to study along scientific lines. However, the last decade has witnessed a new generation of studies on Happiness from popular culture and, in particular, from the academic world, which has made Happiness a proper field of knowledge. Like other social sciences, Happiness is now "taught" in university programs and has its own journal, the Journal of Happiness Studies, edited by Ruut Veenhoven, professor of social conditions for human happiness at the Erasmus University in Rotterdam. Veenhoven believes that the conditions of Happiness can be systematically collected and analysed, and that Happiness can become a new fascinating discipline. Veenhoven argues that, as Happiness research takes off, it will be possible to establish risk levels, individualise therapies and give people proper "cures" in order to maximise well-being. As he claims, "We should be able to show what kind of lifestyle suits what kind of person" (4/10/03 New Scientist p. 42). This idea seems to be positively confirmed by some recent "bibles" of Happiness, like Stefan Klein's The Happiness Formula, in which Happiness is described as something achievable through learning and discipline. Klein argues that the secret is to teach your brain "how to see happiness" in our day-to-day life. He also states that Happiness has to be found individually and that "there is no one Happiness formula but as many as the living people on the earth". The Dalai Lama's The Art of Happiness follows the same idea. According to the Lama, Happiness is a discipline whose secrets need to be individually learnt and applied to life. Seligman himself, the father of positive psychology and author of the book The Construction of Happiness, holds courses on Happiness focusing on ways in which people can learn to see the positive side of our everyday life experiences. Interest in Happiness also comes from the biological field, in which some scholars, following a more positivist philosophy, contend that 50% of our Happiness is genetically determined: it is written in a person's DNA at birth. The hereditary nature of Happiness then mixes with our personality and, for the remaining 50%, people can positively influence the achievement of happy feelings and intentions (David Lykken, University of Minnesota). What seems even more important, however, is that out of the melting pot of the Happiness stories, experts and politicians have started taking this data into account, by virtue of a strong need to search for measures of well-being that depart from pure economic principles. Hamilton's recent book Growth Fetish (2003) can be seen as yet more evidence that the politics of today is in need of a change. Hamilton writes specifically of the unbridled pursuit of economic growth and its negative effects on society, effects so often ignored by political decision makers and mainstream economists who are unwilling to acknowledge the empirical evidence. One of the most significant observations, also
confirmed by many surveys and statistics, is that in industrialised nations average happiness has remained virtually constant since the end of the Second World War, despite the significant increase in the individual level of income. Economists like Richard Easterlin, Daniel Kahneman (Nobel 2002) and Clive Hamilton write on Happiness as a new perspective, a new way to raise the questions that are primarily relevant to people. A possible positive reaction to all the talk on Happiness comes from the British government, which has made a first attempt to take this data into consideration. The Cabinet Office has held a string of seminars on people's life satisfaction, which is not based on annual income only. The Prime Minister's Strategy Unit has published a paper recommending policies that might increase people's happiness (www.number-10.gov.uk/su/ls/paper.pdf). The focus is on the quality of the public service and the sense of public security. Happiness has also attracted the world of marketing and consumption. The underlying principle of the so-called "happy consumption" and "happy products" is not just a new form of hedonism; it can bring tangible quality and innovation to product making, using Happiness as a new research avenue. Future Concept Lab, as a research institute and as a group of sociologists and researchers who believe that Happiness studies can make a significant impact on the world of consumption and public service, has carried out the first part of a cross-cultural study called "The Material Culture of Happiness". This independent research study aims to hear people's voice on Happiness and identify the material basis that helps people to construct and reinforce their day-to-day well-being. Our study therefore concentrates on "tangible Happiness": its daily expressions, its artefacts, its objects, products, places, people, etc. On the occasion of NEMIS 2004, FCL will therefore illustrate this research program, which provides an insight into the day-to-day Happiness experienced in 8 European countries, namely Spain, France, UK, Italy, The Netherlands, Germany, Finland and Russia. The research focuses on the material forms, images and places that young adults (14–22) and mature adults (55–70) recognise as meaningful to the building of their day-to-day well-being. The research has been carried out by employing a cocktail of in-depth methodologies that combine psychological methods with qualitative fieldwork. Respondents have been asked to fill in a photo diary for a period of seven days, taking photos of the "objects of their Happiness": people, places, products, etc. The diaries have been followed by in-depth interviews with the respondents, on the basis of ad-hoc designed discussion guides that took into consideration people's cultural backgrounds, lifestage, and the content of the diary. The result is a collection of 1200 stories of Happiness reported in people's own words and through visuals, symbols and drawings. The outcome is the analysis of some key Happiness Trends divided by age, cultural difference and specific areas of interest like, for example, Domesticity, Leisure and Consumption, Daily Responsibility, relationship with the City, Nature, etc.
2 Text Mining of Stories: Discriminant Models for the Extension of Qualitative Classification Rules

Up to now our research has been run on a non-statistical sample, according to the qualitative techniques of text decoding. Even though the happiness stories number 1200, as usually happens in qualitative research, the extension of the results to a representative population is not possible. Essentially, no statistically representative sample of the material and of the collected information exists, because the approach to the problem is a qualitative one, and therefore the analysis of the collected texts is based on an accurate decoding project and on a classification of information which uses a one-to-one protocol reading of the texts. The following step is the composition of a classification of texts into grids of conceptual and semantic interpretation within logical categories of reference. We quote below, as an example, a happiness story written by a Finnish 14-year-old teenager.
"It's already late and I just watched a Bond film. The day has been exhausting. Early morning wake-up didn't inspire me at all. I washed my teeth with difficulty and went to my room. The feeling when I went to my bed and put a blanket on felt pleasant! That was really LOVELY! When you are really tired what could be better than your own bed. My brother and I share a room, so we have a "double-decker" bed. I sleep down and my brother up. We rarely change places, but I used to sleep up. At the moment it's me who stays up longer, so it's easier to sleep down. Bed and sleeping make me very happy, every day, but on other days it seems "really heavenly". For us the bed is not only for sleeping, but it also serves as a sofa. We have a sofa in our room as well, but normally everybody sits on my bed (depends on the fact in what condition the sofa is!). And normally it's totally covered with clothes, books and things like that. Fortunately my bed is tidy! As an adult I am going to get myself a nice big "yankee bed" or a big waterbed."
Referring to a cognitive reading of the text, in other words to a typical process of qualitative research, this story has been classified by FCL researchers in a grid with some interpretation keys. For example, the text quoted above has been classified through the following dichotomic variables:
1. an event assumed in a participatory nature rather than in an intimate one;
2. an event that belongs to spare time;
3. an event that belongs to the mobility sphere;
4. an event that doesn't belong to the duty sphere;
5. an event where there is the presence of the people factor (a person or a group of people), rather than its absence;
6. an event where there is the absence of the animal factor, rather than its presence;
7. an event where there is the absence of the object factor, rather than its presence;
8. an event where there is the absence of the space-habitat factor, rather than its presence.
Our text mining target is to create statistical models that predict, using textual information (exogenous variables), the classifying variables of a qualitative nature (target variables) related to the diaries' textual contents. The suggestion is to extend the cognitive interpretative model (the rules) used by the qualitative experts to a quantity of textual material so large that it can be considered as generated from a representative sample, in the traditional terms of statistical sampling theory. It is a long-standing problem: how to extend to a reference population the results of qualitative research, which, on the contrary, requires an accurate treatment of the collected informative materials. The automatic classification of texts into predefined categories is one of the most frequent problems faced by contemporary text mining, both in computational statistics and in Computer Science applied to automatic categorization software [10, 11]. The strategy we adopted uses a more traditional approach, less inclined towards automatic classification: factorial techniques reduce the variance of the collected texts, and a non-parametric discriminant model is estimated on the factor scores (input variables) of a lexical correspondence analysis applied to the stories-by-forms matrix. In the corpus reduction step we preferred to use a "light" semantic lemmatization criterion, because of the presence of many graphical forms with emotional meaning but with low frequency. As the qualitative classification rules try to schematize the emotional causes of happiness, the textual statistical model must capture the system of correspondences that arises between the more frequent words and the meaningful but less frequent forms. Table 1 shows the confusion matrices of the discriminant models by which the stories of the selected sample in Italy have been reclassified. The confusion matrix, classically, gives the rates of wrong and right reclassification of a discriminant model.
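A hedged sketch of this kind of pipeline is given below: stories are turned into a stories-by-forms frequency matrix, reduced to a few factor scores (a plain truncated SVD stands in here for the lexical correspondence analysis), and a non-parametric discriminant model (k-nearest neighbours) predicts one of the dichotomic qualitative variables. The texts and labels are placeholders, not FCL data.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

stories = ["going to bed after a long day felt lovely",
           "a walk in the park with friends made me happy",
           "playing football with my brother all afternoon",
           "reading alone at home with a cup of tea"] * 25
labels = [0, 1, 1, 0] * 25      # e.g. 1 = participatory event, 0 = intimate event

X_counts = CountVectorizer().fit_transform(stories)           # stories-by-forms matrix
scores = TruncatedSVD(n_components=3, random_state=0).fit_transform(X_counts)

X_tr, X_te, y_tr, y_te = train_test_split(scores, labels, random_state=0)
model = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print(confusion_matrix(y_te, model.predict(X_te)))            # reclassification rates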
3 CRM, Qualitative Research and Semiometrie for Positioning and Micromarketing

One of the most frequent fields of use of qualitative information is linked to the perception or self-perception of products, services and trademarks, either in comparison with competitor products or within a more general context of a motivational, behavioural and immaterial value-domain nature, in any case not a rational one, which forms the strategic background in which the company operates.
Table 1. Confusion matrices
Product positioning originates from a variety of decisions strictly connected to the selection of the market segments in which the company decides to compete. Essentially, the market positioning of a product or of a trademark consists in the perception that customers have of the product or of the trademark, taking into consideration the position of competing products or trademarks. The positioning decision, therefore, consists in determining the latent dimensions on which it is possible to build this perception in the market segments of reference, and on which it is possible to distinguish the particular product/service from that of the competition. This process requires, firstly, an understanding of customers' motivations and expectations. A positioning strategy is normally built on the product features, the benefits required by customers, the opportunities and modes of use, and the competitors' positioning. It is important to observe that, besides the tangible aspects (product technical features, price, etc.), intangible aspects can be considered too (trademark image and its prestige, consumer habits, etc.), related to an immaterial and irrational context that reflects everybody's motivational, behavioural and immaterial value-domain sphere. Research methods and analysis techniques usable for positioning strategies provide a representation of customers' perceptions in relation to the distance
between the different products/trademarks in the market. These kinds of representations are rendered graphically by the so-called "perception maps", built by means of statistical techniques of multivariate analysis, like discriminant analysis, multidimensional scaling and correspondence analysis [8]. Our presentation deals with micromarketing positioning, the marketing connected to customer relations: the so-called one-to-one marketing, almost personalized and no longer of a transactional nature. The 1990s, characterized in Western countries by a strong and persistent halt in demographic growth and by increases in average age, standard of living and exports, witnessed the birth and development of individual consumption models. Companies improved their market segmentation skills in order to identify profitable customer niches. This praxis model is based more and more on the statistical analysis of the data available for every customer; these analyses complement the instruments used for market segmentation. The placing of every customer in a niche can take place through statistical or psycho-sociographic considerations, for which it is necessary to gather information, as detailed as possible, about the single customer, his shopping behaviour, his household, his culture and the value system through which he interacts every day with society. Direct marketing campaigns, such as direct mail communication or the granting of fidelity cards, aim at collecting valuable information about single people. These kinds of initiatives allow the company to draw the profile of a personalized relation with the customer. The development of the "customer centric" trend has driven companies to maximize the profit from each single customer, establishing a relation with every one of them. This relation is set out in a series of interactions from customer to company and vice versa. If the interactions are well built and well managed by the company, each of them, on the one hand, will consolidate the corporate image and the trust that the customer has in the company and, on the other hand, will enhance the company's information about the customer. There is the possibility of setting off a virtuous circle that will foster the customer's loyalty and will give the company the opportunity of increasing its profit. But if the relation is badly managed, the damage to the corporate image will be really high. So marketing is quickly moving from a transactional model to a relational one (one-to-one marketing or relationship marketing), according to the exemplifying scheme in Table 1 (Peppers D., Rogers M., Dorf B., 2000). In this context, the use of a strong scheme for decoding qualitative information is absolutely essential, such as the one that comes from the stories about everyone's happiness, through comparison with the value system of contemporary Western society as a whole. We propose, therefore, to use a semiometric approach that allows the value-based classification of people to be subsequently projected onto the whole customer database, in a classic micromarketing perspective. But what is a semiometric approach, and what is Semiometrie? The formal definition is "a long list of words and thousands of people, in all Europe, are asked to give a mark (a score) more or less high depending on the agreeable or disagreeable characteristic of the single word" [7].
This definition is clearly the statement of a strict and elaborate experimental protocol, which describes the subject of many field studies, repeated in space and in time, by which information about the citizens of old Europe has been collected on a list of 210 words. The composition of this list is indeed the real initial value of the method. The words, in fact, have been selected through a long selection and assessment process, in order to represent, directly or indirectly, the main values of western society. As described in detail in the work of Lebart, Piron and Steiner, the lexicon of reference for the selection work has been derived from a very wide literature characterising the whole historical process of western thought and of its expression, using even the first five books of the Old Testament. The 210 words, selected and properly declined (noun rather than verb or adjective; absence of an article rather than a definite or indefinite article), allow the reconstruction of the psycho-cultural models that constitute the subconscious system of choice and of the identification of desires of European citizens. In a paper dated summer 2001, two researchers of TNS Media, the multinational marketing research institute that holds the world patent on Semiometrie, Richard Marks and Carine Evans, state the research objective: "the quest for Semiometrics is to go beyond the surface cliches and prejudices, to break through the barrier of consensus – to identify subconscious associations". Semiometrie, therefore, is based on the principle that words are not only signifiers of things: they refer to values and affections to which a single person or a group of people relates. The mere evocation of a word can cause agreeable or disagreeable feelings, depending on everyone's experiences and attitudes towards the value system of the historically defined social environment. The reference to the psychoanalytic theory of the interpretation of a text's meaning is explicit, and the commercial holders of Semiometrie do not hesitate to state it openly in every kind of operational presentation. Schematically, all the words submitted to more than 16000 European citizens in these last years share the following characteristics:
1. they are representative of the single values of western society;
2. they are so emotionally important as to provoke a reaction;
3. they are not consensual;
4. they are semantically stable in time.
Semiometrie is particularly useful in consumer profiling research and in studies directed at media planning, because it can reach the interviewee's subconscious desires [9]. In particular, Semiometrie can be an effective support in communication development, as it identifies the value field of the target of reference, giving an impressive landscape for the contextualization of all the value-related and, in general, qualitative information about subjects. As one of the goals of communication is to give the consumer a set of values in which he can identify himself, Semiometrie can describe the value profile of each section of the population: buyers of a product, readers of a newspaper, the audience of a TV channel, people who share certain opinions, loyal customers of a trademark.
The essential product of Semiometrie is a set of semiometric axes defined from a principal components analysis of the matrix of agreement-disagreement scores given by the interviewed people on the 210 evocative words. In particular, the six basic axes have been declined and interpreted, in order of importance, as follows:
1. methodological axis of participation;
2. duty-pleasure axis;
3. positive devotion to life-separation (pessimistic indifference) axis;
4. sublimation-materialism;
5. idealism-pragmatism;
6. humility-supremacy.
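The following minimal sketch (with simulated scores, not TNS data) illustrates how axes of this kind arise: a principal components analysis of the respondents-by-words matrix of agreement/disagreement marks yields the semiometric dimensions and the respondents' coordinates on them.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n_respondents, n_words = 500, 210
scores = rng.integers(-3, 4, size=(n_respondents, n_words)).astype(float)  # 7-point marks

pca = PCA(n_components=6)
axis_coords = pca.fit_transform(scores)   # respondent coordinates on the six axes
print(pca.explained_variance_ratio_)      # share of variance carried by each axis
print(axis_coords.shape)                  # (500, 6)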
For details about the interpretation and the method for evaluating the stability of the results, see the work of Lebart, Piron and Steiner. In the same work the authors discuss the problems concerning the possible comparisons between the results contained in Semiometrie and possible external studies that collect free answers about pleasant-unpleasant words, or different texts that can be discussed against the Semiometrie interpretative background. The essential idea of this kind of use on "external" texts consists in searching the external text for the semantic characteristics that most strongly connote the semiometric axes; the interpretation of the spatial analysis of the text can therefore exploit the big value contrasts of the middle-European citizen. In technical terms, this can be done by looking for assonances or dissonances between the factorial dimensions of the external text and the general Semiometrie axes. The suggested indexes, for example, are correlation coefficients between the different graphic forms of the external text and the factorial axes of the general Semiometrie space. In many research settings this corresponds to the supplementary projection of the external graphic forms onto the set of Semiometrie graphic forms. Table 2 shows that one of the textual dimensions (the second one) that emerges from the analysis of the diaries collected in Italy, Finland, Russia and Holland, and that is decisive in the discrimination process, is easily interpretable through two of the Semiometrie axes (the rank correlations are equal to −0.27 and 0.18).

Table 2. Rank correlation between axes 1–2 of our analysis (ns) and the 6 Semiometrie axes (axe)

             Axe1     Axe2     Axe3     Axe4     Axe5     Axe6
ns1          0.09    −0.07     0.13     0.05     0.04    −0.03
p-value      0.35     0.49     0.17     0.59     0.65     0.79
ns2          0.08    −0.27    −0.14     0.18     0.06    −0.14
p-value      0.4      0.005    0.14     0.05     0.53     0.16
Fig. 1. Lexical correspondence analysis of happiness stories and the semiometric dimensions
In particular, from top to bottom of the factorial map built on the matrix of graphical forms by FCL qualitative variables (Fig. 1), we move from pleasure to duty and from sublimation to materialism. Only words coming from both the happiness diaries and the Semiometrie vocabulary are written on the map.
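A sketch of how rank correlations of the kind reported in Table 2 can be computed is shown below; it assumes that, for a set of words shared by the two vocabularies, we hold their coordinates on our two textual axes (ns1, ns2) and on the six Semiometrie axes. The coordinate arrays here are simulated.

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)
n_words = 110                          # words shared by the two vocabularies
ns = rng.normal(size=(n_words, 2))     # coordinates on our axes ns1, ns2
axes = rng.normal(size=(n_words, 6))   # coordinates on Semiometrie axes 1..6

for i in range(2):
    for j in range(6):
        rho, p = spearmanr(ns[:, i], axes[:, j])
        print(f"ns{i + 1} vs axe{j + 1}: rho={rho:+.2f}  p={p:.3f}")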
4 The Suggested Strategy

The large-scale use of a kind of research like FCL's research on happiness gives companies that develop direct marketing campaigns the opportunity to incorporate qualitative information into the customer database. This process offers the opportunity of segmenting the customers in a very sophisticated way, in order to achieve a much more detailed evaluation of the redemption probability of a direct campaign. The ethnographic and qualitative research, therefore, can become one of the strong components of the set of exogenous variables of the different response models used to target, in a precise way, promotional and retention campaigns at customers. Qualitative research, therefore, becomes important not only for strategic (long-term) marketing but also for operative (short-term) marketing. The real problem is that the spatial and temporal robustness of these results cannot be evaluated in an exhaustive way: values and languages change over time; over space, languages are used to express different emotional concepts through words with similar meanings. For all these reasons, we are trying to find a structure common to our western society for text development that allows us to have a "landscape" (a psycho-linguistic model of word choice) for describing the emotions connected to happiness.
Fig. 2. Research Scheme
Fig. 3. The extension of the qualitative research to a representative (large) sample
The textual approach we need is not a multilanguage lexicon or a contextualised dictionary which contains "turns of phrase" or "turns of feeling" (as a qualitative analysis would require), but a textual approach based on a model that is robust over time and space. We suggest using Semiometrie as a bridge between mapped words and the unmapped ones of an external text (those which are not used in the collected stories). In this way we can apply an automatic classification model both to mapped and unmapped stories.
Fig. 4. The textual model assessment (over time and space)
Fig. 5. Semiometrie as a bridge
The question is: until when will we have to classify texts using qualitative techniques? If we go too far in time, the automatic classification model could break down. Finally, it is possible to build a model for estimating the coordinates on Semiometrie of the unmapped words and to use these coordinates to classify the stories into the qualitative classification designed by FCL.
Table 3. Estimated model
We can also classify a respondent (a customer), using his/her happiness story, into Semiometrie, and therefore also into a specific profile for a direct marketing campaign. Table 3 shows, as an example, an estimated model for the reconstruction of the second axis of Semiometrie, based on the vectorial decomposition of the happiness stories obtained through lexical correspondence analysis.
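A hedged sketch of the kind of model summarized in Table 3 follows: the coordinate of a story on the second Semiometrie axis is regressed on the factor scores obtained from the lexical correspondence analysis of the stories. Both the factor scores and the target axis are simulated here.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n_stories, n_factors = 1200, 10
F = rng.normal(size=(n_stories, n_factors))              # CA factor scores of the stories
beta = rng.normal(size=n_factors)
axe2 = F @ beta + rng.normal(scale=0.5, size=n_stories)  # simulated axis-2 coordinate

model = LinearRegression().fit(F, axe2)
print(model.coef_)                                       # estimated contribution of each factor
print(cross_val_score(LinearRegression(), F, axe2, cv=5, scoring="r2").mean())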
References

1. New Scientist, 4/10/2003, pp. 40–44.
2. Lama, D. 1988. The Art of Happiness (Coronet Hodder & Stoughton).
3. Hamilton, C. 2003. Growth Fetish (Pluto Press).
4. Seligman, 2000. The Construction of Happiness.
5. (www.number-10.gov.uk/su/ls/paper.pdf).
6. AAVV. "Thinking Creatively in turbulent times", 2004, Ed. Howard F. Didsbury Jr, World Future Society – Bethesda, Maryland – U.S.A.
7. Lebart L., Piron M., Steiner J.F. (2003) – La sémiométrie – Dunod, Paris.
8. Molteni L., Troilo G. (2003) – Ricerche di marketing – McGraw Hill.
9. Evans C., Marks R. (2001) – Probing the subconscious using Semiometrie – Admap.
10. Yang Y., Zhang J., Kisiel B. (2003) – A scalability analysis of classifiers in text categorization – Proceedings of SIGIR 2003 – Toronto, Canada.
11. Yang Y. (2000) – An evaluation of statistical approaches to text categorization – Kluwer Academic Publishers.
Semantic Distances for Sets of Senses and Applications in Word Sense Disambiguation

Dimitrios Mavroeidis, George Tsatsaronis, and Michalis Vazirgiannis

Department of Informatics, Athens University of Economics and Business, Athens, Greece {dmavr,gbt,mvazirg}@aueb.gr

Abstract. There has been an increasing interest both from the Information Retrieval community and the Data Mining community in investigating possible advantages of using Word Sense Disambiguation (WSD) for enhancing semantic information in the Information Retrieval and Data Mining process. Although contradictory results have been reported, there are strong indications that the use of WSD can contribute to the performance of IR and Data Mining algorithms. In this paper we propose two methods for calculating the semantic distance of a set of senses in a hierarchical thesaurus and utilize them for performing unsupervised WSD. Initial experiments have provided us with encouraging results.
1 Introduction

Towards improving the accuracy of the retrieval process, the information retrieval community has been investigating the possible advantages of using Word Sense Disambiguation (WSD) [1] for enhancing both the query and the content with semantics. In spite of the early discouraging results [2], recent studies have clearly indicated that WSD algorithms achieving an accuracy of 50–60% can improve significantly the precision of IR tasks [3, 4]. More precise experimental efforts [5] have even reported an absolute increase of 1.73% and a relative increase of 45.9% in precision whilst utilizing a supervised WSD algorithm that reported an accuracy of 62.1%. From the Data Mining community perspective, the process of applying WSD for improving clustering or classification results has produced contradictory results. In [6, 7] the results presented were negative, though probably because in [7] the WSD process applied did not assign a single sense to each word, but tackled all the possible senses for all the words, while in [6] the semantic relations, like the hypernym/hyponym relation, were not taken into account. In contrast, in [8, 9], a rich representation for senses was utilized,
that exploited the semantic relations between senses, as provided by WordNet [10]. Thus, there exist indications that the correct usage of senses can improve accuracy in Data Mining tasks. In general, a WSD process can be either supervised or unsupervised (or a combination of the two). Supervised WSD considers a pre-tagged text corpus that is used as a training set. The sense of a new keyword can then be inferred based on the hypothesis generated by the training set. A simple supervised learning algorithm is to calculate the frequencies of all the possible senses of a given keyword and assign the most probable sense (naïve Bayes classifier). In WordNet [10], the senses for each word are ranked according to a probability distribution found in a large text corpus; thus the assignment of the first sense, as provided by WordNet, to a keyword is equivalent to applying a naïve Bayes WSD algorithm. Although supervised approaches seem to outperform unsupervised ones, it can be argued that in specific domains the cost of constructing a training set for a WSD algorithm can be prohibitive, and thus for such domains an unsupervised WSD algorithm may be more appropriate. In this paper we propose an unsupervised WSD algorithm that utilizes a background hierarchical thesaurus (a given ontology describing the hypernym/hyponym relation of senses) and WordNet. Firstly, using WordNet, the set of all possible senses for a keyword is identified; then, the given ontology is utilized for identifying the correct sense of each keyword, using the notion of compactness. Compactness is used for measuring the level of semantic similarity of a set of senses, in order to choose the "best" set. Our approach follows the intuition that adjacent terms extracted from a given document are expected to be semantically close to each other. We present two methods for retrieving the semantic similarity of a set of senses using a hierarchical thesaurus. Our first approach is by means of computing a modification of the Steiner Tree [11] of a set of senses and their least common ancestor in the WordNet graph. The Tree is computed with the precondition that every terminal sense has a path to the least common ancestor. The compactness of the set of senses is computed based on the sum of the weights of the edges of this Tree. The second approach relies on a mapping scheme that maps a given ontology to a vector space; then the structure of the vector space is exploited in order to define a compactness measure by means of the centroid. This approach provides us with a geometrically interpretable compactness measure that evaluates the level of semantic similarity of a set of senses. It can be shown that there exist standard metrics in the ontology and the vector space such that the mapping is isometric. The compactness of a set of senses is computed by means of the sum of the distances of the vector senses to their centroid. The rest of the paper is organized as follows. Section 2 discusses the related work concerning unsupervised WSD methods that rely on concept hierarchies. Section 3 presents our first compactness measure that is based on the graph
structure of WordNet. Section 4 describes our second compactness measure, which relies on a mapping scheme from the ontology to a vector space. Section 5 discusses the experiments performed. Finally, Sect. 6 contains the concluding remarks and pointers to further work.
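The first-sense baseline mentioned above can be reproduced with NLTK's WordNet interface, since wn.synsets() returns the senses of a word ordered by their corpus frequency. The sketch below illustrates that baseline, not the compactness-based method proposed in this paper; it assumes nltk is installed and the wordnet data has been downloaded.

from nltk.corpus import wordnet as wn

def first_sense_baseline(keywords):
    """Assign to each keyword its most frequent WordNet noun sense."""
    tagged = {}
    for word in keywords:
        synsets = wn.synsets(word, pos=wn.NOUN)
        tagged[word] = synsets[0] if synsets else None
    return tagged

for word, sense in first_sense_baseline(["bank", "interest", "loan"]).items():
    print(word, "->", sense, "-", sense.definition() if sense else "no sense found")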
2 Related Work

The exploitation of WordNet as a concept hierarchy has constituted the base of many WSD algorithms, both supervised and unsupervised. In this section we shall briefly describe the relevant work done in unsupervised WSD algorithms utilizing WordNet as their concept hierarchy. Sussna [12] proposes a disambiguation algorithm, which assigns a sense to each noun in a window of context by minimizing a semantic distance function among their possible senses. The measure proposed is based on the assignment of weights to the edges in the WordNet noun hierarchy. For the weights computation, the is-a, has-part, is-a-part-of and antonym relations between the noun senses are considered. Furthermore, the higher the level of the WordNet hierarchy, the greater is the conceptual distance that a semantic link between two senses suggests. Thus, Sussna's algorithm rewards most strongly semantic links between senses lying low in the WordNet noun hierarchy, which is rational, since the lower the level in the WordNet hierarchy of a given link, the stronger the conceptual connection between the two linked specialized (due to their depth) senses. Besides the fact that this proposed method has combinatorial complexity due to the pair-wise computation of the semantic distance function for a given window of context, the conceptual density of the window's available senses is not computed as a whole, but as a sum of pair-wise semantic distances. Agirre and Rigau [13] introduce and apply a similarity measure based on conceptual density between noun senses. Their proposed measure is based on the is-a hierarchy in WordNet, and it measures the similarity between a target noun sense and the nouns in the surrounding context. For this purpose, they divide the WordNet noun is-a hierarchy into subhierarchies, where each possible sense of the ambiguous noun belongs to a subhierarchy. The conceptual density for each subhierarchy describes the amount of space occupied by the nouns that occur within the context of the ambiguous noun. This actually measures the degree of similarity between the context and the possible senses of the word. For each possible sense the measure returns the ratio of the area occupied by the subhierarchies of each of the context words within the subhierarchy of the sense to the total area occupied by the subhierarchy of the sense. The sense with the highest conceptual density is assigned to the target word. Banerjee and Pedersen [14] suggest an adaptation of the original Lesk algorithm in order to take advantage of the network of relations provided in WordNet. Rather than simply considering the glosses of the surrounding words in the sentence, the concept network of WordNet is exploited to allow
for glosses of word senses related to the words in the context to be compared as well. Essentially, the glosses of surrounding words in the text are expanded to include glosses of those words to which they are related through relations in WordNet. They also suggest a scoring scheme such that a match of n consecutive words in the glosses is weighted more heavily than a set of n one-word matches.

In order to clarify the notion of semantic distance, we will review the most popular semantic distances defined for ontologies. Before doing so, it is necessary to present the definition of path size in an ontology.

Definition 1 (Path size). Let O be an ontology and p = (v1, . . . , vn) be a path in the ontology from sense v1 to sense vn, defined by n vertices of the ontology. We define the size of the path, size(p), as the sum of the weights of all the edges contained in the path.

A common element of almost all the semantic similarity measures on IS-A relations is that the similarity of two concepts c1 and c2 depends on the size of the shortest path from c1 to c2 that goes through a common ancestor of c1 and c2. More precisely, the larger the size of the shortest path from c1 to c2, the larger the semantic distance. If the least common ancestor of c1 and c2, lca(c1, c2), exists, then the shortest path can be encoded as the path from c1 to lca(c1, c2) and from lca(c1, c2) to c2. The only exception is the Resnik measure, where the similarity between two concepts depends on the size of the path from the root of the ontology to the least common ancestor.

Resnik Measure [15]: The similarity of two concepts c1 and c2 that lie in an IS-A ontology is defined as:

Sim(c1, c2) = max{ IC(c) : c ∈ Supp(c1, c2) }
where IC(c) is the information content of concept c and Supp(c1, c2) represents all the concepts in the ontology that are more general than both c1 and c2.

Hirst-St-Onge Measure [16]: The strength of the relationship between two concepts c1 and c2 in an ontology is defined as:

Rel(c1, c2) = C − pathlength − k · d

where d is the number of changes in the direction of the path and C and k are constants.

Jiang-Conrath Measure [17]: The distance of two concepts c1 and c2 that lie in an IS-A ontology is defined as:

Dist(c1, c2) = IC(c1) + IC(c2) − 2 · IC(lca(c1, c2))

where IC(c) is the information content of a concept c and lca(c1, c2) is the lowest common ancestor of c1 and c2.
Leacock-Chodorow Measure [18]: The similarity of two concepts c1 and c2 that lie in an IS-A ontology is defined as:

Sim(c1, c2) = − log ( len(c1, c2) / (2D) )

where len is the length of the path that connects the two concepts and D denotes the maximum depth of the taxonomy.

Lin Similarity [19]: The similarity of two concepts c1 and c2 that lie in an IS-A ontology is defined as:

Sim(c1, c2) = 2 · IC(lca(c1, c2)) / (IC(c1) + IC(c2))

where lca(c1, c2) denotes the least common ancestor of concepts c1 and c2. We can easily observe that the common characteristic of these measures is that the semantic distance of two senses c1 and c2 depends on the size of the shortest path (through the least common ancestor) that connects c1 and c2 in the ontology, or on the size of the path that connects lca(c1, c2) to the root of the ontology.
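For readers who wish to experiment with these measures, several of them are available off the shelf. The sketch below is not part of the original study; it uses NLTK's WordNet interface, assuming the wordnet and wordnet_ic corpora have been downloaded (note that NLTK ships a later WordNet release than the 1.7.1 version used in the experiments of this paper):

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# Information content estimated from the Brown corpus (needed by the
# Resnik, Jiang-Conrath and Lin measures).
brown_ic = wordnet_ic.ic('ic-brown.dat')

c1, c2 = wn.synset('dog.n.01'), wn.synset('cat.n.01')

print(c1.lowest_common_hypernyms(c2))    # least common ancestor(s), lca(c1, c2)
print(c1.res_similarity(c2, brown_ic))   # Resnik
print(c1.jcn_similarity(c2, brown_ic))   # Jiang-Conrath (returned as a similarity)
print(c1.lch_similarity(c2))             # Leacock-Chodorow
print(c1.lin_similarity(c2, brown_ic))   # Lin
```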
3 Compactness Measure Based on Ontology Graph
As discussed in the introduction, we aim at computing the semantic similarity of a set of concepts. Semantic similarity between two concepts that reside in an ontology depends on the shortest path (through a least common ancestor) that connects the two concepts. Thus, it is natural to investigate possible extensions of the shortest path notion to a set of concepts. In order to define the compactness measure we will look into graph theoretic measures that account for the shortest path that connects a set of concepts. The Steiner Tree, defined as the tree with the smallest cost that contains a set of concepts, is the graph theoretic notion that we will utilize in order to define our compactness measure. More precisely, we will use a modification of the Steiner Tree that takes into account the nature of semantic similarities in an ontology. Recall that the distance between two concepts in an ontology is not defined as the shortest path that connects them in the ontology, but rather as the shortest path that goes through a common ancestor. Thus, it can be argued that two concepts are connected only through a common ancestor and not through any other path in the ontology. Consequently, it is natural to consider the computation of the Steiner Tree of the set of concepts together with their least common ancestor, such that each concept has one path to the least common ancestor. The existence of the least common ancestor (and of a path from every concept to the least common ancestor) guarantees that a path connecting all pairs of concepts (in the context discussed earlier) exists in the Steiner Tree.
We can now define the common distance of a set of concepts as the cost of the Steiner Tree of the concepts and their least common ancestor, such that each concept has one path to the least common ancestor. Under this definition, the most compact set of concepts is the set with the smallest common distance.
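As an illustration only (not the authors' implementation), the following sketch approximates this common distance for a set of WordNet noun senses: it takes the deepest hypernym shared by all senses as the least common ancestor and, assuming unit edge weights, sums each sense's hop count to that ancestor. A true modified Steiner Tree would merge edges shared by several senses, so this simple sum is an upper bound on the tree cost.

```python
from nltk.corpus import wordnet as wn

def ancestors(sense):
    # every hypernym of `sense`, including the sense itself
    return {h for path in sense.hypernym_paths() for h in path}

def lowest_common_ancestor(senses):
    common = set.intersection(*(ancestors(s) for s in senses))
    # deepest shared hypernym; None if the senses lie in disconnected hierarchies
    return max(common, key=lambda s: s.max_depth()) if common else None

def hops_to(sense, ancestor):
    # fewest hypernym edges from `sense` up to `ancestor`
    return min(len(path) - 1 - path.index(ancestor)
               for path in sense.hypernym_paths() if ancestor in path)

def compactness(senses):
    """Approximate common distance: sum of each sense's hops to the LCA."""
    lca = lowest_common_ancestor(senses)
    return float('inf') if lca is None else sum(hops_to(s, lca) for s in senses)

# A set of related senses should be more compact than an unrelated one.
print(compactness([wn.synset('dog.n.01'), wn.synset('cat.n.01')]))
print(compactness([wn.synset('dog.n.01'), wn.synset('algebra.n.01')]))
```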
4 Compactness Measure Based on Mapping
In this section we will present the details of the mapping from the ontology to a vector space. As our main aim is to introduce a compactness measure, we are interested in preserving our ability to measure semantic distances in the new vector space.

4.1 Mapping of Tree Ontology to Vector Space
When mapping an ontology to a new space (a vector space) we aim at preserving our ability to measure semantic distances in the new space (as we can in the ontology by means of the least common ancestor, etc.). The mapping of the ontology to a vector space provides us with the capability of using geometrically interpretable compactness measures (such as by means of the centroid). We will now present the main notions and definitions for mapping the concepts from the tree ontology to a vector space. The structure of the vector space allows for the calculation of a geometrically interpretable compactness measure with the use of the centroid. Since the compactness measure that we aim to use in the vector space relies on the distances between the vectors, our main goal in the mapping procedure will be to preserve our ability to measure semantic distances in the vector space using standard vector space distances (such as the Euclidean, Manhattan, etc.). We will first define the vector space in which the concepts will reside.

Definition 2. Let O be an IS-A tree ontology. We define the Ontology vector space VO as an n-dimensional real vector space (where n is the number of edges of the tree ontology), in which each edge of the ontology corresponds to a dimension of VO through the function corr(i, j); corr(i, j) denotes that the ith dimension corresponds to the jth edge.

Now we will define the exact process with which the concepts in the ontology are mapped to the vector space.

Definition 3. Let O be an IS-A tree ontology and V be a vector space. We define a function from the ontology O to V as fg : O → V with fg(c) = (x1, . . . , xn). If (e1, . . . , ek) are the weights corresponding to the edges of the path of c to the root, and corr(i, j) denotes the correspondence of the ith
dimension to the jth edge, we will have that xi = g(ej), where g is a function that maps edge weights to the corresponding dimensions. For the remaining xj (no edge corresponds to these xj) we will have xj = 0. We will refer to fg(c) as concept vectors.

As we have discussed in the Related Work section, the semantic distances defined on an IS-A ontology depend on the size of the shortest path from c1 to c2 through a common ancestor, or on the size of the path from the least common ancestor of c1 and c2 to the root of the ontology. In the following propositions we will show that there exist mappings fg and standard distance and similarity measures in the VO vector space (i.e. the inner product, the Euclidean distance) such that the distance of two concept vectors fg(c1) and fg(c2) depends on the size of the shortest path from c1 to c2, or on the size of the path from the least common ancestor of c1 and c2 to the root of the ontology. We will also show that, in the cases of the Resnik measure and the Jiang-Conrath measure, there exist mappings and measures in the vector space that produce exactly the same results.

Proposition 1. Let O be an IS-A tree ontology and VO be the vector space that corresponds to O. Then there exists a mapping function fg such that the inner product of two concept vectors c1 and c2 in VO, ⟨fg(c1), fg(c2)⟩, is equal to the size of the path from lca(c1, c2) to the root.

Proof. We consider the mapping fg from the ontology to VO such that for the weights of the edges ej we have g(ej) = √ej. The dot product in vector spaces is defined as ⟨fg(c1), fg(c2)⟩ = Σ xi yi, where xi and yi are the coordinates of fg(c1) and fg(c2). From the embedding procedure, the concept vectors have common coordinates for the dimensions that correspond to the path from lca(c1, c2) to the root, and for every other dimension either xi = 0 or yi = 0. Thus, we can write ⟨fg(c1), fg(c2)⟩ = Σ xi², where the xi are the coordinates of the dimensions that correspond to the edges of the path from lca(c1, c2) to the root. From the embedding procedure these xi correspond to the weights of the edges of the path from lca(c1, c2) to the root. Thus, if ei are the weights of the edges of the path from lca(c1, c2) to the root of the ontology, we have g(ei) = √ei and we can write ⟨fg(c1), fg(c2)⟩ = Σ ei, where the ei are the weights of the edges that belong to the path from the least common ancestor to the root of the ontology. Thus we have shown that there exists a mapping fg : O → VO such that the inner product in VO is equal to the size of the path from the least common ancestor to the root of the ontology.
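The construction of Proposition 1 can be checked on a toy tree ontology. The sketch below is only illustrative — the node names and unit weights are ours, not the paper's: each concept is mapped to a vector holding √(edge weight) on every edge of its path to the root, and the dot product of two concept vectors then equals the weight of the path from their least common ancestor to the root.

```python
import math

# Toy IS-A tree: child -> (parent, edge weight); the root ('entity') has no entry.
tree = {
    'animal': ('entity', 1.0),
    'dog':    ('animal', 1.0),
    'poodle': ('dog',    1.0),
    'cat':    ('animal', 1.0),
}
edges = sorted(tree)                         # one vector dimension per edge
dim = {e: i for i, e in enumerate(edges)}    # corr(i, j): dimension <-> edge

def f_g(concept, g=math.sqrt):
    """Map a concept to VO: g(edge weight) on every edge of its path to the root."""
    v = [0.0] * len(edges)
    while concept in tree:
        parent, w = tree[concept]
        v[dim[concept]] = g(w)
        concept = parent
    return v

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# lca(poodle, cat) = animal, whose path to the root has total weight 1.0.
print(dot(f_g('poodle'), f_g('cat')))   # -> 1.0
```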
As a special case of this proposition we can show that there exists a mapping fg such that the inner product is equal to the Resnik similarity measure:

Proposition 2. Let O be an IS-A tree ontology. Then there exist a function fg and a weighting scheme for the ontology such that the Resnik similarity measure is equal to the dot product in VO.

Proof. The weighting scheme that we consider assigns to an edge (v1, v2), where v1 is more general than v2, the weight IC(v2) − IC(v1). We consider a function fg such that g(ei) = √ei. In VO the dot product is ⟨fg(c1), fg(c2)⟩ = Σ xi yi. From the embedding procedure, the concept vectors have common coordinates for the dimensions that correspond to the path from lca(c1, c2) to the root, and for every other dimension either xi = 0 or yi = 0. Thus we can write:

⟨fg(c1), fg(c2)⟩ = IC(lca(c1, c2)) − IC(father(lca(c1, c2))) + · · · + IC(child(root)) − IC(root) = IC(lca(c1, c2)) − IC(root)

Since we can assume that the probability of the root is 1, its information content is 0, and therefore ⟨fg(c1), fg(c2)⟩ = IC(lca(c1, c2)). Thus we have shown that there exist a function fg and a weighting scheme for the ontology such that the inner product of the VO vector space is equal to the Resnik measure.

Proposition 3. Let O be an IS-A tree ontology and VO be the vector space that corresponds to O. Then for each Minkowski distance there exists a function fg such that the Minkowski distance of two concept vectors c1 and c2 in VO is proportional to the shortest path between c1 and c2 in the ontology.

Proof. We consider the mapping fg from the ontology to VO such that for the weights of the edges ej we have g(ej) = ej^(1/p). For concepts c1 and c2 the Minkowski distance in VO is dp(fg(c1), fg(c2)) = Σ |xi − yi|^p, where xi are the coordinates of fg(c1) and yi are the coordinates of fg(c2). For the dimensions that correspond to edges above lca(c1, c2) we have xi = yi. For the dimensions that correspond to edges that do not belong to the path of either c1 or c2 we have xi = yi = 0. Thus one can write:

dp(fg(c1), fg(c2)) = Σ_{i ∈ edges(c1, lca(c1,c2))} |xi|^p + Σ_{i ∈ edges(c2, lca(c1,c2))} |yi|^p
From the definition of the g function we can derive that dp(fg(c1), fg(c2)) is equal to the sum of the weights of the edges of the shortest path from c1 to c2. Thus we have shown that for any Minkowski distance there exists an fg such that the Minkowski distance in the vector space depends on the size of the shortest path in the ontology.

For a special case of the above proposition (Minkowski distance with p = 1), we will show that there exists a mapping such that it is equal to the Jiang-Conrath measure.

Proposition 4. Let O be an IS-A tree ontology and VO the vector space that corresponds to O. Then there exists a function fg such that the Manhattan distance (Minkowski distance with p = 1) of the VO vector space is equal to the Jiang-Conrath measure.

Proof. It can be derived in a straightforward manner from Proposition 3.

In this subsection we have described our mapping scheme that embeds the concepts of an ontology in a vector space such that we preserve the ability to measure semantic distances in the new space. The ability to measure semantic distances is verified by the four propositions presented in this subsection. In the following subsection we will exploit the structure of the vector space in order to define a geometrically interpretable compactness measure with the use of the centroid.

4.2 Compactness Measure
A widely used method for measuring the compactness of a set of vectors is by means of their centroid:

c = (1/n) Σ xi

We will utilize the centroid in order to define a common distance measure for a set of vectors, which measures the density of the senses in the vector space. More precisely, the common distance of a set of vectors is defined as the sum of the squared distances of the vectors to the centroid, and the most compact set of vectors is the set with the smallest common distance:

cd({xi}) = Σ d²(xi, c)

4.3 Mapping of Non-Tree Ontologies to Vector Spaces
As clearly indicated, the theory discussed in the previous subsection applies only to tree ontologies. However, the most widely used ontology, WordNet, is not a tree, and a concept in WordNet can have multiple paths to the root of the ontology. The existence of multiple paths to the root would prevent us
from performing the embedding procedure described in the previous section. In order to overcome this problem, we will consider multiple versions of a concept in the vector space. Each version will correspond to a distinct path of the concept to the root of the ontology. It can be easily observed that not all distances in the vector space are “valid”. They are not “valid” in the sense that they do not correspond to distances in the original ontology (i.e. multiple versions of a concept will have non-zero distance). Thus, in order to construct the centroid for the set of concept vectors, we need to consider only the “valid” distances in the vector space. In order to address the problem we connect all the nodes that correspond to “valid” distances in the original ontology. This can be considered as a graph, where edges correspond to the “valid” distances. Then we can define the centroid of this structure as the centroid of the edges’ centroids. The centroid so defined will only consider the “valid” distances in the vector space and thus it can be utilized in order to define a compactness measure in a similar manner as in the previous subsection.
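Continuing the toy mapping sketched after Proposition 1, the centroid-based common distance of Sect. 4.2 can be written as follows. This is again only an illustration; the handling of multiple concept versions for non-tree ontologies described above is not implemented here.

```python
def centroid(vectors):
    n = len(vectors)
    return [sum(coords) / n for coords in zip(*vectors)]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def common_distance(vectors):
    """Sum of squared distances of the concept vectors to their centroid."""
    c = centroid(vectors)
    return sum(squared_distance(v, c) for v in vectors)

# Using f_g from the previous sketch: {poodle, dog} is more compact than {poodle, cat}.
print(common_distance([f_g('poodle'), f_g('dog')]))   # -> 0.5
print(common_distance([f_g('poodle'), f_g('cat')]))   # -> 1.5
```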
5 Experiments
In this section a series of initial experiments is described with respect to the application to WSD of the proposed compactness measure that relies on the graph structure of the ontology, presented in Sect. 3. For the purposes of our experiments we used WordNet 1.7.1 [10] as our concept hierarchy and a set of texts semantically annotated with this version of WordNet. The set we used is SemCor 1.7.1, downloadable from [20], which is a subset of the Brown corpus. SemCor 1.7.1 contains 186 semantically tagged Brown Corpus files, with all content words tagged with WordNet 1.7.1 senses, and 166 semantically tagged Brown Corpus files with only the verbs tagged. In our experiments we only considered nouns and, from the 186 files, we chose the same 4 files that were chosen in [21], so that our experiments would be comparable with the unsupervised WSD method proposed by Agirre and Rigau. The chosen files were br-a01 (“a” standing for the genre “Press:Reportage”), br-b20 (“b” standing for the genre “Press:Editorial”), br-j09 (“j” standing for the genre “Learned:Science”) and br-r05 (“r” standing for the genre “Humour”). The distribution among the 4 texts of the nouns that were semantically tagged with senses contained in WordNet 1.7.1 is presented in Table 1.
Table 1. Distribution of nouns contained in WordNet 1.7.1 among the 4 texts

Text      Nouns Contained in WordNet 1.7.1
br-a01    485
br-b20    387
br-j09    621
br-r05    446
Total     1939
5.1 Experimentation Setup
In the series of experiments that follow, we utilized the hypernym-hyponym relation as the link among the WordNet senses. In order to disambiguate the noun words, we examined them in windows of adjacent words and then found the most compact set of senses representing the words. By selecting the words to be disambiguated in windows (a window of n words is a set of n adjacent words in a text) we came upon two different problems. The first problem concerned computational complexity: with large windows (e.g. windows of 20 words), if we examined all the possible combinations of the senses corresponding to these words, we would need to examine combinations in the order of magnitude of hundreds of millions. We tackled this problem by using simulated annealing [22], thus cutting down the number of examined combinations to the order of magnitude of a few thousands at most for each window. The second problem regarded the fact that, based on the hypernym-hyponym relation, WordNet 1.7.1 contains 9 different disconnected noun sense hierarchies. It could be the case that, in a single sense combination of a window of n words, the senses are not all located in the same WordNet hierarchy, thus not allowing for the compactness computation. That problem was partially tackled by considering the compactness in that case as the sum of the individual compactnesses. Even under that consideration, there could be cases where, in a given combination of senses, a sense is found alone (without any other sense of that combination) in a WordNet hierarchy. In this case the compactness measure could not be applied, since compactness holds only where at least two senses are connected through the hypernym-hyponym relation, which translates to at least two senses existing in the same WordNet hierarchy. In order to tackle those cases, we conducted experiments with large windows (windows of 20 and 30 noun words), thus increasing the probability that each sense in a given combination exists with at least one more sense in the same WordNet hierarchy. Even after increasing the size of the window, though, there were cases where senses in a given combination existed alone in a WordNet hierarchy. Since the window increment could not solve this problem fully, we decided that each such sense contributed a zero (0) to the total compactness. For the purposes of our experiments, and in order to evaluate the behavior of the compactness measure in all the aforementioned situations, we conducted three series of experiments. In all three series, simulated annealing was applied. In the first series, we executed WSD for window sizes varying from 3
to 5, without allowing, in any given combination of senses, a sense to be found alone in a WordNet noun hierarchy. In the second series of our experiments we conducted WSD for window sizes varying from 3 to 10, the difference from the previous series being that at most one sense was allowed to be left alone in a WordNet noun hierarchy in any given combination of senses. The third series of experiments was executed on large windows (noun word windows of size 20 and 30), where intuitively we believed our compactness measure would reach its top performance for large coverage. In this last series, we allowed all scenarios in any given combination of senses. In the following section the experimental results are presented and discussed.

5.2 Experiments Results and Evaluation
In Table 2, the results from the first series of experiments are presented.

Table 2. Precision for the first series of experiments, for window sizes 3–5

Window Size   Disambiguated Nouns   Coverage   Ambiguous Nouns   Precision
3             177                   9,12%      61                88,13%
4             113                   5,82%      35                87,61%
5             64                    3,3%       21                87,50%
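The window-based search of Sect. 5.1 can be sketched as follows. This is not the authors' code: the neighbourhood move, cooling schedule and parameters are illustrative assumptions, and `compactness` stands for one of the measures defined in Sects. 3–4 (lower values meaning a more compact, i.e. preferable, combination of senses).

```python
import math
import random
from nltk.corpus import wordnet as wn

def disambiguate_window(words, compactness, steps=2000, temp=1.0, cooling=0.995):
    """Choose one sense per noun in the window so that the selected senses form
    the most compact combination found; simulated annealing avoids enumerating
    every possible combination of senses."""
    candidates = [wn.synsets(w, pos=wn.NOUN) for w in words]
    candidates = [c for c in candidates if c]            # skip words unknown to WordNet
    if not candidates:
        return []
    current = [random.choice(c) for c in candidates]     # random initial combination
    current_cost = compactness(current)
    best, best_cost = list(current), current_cost
    for _ in range(steps):
        i = random.randrange(len(candidates))            # perturb one word's sense
        proposal = list(current)
        proposal[i] = random.choice(candidates[i])
        cost = compactness(proposal)
        accept = cost < current_cost or random.random() < math.exp((current_cost - cost) / temp)
        if accept:
            current, current_cost = proposal, cost
            if cost < best_cost:
                best, best_cost = list(proposal), cost
        temp *= cooling
    return best
```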
From the results in Table 2, it is obvious that prohibiting the existence of senses left alone in a WordNet hierarchy, given a sense combination, cannot provide us with high coverage. The precision at this low coverage is naturally high. The observation that arose from this series of experiments is that our compactness measure behaves well when applied to WSD, but higher coverage was needed to document this. When trying to increase the window size in this first series of experiments (i.e. windows of size 6 and above), the coverage seemed to decrease. Thus, we conducted the second series of experiments, the results of which are presented in Table 3, where we allowed at most one sense to be left alone in a WordNet noun hierarchy in any given combination of senses.

Table 3. Precision for the second series of experiments, for window sizes 3–10

Window Size   Disambiguated Nouns   Coverage   Ambiguous Nouns   Precision
3             744                   38,37%     413               74,46%
4             361                   18,61%     168               78,67%
5             219                   11,29%     99                75,34%
6             177                   9,12%      93                77,4%
7             128                   6,6%       61                76,56%
8             83                    4,28%      37                75,9%
9             54                    2,78%      23                81,48%
10            41                    2,11%      16                82,92%

In this second series of experiments, we managed to increase the coverage while maintaining the precision at high levels. Allowing at most one sense to be left alone in a WordNet noun hierarchy did not seem to affect our precision much, which is rational, especially for the medium-size windows (i.e. windows of size 9 and 10). This second series of experiments proved encouraging, so we finally conducted a third series of experiments, in which any scenario with regard to senses left alone in a WordNet noun hierarchy was allowed. Intuitively, this would make sense if we incremented the window size, thus increasing the probability that the noun senses in any given combination existed with at least one more sense of this window in the same WordNet hierarchy. The results of this final series of experiments are presented in Table 4.

Table 4. Precision for the third series of experiments, for window sizes 20 and 30

Window Size   Disambiguated Nouns   Coverage   Ambiguous Nouns   Accuracy (Compactness)   Accuracy (C. Density)
20            1939                  100%       1371              56,73%                   60,1%
30            1939                  100%       1371              61,06%                   60,1%
The results of this experiment were now comparable with the C. Density measure of Agirre and Rigau [21], since full coverage was reported. Compactness precision was close to C. Density when a window size of 20 was selected, while it behaved better than C. Density for a window of 30 words. These initial three series of experiments show that the proposed compactness measure can be successfully applied to WSD tasks, while, in parallel, we intend to scale our experiments to larger windows and to more SemCor 1.7.1 documents.
6 Conclusions and Further Work
In this paper we have presented two compactness measures for calculating the similarity of a set of concepts that reside in a hierarchical ontology. The first measure relies on the graph theoretic notion of the Steiner Tree, while the other relies on the mapping of the ontology concepts to a vector space. We have conducted initial experiments in order to verify our approach for Word Sense Disambiguation, using the graph theoretic compactness measure and
SemCor 1.7.1. The experiments have produced encouraging results regarding the ability of our compactness measures to perform Word Sense Disambiguation. Concerning further work, we aim at conducting exhaustive experiments on the SemCor and Senseval [23] datasets, comparing both our approaches to other unsupervised learning algorithms for WSD. Moreover, we aim to investigate possible solutions (such as gloss overlaps) to overcome the problem of the existence of the 9 disconnected WordNet noun hierarchies.
References
1. Ide, N., Véronis, J.: Word Sense Disambiguation: The State of the Art. Journal of Computational Linguistics (1998) 24(1) 1–40
2. Sanderson, M.: Word Sense Disambiguation and Information Retrieval. In: Proc. of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval (1994) 49–57
3. Schütze, H., Pedersen, J.O.: Information Retrieval Based on Word Senses. In: Proc. of the 4th Annual Symposium on Document Analysis and Information Retrieval (1995) 161–175
4. Gonzalo, J., Verdejo, F., Chugur, I., Cigarran, J.: Indexing with WordNet synsets can Improve Information Retrieval. In: Proc. of the COLING/ACL'98 Workshop on Usage of WordNet for NLP (1998)
5. Stokoe, C., Oakes, M.P., Tait, J.: Word Sense Disambiguation in Information Retrieval Revisited. In: Proc. of SIGIR-03, 26th ACM International Conference on Research and Development in Information Retrieval (2003) 159–166
6. Kehagias, A., Petridis, V., Kaburlasos, V.G., Fragkou, P.: A comparison of Word- and Sense-based Text Categorization Using Several Classification Algorithms. Journal of Intelligent Information Systems (2003) 21(3) 227–247
7. Scott, S., Matwin, S.: Feature Engineering for Text Classification. In: Proc. of ICML-99, 16th International Conference on Machine Learning (1999) 379–388
8. Hotho, A., Staab, S., Stumme, G.: WordNet improves Text Document Clustering. In: Proc. of the SIGIR 2003 Semantic Web Workshop (2003)
9. Bloehdorn, S., Hotho, A.: Boosting for Text Classification with Semantic Features. In: Proc. of the SIGKDD 2004 MSW Workshop (2004)
10. Website: WordNet – a lexical database for the English Language. http://www.cogsci.princeton.edu/~wn/
11. Hwang, R., Richards, D., Winter, P.: The Steiner Tree Problem. Volume 53 of Annals of Discrete Mathematics (1992)
12. Sussna, M.: Word Sense Disambiguation for free-text indexing using a massive semantic network. In: Proc. of the Second International Conference on Information and Knowledge Management (1993) 67–74
13. Agirre, E., Rigau, G.: A proposal for word sense disambiguation using conceptual distance. In: Proc. of the 1st International Conference on Recent Advances in NLP (1995)
14. Banerjee, S., Pedersen, T.: Extended gloss overlaps as a measure of semantic relatedness. In: Proc. of the Eighteenth International Joint Conference on Artificial Intelligence (2003) 805–810
15. Resnik, P.: WordNet and class-based probabilities. In: C. Fellbaum (Ed.), WordNet: An electronic lexical database. MIT Press (1998) 239–263
16. Hirst, G., St-Onge, D.: Lexical chains as representations of context for the detection and correction of malapropisms. In: C. Fellbaum (Ed.), WordNet: An electronic lexical database. MIT Press (1998) 305–332
17. Jiang, J., Conrath, D.: Semantic similarity based on corpus statistics and lexical taxonomy. In: Proc. of the International Conference on Research in Computational Linguistics (1997) 19–33
18. Leacock, C., Chodorow, M.: Combining local context and WordNet similarity for word sense identification. In: C. Fellbaum (Ed.), WordNet: An electronic lexical database. MIT Press (1998) 265–283
19. Lin, D.: Using syntactic dependency as a local context to resolve word sense ambiguity. In: Proc. of the 35th Annual Meeting of the Association for Computational Linguistics (1997) 64–71
20. Website: UNT Center for Research on Language and Information Technologies. http://mira.csci.unt.edu/downloads.html
21. Agirre, E., Rigau, G.: Word Sense Disambiguation Using Conceptual Density. In: Proc. of COLING-96 (1996) 16–22
22. Cowie, J., Guthrie, J., Guthrie, L.: Lexical disambiguation using simulated annealing. In: Proc. of the 14th International Conference on Computational Linguistics (1992) 359–365
23. Website: Senseval Web page, http://www.senseval.org
A Strategic Roadmap for Text Mining
Georgia Panagopoulou
Quantos S.A.R.L
[email protected]
Abstract. A roadmap is typically a time-based plan that defines the present state, the state we want to reach and the way to achieve it. This includes the identification of exact goals and the development of different routes for achieving them. In addition, it provides guidance to focus on the critical issues that need to be addressed in order to meet these objectives. The roadmap of NEMIS aims at preparing the ground for future Text Mining RTD activities by investigating future research challenges and defining specific targets. For the development of the roadmap of NEMIS a “scenario-driven approach” has been used, meaning that several scenarios for potential future applications concerning Text Mining have been developed. These scenarios were used to reflect emerging user needs, combine them with key technologies and provide a snapshot of the future. The produced roadmap has also shown possible ways of realising these scenarios and identified the directions for future technology evolution.
1 Introduction A roadmap is a time-based plan, which defines the situation we want to reach (with regard to a specific topic) and designs the paths for reaching this situation, on the basis of the current state. As an example of a roadmap type, technology roadmapping is a “needs-driven” technology planning process to help identify, select and develop technology alternatives to satisfy a set of product needs. Roadmaps are a necessary mechanism for industries and companies to take an extended look into the future, identify emerging requirements, foresee new technologies and product innovation and prepare for this evolution. There are many methodologies, models and concepts on the subject of roadmapping and international literature presents a vast pool of resources on the topic. One of the most interesting examples is the science and technology roadmap, since this type focuses on trends and trajectories. Science and technology roadmaps are developed with the purpose of identifying or setting future trends and industry targets. These roadmaps are often
R&D oriented and are used as a means of forecasting advances in Science and Technology without focusing on specific industries. Having this background in mind, the Text Mining roadmap developed under the framework of NEMIS project aimed at preparing the ground for future Text Mining RTD activities by investigating future research challenges and reflecting a snapshot of the future. In contrast with other used methods, which start with a forecast of the future, the roadmap process starts with the endpoint (the situation to be reached) clearly identified and then designs the potential technology paths to achieve it. For the development of the roadmap of NEMIS, a “scenario-driven approach” has been used. In this approach, different scenarios for potential future applications concerning Text Mining have been developed. The aim of these scenarios has been to envisage emerging requirements, reflect potential user needs, combine them with the underlying technologies and provide a vision of the future. The produced roadmap has demonstrated possible ways of realising these scenarios, based on the use of the core supporting technologies.
2 Outline of Roadmap Development Methodology For the development of the Text Mining roadmap, two main stages can be identified.
Fig. 1. Analysis Methodology (user requirements collection and analysis; UR consolidation; core supporting technologies; technologies vs. UR matching)
In the first stage of the roadmap production, the existing and emerging user requirements have been analysed, categorised and consolidated in one unified list, while the supporting enabling technologies, with regard to text mining, have been clarified and studied. For achieving this goal, a set of predefined analysis dimensions has been drafted on the basis of international experiences. Finally, the identified user requirements have been matched with the enabling technologies and the scenario management approach has been designed. Based on this approach, a set of key indicative scenarios, which cover different aspects of the user requirements and the key enabling technologies, has been selected. At the final stage, the scenario management approach has been extended in many ways, focusing on various aspects not considered so far (e.g. the scenarios have been analysed, in terms of technological and economic aspects
and specific technology roadmaps have been outlined for each of the scenarios, as well as for the enabling technologies as a whole). Moreover, multiple technology dependencies have been pinpointed, while the roadmap has been validated against other similar ongoing experiences. Finally, one integrated roadmap has been proposed, on the basis of which specific recommendations for future research have been outlined. In brief, for the production of the roadmap we used as input:

• User requirements regarding Text Mining usage and applications, which have been collected on the basis of the results of the market survey, the case studies produced by the project’s working groups and the input available from the network members;
• A list of core supporting technologies, necessary for the evolution of TM;
• The scenario management approach defined in the relevant literature.

Next to this, we have provided the following outputs:

• An outline of visionary scenarios;
• Identification of the emerging user requirements (from the scenarios);
• An outline of how technologies should evolve, in order to fulfil the needs emerging from the scenarios;
• A time-plan for the required evolution of the core technologies, in order to meet the emerging user requirements.
3 Input Analysis Methodology For the production of the Text Mining Strategic Roadmap, we have followed a well-defined analysis methodology, which included a set of concrete steps. In brief, these were the following: • User requirements analysis; User requirements analysis has been based mainly on the results of an extended market survey, implemented within the framework of NEMIS project, the case studies implemented by consortium partners, the literature review and the individual inputs from consortium partners. The aim was to identify the most critical characteristics that text mining solutions should present to their users. • Presentation and analysis of the core technologies; The analysis of the core enabling technologies corresponded to the selection of those technologies that support the evolution of text mining and the identification of the critical characteristics of each of them. • Matching technologies with user requirements. This last step of the analysis methodology consisted of the consolidation of the user requirements and their matching against the key selected technologies, in the framework of text mining applications and tools.
3.1 Analysis of User Requirements Four major sources have been used for obtaining a clear view of the user requirements and analysing them: • Results of the market survey; • Support from NEMIS partners and literature review in order to identify available user requirements studies for text mining; • Analysis of the case studies performed by the partners; • Validation of the results in partners’ meetings as well as in conferences and workshops (NEMIS annual conference in Rome, NEMIS workshop in Barcelona) User Requirements Categorisation and Consolidation A first extended list of user requirements was drafted as a result of the market survey. This list was further analysed and processed, in order to conclude to a list of categorised requirements. The approach used for the consolidation of the available input sources with the resulting requirements of the NEMIS requirements analysis has been based on the following concrete tasks: • Implementation of case studies by partners, analysed with regard to: – Current information access problems; – Objectives and benefits of a TM application strategy; – Challenging theoretical research issues for evolution of text mining; – Main forces and reasons to use text mining; – Critical success factors regarding adoption of TM solutions; – Relevant TM product characteristics (today and in the future). • Transformation of results – After the analysis of the implemented case studies, both the results of the NEMIS market survey report and the results of these case studies have been transformed into the same format in order to become comparable. • Categorisation of the user requirements according to predefined categories – In order to classify the collected user requirements of TM, three main categories were identified: – “Information availability problem solving” requirements (general requirements for solving knowledge and information access problems, related to the general goals and motivations for using TM solutions); – “Technology” requirements (functional and technical requirements for TM systems and modules); – “Functionality/usability of systems” requirements. • Consolidation of requirements – As a final step the transformed and categorised requirements were compared and the redundant ones have been eliminated. This transformation aimed at creating a more consistent and coherent overall picture of the situation.
Aggregated User Requirements
Due to several factors (i.e. the length of the list of selected user requirements, the redundancy of some of them, the high degree of detail and specialisation, etc.), all user requirements of the three main requirement categories were further processed and clustered into a “super”-category of “common user requirements”. This was an intermediate step between the technology vs. requirements matching and the analysis of the four selected key scenarios. The result of this process was the following list of 11 core requirements:

• Cost efficiency
• Decision support
• Individual competency building
• Innovation
• Integration
• Knowledge sharing
• Accuracy
• Confidence
• Improve productivity
• Information reuse
• Usability
These aggregated user requirements formed the basis for the analysis of the scenarios and the roadmap production. 3.2 Enabling Technologies In this section we will present and analyse the key technologies that drive text mining, as they have been identified based on the results of the user requirements analysis and the analysis of market survey. Analysis of Core Technologies The analysis of the core supporting technologies was based on a set of predefined criteria (analysis dimensions), such as the existence of methods and algorithms, the existence of standards, the scalability of the technologies, etc. These were clearly defined in the relevant literature and are presented in the Table 1. Identified Core Technologies During the production of the strategic roadmap, a first indicative set of core supporting technologies has been outlined. This was composed of the following:
Fig. 2. Dimensions of Analysis: Methods & Algorithms, Standards, Applicability, Costs, Market
• Information retrieval/extraction
• Data mining technologies (algorithms etc.)
• Intelligent agents
• Distributed storage and retrieval
• Natural language processing
• Multi-lingual processing
• Statistical analysis methods
• Semantic web/Ontologies
• Knowledge discovery
• Mobility
The scenarios and the roadmap that were developed next focused on these technologies and their potential evolution.

Table 1. Criteria for the analysis of enabling technologies

Analysis Criteria – Description
• Methods/algorithms: Are there any well-accepted methods? Are these methods part of commercial products?
• Standards: Are there any well-accepted standards? Are there competing standardisation organisations? Are these standards supported by scientific community and/or industry?
• Applicability: Are there commercial products available? How robust are they?
• Costs: What is the cost to introduce this technology? What are the maintenance costs? What is the total cost for ownership and ROI (return on investment)?
• Market and reference applications: Is there any market? Are there successful reference applications/application fields?
3.3 The NEMIS Scenario Management Approach
For the scenario management in NEMIS, well-accepted methodologies found in the literature were adopted and followed. A short description of the different stages of these methodologies follows.

Scenario Preparation
The scenario preparation is generally the first phase of the scenario management. In the case of the NEMIS scenario development the main objective has been to develop unique and representative scenarios, which would take into account both the identified user requirements as well as the selected core technologies. Normally, the timeframes for scenarios are distinguished into three main categories, namely:

• Short-term (2003–2004)
• Medium-term (2004–2007) and
• Long-term (2008–2010)

Scenario Field Analysis
The scenario field analysis encapsulates the snapshot of the future, by analyzing the existing and emerging influence parameters. The objective is to identify for each of the scenarios the underlying user requirements and the involved technologies. For defining the analysis dimensions, both the results of the user requirements analysis and consolidation were taken into consideration. On the basis of the STEEP+C approach of the European KM Forum, which identifies the force field dimensions (society, technology, environment, economy, politics, culture), the analysis dimensions have been extended on the basis of these driving force fields. Furthermore, a decision was made to focus on only two of the above dimensions, namely “technology” and “economy”, since these two fields are those that primarily affect technological evolution. With this background in mind, the technological factors were categorised into three main categories:

• The key enabling technologies for realising the selected scenario in general
• Further technology requirements and most pressing and challenging theoretical research issues considering the different time horizons
• The required integration of the key enabling technologies.

In the same way, the economic factors have been categorised into:

• Benefits and risks
• Added value for a specific technology/application
• Costs for research and implementation
• Critical success factors
Projections
This is the phase in which the actual “foresight” is performed. This implies that at this stage, future development possibilities are identified and analysed, on the basis of the previously defined key factors. This is an extremely critical phase, since it influences the whole quality of the further work (i.e. the scenarios).

Scenario Building
At this stage, the objective is to develop scenarios in which the alternative future projections fit well to each other. The scenario-building phase is divided into the following four sub-phases:

• Bundling of projections
• Building of raw scenarios
• Future area mapping
• Raw scenario interpretation
3.4 The NEMIS Scenarios
In the framework of NEMIS project, four key scenarios have been drafted, which formed the basis of the text mining roadmap development. These scenarios were analysed and studied on the basis of the two main analysis axes discussed previously, namely technology and economy. The target has been to define disjoint, but innovative and imaginative scenarios, covering different facets of the user requirements, as well as of the selected technologies. These scenarios were intended to be general in nature and they were the following:

• Dynamic newspaper
• On-line (mobile) problem solving
• Ubiquitous business intelligence
• Internet-based statistical data collection, via web services
Especially this last scenario has been based on one of the case studies prepared and presented by the NEMIS partners.
4 The NEMIS Roadmap
The goal of the NEMIS roadmap has been to provide an integrated view of the future science and technology landscape concerning text mining evolution. In this roadmap, three levels of evolution of the analysed technologies along a timescale from now until the year 2010 have been defined:
• Phase 1: Short-Term Future (2004–2005)
• Phase 2: Mid-Term Future (2006–2007)
• Phase 3: Long-Term Future (2008–2010)

For the analysis that follows, tables visualizing the required technology evolution per field were developed. Moreover, three different types of required actions for further evolution of the selected technologies have been defined, namely:

• Basic research – the technology is not very advanced yet; basic research is required in order to solve basic problems.
• Applied research – the technology is quite mature already; some further research is required in order to solve problems that appear during implementation.
• Software technology – the technology is already very mature and stable; no basic research is foreseen for the forthcoming years.

4.1 Technologies Evolution Forecast
For the technologies that have been selected and described previously, the performed analysis predicted that the situation within the years to come would evolve as follows:

• Information retrieval/extraction – Information retrieval: applied research for different application problems, e.g. cross-language information retrieval (which means using queries in one language in order to search for documents in a different language). Information extraction: basic research, since many issues still remain open.
• Data mining technologies (algorithms etc.) – Applied research will be required in order to solve problems related mostly to the grammatical and syntactical analysis of documents.
• Intelligent agents – Although the technology is already quite mature, basic research will be required for specific problems, e.g. automatic hyperlinking, robust inferencing, etc.
• Distributed storage and retrieval – Some topics are still under development and basic research will be required in fields like security and trust, authentication, personalisation, etc.
• Natural language processing – NLP is an area of continuous research and development, since language is flexible and dynamic, so NLP technologies should adapt to new situations. For this reason both basic and applied research will be needed in different areas of NLP, like for example document generation, robust document classification, effective natural language query interpretation, etc.
• Multi-lingual processing – Multilingual processing is one of the most challenging fields for text mining. There is a great need for more research in the area in order to solve many old and new problems and demands. These include topics like multilingual querying, multilingual document retrieval, reduction of ambiguity and language independent representations.
• Statistical analysis methods – Due to its nature, statistical analysis methods will always be a field of applied research. However, the degree to which this technology has been developed so far is considered very satisfactory for the field of text mining. Therefore, no basic or applied research would be required; rather, only software technology will be used.
• Semantic web/Ontologies – Although the technology is already quite mature, basic and applied research will be required for some issues. These include topics like rule-based semantics, multiple ontologies, semantic bookmarks (filtering), semantic similarity, heterogeneous ontology querying, ontology-based reference models, etc.
• Knowledge discovery – Although great progress has been achieved lately in the field of knowledge discovery, research will be needed in various aspects of the technology, like for example relational unsupervised data mining, end user data mining, ontology learning, data mining on mixed data, etc.
• Mobility – Although this area is currently under rapid development, the following themes will require special attention: wireless networking, trust and security, bandwidth usage and limitations.

4.2 Time-based Evolution Forecast

NEMIS Key Scenario I – Dynamic Newspaper
This is one of the most up-to-date scenarios, since it is based on some mature technologies and it is less “visionary” than other scenarios. Many “knowledge workers” follow similar approaches for performing their daily tasks. Still, some technologies that affect the full exploitation of underlying techniques need to be further elaborated and some technical problems are yet to be solved.
To start with, the Semantic Web can be mentioned, which, though attractive and widely accepted, is still far from being considered as a mature technology. Problems exist in content creation and organization, especially with regard to supporting information (e.g. metadata). Moreover, the Semantic Web depends on the development of other technologies like NLP and Knowledge Discovery. With regard to Mobility, in the short term it is expected that many devices would become available supporting new features, but still the issues of security, privacy, trust and proof will remain open. It is expected that these and related issues will be more clearly defined and taken into consideration in the mid-term future.

Table 2. NEMIS Key Scenario I – Technology Evolution (timeline 2003–2010 for: Information retrieval/extraction, Data mining technologies, Intelligent agents, Distributed storage and retrieval, Natural language processing, Multilingual processing, Semantic web/Ontologies, Knowledge discovery, Mobility)
NEMIS Key Scenario II – On-Line (Mobile) Problem Solving
This scenario is based on aspects of technologies, some of which are already well advanced, like statistical analysis methods. Still, some basic research is required in individual dimensions of the technologies. The evolution of the technologies is quite different from what was present in the previous scenario, due to the fact that different parameters of the same technology are required to evolve in this case. The technology evolution framework is presented in the next table.

Table 3. NEMIS Key Scenario II – Technology Evolution (timeline 2003–2010)
NEMIS Key Scenario III – Ubiquitous Business Intelligence
In this scenario, two basic characteristics that affect the technology roadmap were identified. First of all, the technologies involved are quite diverse. Although there are obvious synergies possible (like for instance between the Semantic Web and NLP), their development is by no means aligned. In addition, in most cases basic research takes place within academic and scientific communities, therefore, the knowledge and technology transfer to market can take a long time. On the other hand, some of the technologies are already well developed, represented by numerous products, which share functions and objectives. Those products vary a lot in complexity, maturity, price and function.

Table 4. NEMIS Key Scenario III – Technology Evolution (timeline 2003–2010 for: Information retrieval/extraction, Data mining technologies, Intelligent agents, Distributed storage and retrieval, Natural language processing, Statistical analysis methods, Semantic web/Ontologies, Knowledge discovery)
NEMIS Key Scenario IV – Internet-Based Statistical Data Collection
This was the most pressing scenario, since it responds to a real need of some of the most demanding users. The big advantage in this case was the fact that the required dimensions of the enabling technologies are already quite well advanced and there are only minor issues that require basic research. Contrary to what was common in the previous cases, in this special scenario political decisions and support are probably more important than technology evolution itself. For example, in the case of multilingual processing, the requirements could be met simply with the development of common terminology vocabularies.
Table 5. NEMIS Key Scenario IV – Technology Evolution (timeline 2003–2010 for: Information retrieval/extraction, Data mining technologies, Intelligent agents, Natural language processing, Semantic web/Ontologies, Multilingualism)
5 Summary and Conclusions In this paper, the methodology followed for the production of a strategic roadmap for text mining has been presented, together with the results of this process. The results of the different stages of the process (user requirements analysis, core technologies analysis, scenarios management, etc.) were outlined. Finally, the selected scenarios were studied in terms of technological and economic aspects and specific technology roadmaps for each of the scenarios were produced, as well as for the enabling technologies as a whole. In any case, the work related to the development of technology roadmaps for text mining can in no terms be considered as finished, since the continuously emerging user requirements and the evolution of technologies alter the parameters of this problem.
Text Mining Applied to Multilingual Corpora Federico Neri and Remo Raffaelli Research & Development Dept. SYNTHEMA S.r.l., Lungarno Mediceo 40, I56127 Pisa (PI), Italy {neri, raffaelli}@synthema.it
Abstract. Up to 80% of electronic data is textual, and the most valuable information is often encoded in pages which are neither structured nor classified. Documents are – and will be – written in various native languages, but these documents are relevant even to non-native speakers. Nowadays everyone experiences a mounting frustration when attempting to find the information of interest, wading through thousands of pieces of data. The process of accessing all these raw data, heterogeneous in language, and transforming them into information is therefore inextricably linked to the concepts of textual analysis and synthesis, hinging greatly on the ability to master the problems of multilinguality. Through Multilingual Text Mining, users can get an overview of great volumes of textual data through a highly readable grid, which helps them discover meaningful similarities among documents and find all related information. This paper describes the approach used by SYNTHEMA1 for Multilingual Text Mining, showing the classification results on around 600 breaking news items written in English, Italian and French.
1 The Methodology

1.1 Linguistic Preprocessing and Multilingual Resources Construction

Generally speaking, the manual construction and maintenance of multilingual language resources is undoubtedly expensive, requiring remarkable effort. The growing availability of comparable and parallel corpora has pushed SYNTHEMA to develop specific methods for the semi-automatic updating of lexical resources. They are based on Natural Language Understanding and Machine Learning. These techniques detect multilingual lexicons from such corpora by extracting all the meaningful terms or phrases that express the same meaning in comparable documents. As a case study, let us consider a corpus of around 350 parallel breaking news items written in English, French and Italian, used as a training set for the topic of interest. English has been used as the reference language. The major problem lies in the different syntactic structures and word definitions these languages may have, so direct phrasal alignment has often been needed. The subsequent bilingual morphological analysis – Italian vs English, French vs English – recognises as relevant terminology only those terms or phrases that exceed a threshold of significance. A specific algorithm associates an Information Quotient with each detected term and ranks it by importance. The Information Quotient is calculated taking into account the term, its Part of Speech tag, its relative and absolute frequency, and its distribution over documents. This morphological analysis detects significant Simple Word Terms (SWT) and Multi Word Terms (MWT), annotating their headwords and their relative and absolute positions. The SYNTHEMA strategy for multilingual dictionary construction rests on the assumption that, having taken into account a specific term S and its phrasal occurrences, its translation T can be automatically detected by analysing the corresponding translated sentences. Thus, semi-automatic lexicon extraction and storage of multilingual relevant descriptors become possible (see Figs. 1–2). Each multilingual dictionary, specifically suited for cross-lingual mapping, is bi-directional and contains multiple coupled terms f(S, T), stored as Translation Memories. Each lemma is referenced to syntax- or domain-dependent translated terms, so that each entry can represent multiple senses. Besides, the multilingual dictionaries contain lemmas together with simple binary features, as well as sophisticated tree-to-tree translation models, which map – node by node – whole sub-trees [1]. For this case study, the multilingual dictionary is made of around 2,000 entries.

1 Established in 1994 by computer scientists from the IBM Research Center, with the expertise and skills to provide effective software solutions as well as to carry out R&D in the Natural Language Processing area, SYNTHEMA has been involved in Machine Translation, Information Extraction and Text Mining activities since 1996, primarily in the field of Technology Watch.

Fig. 1. Bilingual morphological and statistical analysis

Fig. 2. Terms matching and context visualization

1.2 Linguistic Analysis

The automatic Linguistic Analysis is based on Parsing, Morphological and Statistical rules. The Parsing analysis is based on a set of pre-defined rules which specify the most relevant fields in documents and their main features: the label identifying the field of interest, or the masks that need to be applied to extract the main information included in the field, or to split it into its components (e.g. an author field in a scientific article can normally be split into surname/name/company/place). The automatic linguistic analysis of free textual fields is based on Morphological and Statistical criteria. This phase is intended to identify only the significant expressions in the whole raw text. The analysis recognises as relevant terminology only those terms or phrases that comply with a set of predefined morphological patterns (e.g. noun+noun and noun+preposition+noun sequences) and whose frequency exceeds a threshold of significance (a minimal illustrative sketch of this step is given below). The detected terms and phrases are then extracted and reduced to their Part of Speech tagged base form [2, 3, 4, 5]. Once referred to their language-independent entry in the multilingual dictionary, they are used as descriptors for documents [6, 7]. Indexation based on terminology detection is extremely reliable for managing any type of documentation, especially if it is technical and scientific. In fact, unfortunately, few of us have complete knowledge about the world.
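As an illustration of the pattern-based term detection and ranking just described, the following minimal Python sketch extracts noun+noun and noun+preposition+noun candidates from POS-tagged text and scores them. It is not the SYNTHEMA implementation: the toy tag set, the tiny sample documents and the scoring formula (frequency weighted by document distribution and phrase length) are assumptions, since the exact Information Quotient computation is not specified here.

from collections import Counter, defaultdict
from math import log

# Toy input: each document is a list of (token, POS) pairs, assumed to come from an
# upstream morphological analyser (hypothetical tags: N = noun, P = preposition, A = adjective).
DOCS = [
    [("artificial", "A"), ("insemination", "N"), ("law", "N"), ("on", "P"), ("insemination", "N")],
    [("insemination", "N"), ("intervention", "N"), ("assisted", "A"), ("insemination", "N")],
    [("school", "N"), ("reform", "N"), ("law", "N"), ("on", "P"), ("school", "N")],
]
PATTERNS = [("N", "N"), ("N", "P", "N")]   # noun+noun and noun+preposition+noun sequences
MIN_FREQ = 2                               # threshold of significance

def candidate_terms(doc):
    """Yield single nouns and multi-word terms matching the morphological patterns."""
    words = [w for w, _ in doc]
    tags = [t for _, t in doc]
    for i, tag in enumerate(tags):
        if tag == "N":
            yield (words[i],)
        for pat in PATTERNS:
            if tuple(tags[i:i + len(pat)]) == pat:
                yield tuple(words[i:i + len(pat)])

term_freq, doc_freq = Counter(), defaultdict(set)
for d, doc in enumerate(DOCS):
    for term in candidate_terms(doc):
        term_freq[term] += 1
        doc_freq[term].add(d)

def information_quotient(term):
    """Simplified stand-in score combining frequency, document distribution and phrase length."""
    return term_freq[term] * log(1 + len(DOCS) / len(doc_freq[term])) * len(term)

ranked = [(t, information_quotient(t)) for t, f in term_freq.items() if f >= MIN_FREQ]
for term, score in sorted(ranked, key=lambda x: -x[1]):
    print(" ".join(term), round(score, 2))

Only candidates above the frequency threshold survive, mirroring the threshold of significance mentioned above.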
Fig. 3. Multilingual dictionary editing
Fig. 4. Translation memories
As a consequence, the meanings we ascribe to words may differ from those ascribed by others. The same happens with lexical tools capable of syntactic parsing, which always have a limited capability of semantic interpretation and disambiguation when applied to generic corpora. In such situations, these tools cannot pick out the exact interpretation for all expressions in the language. Besides, main terminology – mostly compound nouns – helps “understand” the topic, being intrinsically linked to semantics [8]. Figure 7 reports the clustering support for all the different parts of speech, highlighting the segmentation support of noun phrases. The indexed documents can then be exported directly to a database, where they can be searched, accessed and classified into thematic groups.

Fig. 5. Grammar and Statistical rules

1.3 Classification

The classification is made by Online Miner Light, an application developed by TEMIS2 jointly with SYNTHEMA, which fulfils the following requirements:
• Unsupervised Classification. The application dynamically discovers the thematic groups that best describe the detected documents, according to the K-Means approach (a minimal illustrative sketch of this step is given at the end of this section). This phase allows users to access documents by topics, not by keywords.
2
TEMIS was established in 2000 as a Technology & Consulting Company, specialized in Text Intelligence and Advanced Computational Linguistics to develop applications related to Competitive Intelligence, Customer Relationship Management and Knowledge Management.
Fig. 6. Information extracted
Fig. 7. Clustering support
• Hierarchical Classification. This makes it possible to explore thematic groups in depth, subdividing them into more specific themes.
The application provides a visual summary of the analysis (see Fig. 9). A map shows the different groups as differently sized bubbles (the size depends on the number of documents the bubble contains) and the meaningful correlations among them as lines of different thickness (i.e. level of correlation). Users can search inside topics and browse the documents populating the clusters. The output results can be viewed with a simple Web browser. As an example, let us classify all the 483 documents returned by a specific query on the application database. We obtain 10 well-defined clusters (see Fig. 8), dealing with terrorism and war (cluster 1), the Palestinian crisis (cluster 2), Italian politics (clusters 3, 4, 5), Italian schools (cluster 6), the economy (cluster 7), child kidnapping (cluster 8), illegal immigration (cluster 9) and general themes (cluster 10). Looking at the thematic network, the results are similar to what anyone would expect from reading this type of document: all the clusters regarding politics are linked together, as are all the documents about the Israeli-Palestinian crisis, peace and war, etc. When searching for “insemination” inside the “bubbles map”, the system highlights all the clusters that contain documents having “insemination” as a lexical descriptor, allowing users to access them (see Fig. 10). We obtain documents dealing with “inseminazione”,
Fig. 8. Clustering results
Fig. 9. Thematic map and search in topics
Fig. 10. Documents visualization
“fecondazione”, “legge sulla fecondazione”, “sterilità”, “fecondazione assistita”, “artificial insemination”, “insemination intervention”, etc.
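To make the unsupervised classification step of Sect. 1.3 concrete, the sketch below (which is not Online Miner Light) first maps surface terms from different languages to language-independent descriptors through a tiny, purely hypothetical dictionary, and then clusters the documents with TF-IDF weighting and K-Means from scikit-learn. The sample snippets and dictionary entries are invented for illustration only.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical multilingual dictionary: surface terms -> language-independent descriptor.
DICTIONARY = {
    "artificial insemination": "INSEMINATION", "inseminazione": "INSEMINATION",
    "fecondazione": "INSEMINATION", "terrorisme": "TERRORISM",
    "terrorismo": "TERRORISM", "terrorism": "TERRORISM", "guerra": "WAR", "war": "WAR",
}

def to_descriptors(text):
    """Replace known terms by their language-independent descriptor (longest terms first)."""
    out = text.lower()
    for term in sorted(DICTIONARY, key=len, reverse=True):
        out = out.replace(term, DICTIONARY[term])
    return out

docs = [
    "terrorism and war dominate the breaking news",       # English
    "il terrorismo e la guerra nelle notizie",             # Italian
    "le terrorisme au coeur de l'actualite",               # French
    "nuova legge sulla fecondazione e inseminazione",      # Italian
    "debate on artificial insemination legislation",       # English
]

X = TfidfVectorizer().fit_transform([to_descriptors(d) for d in docs])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for label, doc in zip(km.labels_, docs):
    print(label, doc)

Because the descriptors are shared across languages, documents about the same topic end up in the same cluster regardless of the language they were written in.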
2 Conclusions

This paper describes a new approach to Text Mining applied to multilingual corpora, together with a specific case study on around 600 English, French and Italian breaking news items, directly downloaded from MISNA, AGI and some French news agencies. Terminologies and Translation Memories make it possible to overcome linguistic barriers, allowing the automatic indexation and classification of documents, whatever their language. This new approach enables the search, analysis and classification of large volumes of heterogeneous documents, helping people cut through the information labyrinth. Since multilinguality is an important part of this globalised society, Multilingual Text Mining is a major step forward in keeping pace with the relevant developments in a challenging and rapidly changing world.
References
1. Cascini, G., Neri, F.: Natural Language Processing for Patents Analysis and Classification, ETRIA World Conference, TRIZ Future 2004, Florence, Italy.
2. Raffaelli, R.: An inverse parallel parser using multi-layered grammars, IBM Technical Disclosure Bulletin, 2Q, 1992.
3. Raffaelli, R.: Un ambiente per lo sviluppo di grammatiche basato su un parser inverso, parallelo e seriale, IBM Italy Scientific Centers Technical Report, pp. 1–19, 1992.
4. Marinai, E., Raffaelli, R.: The design and architecture of a lexical data base system, COLING’90, Workshop on advanced tools for Natural Language Processing, Helsinki, Finland, Aug 1990, 24.
5. Raffaelli, R.: ABCD – A Basic Computer Dictionary, Proceedings of ELS Conference on Computational Linguistics, Kolbotn, Norway, Aug 1988, 30–31.
6. Galli, G., Raffaelli, R., Saviozzi, G.: Il trattamento delle espressioni composte nel trattamento del linguaggio naturale. IBM Research Center, internal report, Pisa, Italy, pp. 1–19, 1992.
7. Elia, A., Vietri, S.: Electronic dictionaries and linguistic analysis of Italian large corpora. JADT 2000, 5th International Conference on the Statistical Analysis of Textual Data, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, pp. 2–4, 2000.
8. Neri, F.: Information Search and Classification to foster Innovation in SMEs. The AREA Science Park experience, Text Mining and its application, CRM and Knowledge Management, Advances in Management Information Vol. 2, ISBN: 185312-995-X, WITPress, Southampton (UK), forthcoming.
Content Annotation for the Semantic Web Thierry Poibeau Laboratoire d’Informatique de Paris-Nord, CNRS UMR 7030 and University Paris 13, 99, avenue J.-B. Clément – F-93430 Villetaneuse
[email protected]
Abstract. This paper is intended to show how an Information Extraction system can be recycled to produce RDF schemas for the semantic web [1]. We demonstrate that this kind of system must respect operational constraints, such as the fact that the information produced must be highly relevant (high precision, possibly poor recall). The production of explicit structured data on the web will lead to a better relevance of information retrieval engines.
1 Introduction

Information Extraction (IE) is a technology dedicated to the extraction of structured information from texts. This technique is used to highlight relevant sequences in the original text or to fill pre-defined templates [2]. A well-known problem of such systems is the fact that moving from one domain to another means re-developing some resources, which is a boring and time-consuming task (for example, [3] mentions 1500 hours of development). Moreover, when information is often changing (think of the analysis of a newswire, for example), one might want to elaborate new extraction templates. This task is rarely addressed by research studies in IE system adaptation, but we noticed that it is not an obvious problem. People are not aware of what they can expect from an IE system, and most of the time they have no idea how a template can be derived from a collection of texts. On the other hand, if they define a template themselves, the task often cannot be performed because they expect information that is not contained in the texts. In order to decrease the time spent on the elaboration of resources for the IE system and to guide the end-user in a new domain, we suggest using a machine learning system that helps define new templates and associated resources. This knowledge is automatically derived from the text collection, in interaction with the end-user, to rapidly develop a local ontology giving an accurate image of the content of the text. The experiment also aims at reaching a better coverage thanks to the generalization process provided by the machine learning system.
We first present the overall system architecture and principles. We then describe the learning system, which acquires semantic knowledge to help define templates for new domains. We show to what extent it is possible to speed up the elaboration of resources without any decrease in the quality of the system. We finish with some comments on this experiment and show how domain-specific knowledge acquired by the learning system, such as the sub-categorization frames of verbs, could be used to extract more precise information from texts.
2 Description of the Information Extraction System

The system architecture consists of a multi-agent platform. Each agent performs a precise subtask of the information extraction process. A supervisor controls the overall process and the information flow. The overall architecture is presented below. The system can be divided into five parts: information extraction from the structure of the text, a module for named entity recognition (locations, dates, etc.), semantic filters, modules for the extraction of specific domain-dependent information, and modules for filling a result template.
• Some information is extracted from the structure of the text. Given that the AFP newswire is formatted, wrappers automatically extract information about the location and the date of the event. This non-linguistic extraction increases the quality of the result, since it provides 100% correct values. It is also relevant given the current development of structured text (HTML, XML) on the web and other corporate networks.
• The second stage is concerned with the recognition of relevant information by means of a linguistic analysis. This module recognizes various named entities (person names, organizations, locations and dates) in the text. New kinds of named entities can be defined for a new domain (for example, gene names to analyze a genome database). We use the finite-state toolbox Intex to design dictionaries and automata [4].
• The third stage performs text categorization based on “semantic signatures” automatically produced from a rough semantic analysis of the text.
• The fourth stage extracts specific information (most of the time, specific relationships between named entities). It can be, for example, the number of victims of a terrorist event. This step is achieved by applying a grammar of transducers (extraction patterns) over the text.
• The next stage links all this information together to produce one or several result template(s) that present a synthetic view of the information extracted from the text. The template corresponding to the text is chosen among the set of all templates, according to the identified category of the text (registered by the system at the third analysis step).
A specific template is produced only if some main slots are filled (the system distinguishes between obligatory and optional slots). Partial templates produced by different sentences are merged to produce only one template per text. This merging is done under constraints on what can be unified or not. The results are then stored in a database, which exhibits the knowledge extracted from the corpus. In addition to the information extraction system itself, the architecture includes a machine learning module that can help the end-user produce resources for information extraction. The end-user who wants to define a new extraction template has to process a representative set of documents with the learning module to obtain an ontology and some rough resources for the domain he wants to cover.
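To show how the stages above fit together, the following sketch combines a non-linguistic wrapper for the formatted header, one extraction pattern standing in for the transducer grammar, and the merging of partial templates under a simple unification constraint. It is a schematic reconstruction only: the field names, the toy story and the merging rule are assumptions, not the actual multi-agent platform.

import re

# Toy newswire item with a formatted header, standing in for an AFP story.
STORY = """CITY: Paris
DATE: 2004-05-17
Two people were injured when a bomb exploded near the station, police said."""

def wrapper_extract(text):
    """Non-linguistic extraction from the structured part of the text."""
    header = dict(re.findall(r"^(CITY|DATE):\s*(.+)$", text, re.MULTILINE))
    return {"location": header.get("CITY"), "date": header.get("DATE")}

def extract_victims(sentence):
    """A single extraction pattern standing in for the transducer grammar."""
    m = re.search(r"(\w+) people were (injured|killed)", sentence)
    return {"victims": m.group(1), "status": m.group(2)} if m else {}

def merge(templates):
    """Merge partial templates; a slot is only unified with an equal or missing value."""
    result = {}
    for t in templates:
        for slot, value in t.items():
            if value is None:
                continue
            if slot in result and result[slot] != value:
                raise ValueError("conflicting values for slot " + slot)
            result[slot] = value
    return result

partials = [wrapper_extract(STORY)]
for sentence in STORY.splitlines()[2:]:
    partials.append(extract_victims(sentence))
print(merge(partials))
# {'location': 'Paris', 'date': '2004-05-17', 'victims': 'Two', 'status': 'injured'}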
3 IE Application Overview: Knowledge Extraction from Various Domains

Information Extraction (IE) is a technology dedicated to the extraction of structured information from texts. This technique is used to highlight relevant sequences in the original text or to fill pre-defined templates [2]. With the development of the semantic web, such tools appear very interesting for automatically extracting semantic information from existing web pages. The IE system we have developed has been applied to various domains, such as:
• Event-based extraction and indexing of the French AFP newswire. This multi-domain extraction system is currently running in real time on the AFP newswire. About 15 templates have been defined, covering about 30% of the stories. From the remaining 70%, the system only extracts surface information, mainly thanks to the wrappers. The performances are between .55 and .85 P&R, if we do not take into account the date and location slots that are filled by means of wrappers. New extraction templates are defined to prove system scalability.
• Event-based extraction from financial news stories (FirstInvest, a French financial website) and Business Intelligence in the communication domain.
• Extraction of gene interactions from the genomics database Flybase. This kind of base is structured by gene description, but researchers want to find relations among genes. In this context, the IE engine is intended to automatically produce a knowledge base about gene interactions from the analysis of free texts.
• Customer Request Management application (extracting information from emails reporting software problems). This last case poses the problem of email analysis and management. The language used in such texts is not as correct as it is in news stories. Specific grammar and orthographic relaxations must be applied to achieve relevant results.
Fig. 1. Information extraction from a tele-communication corpus
Fig. 2. Resource definition for an IE application from genomic texts
The last three applications concern texts from the Internet. FirstInvest is an electronic financial newswire available on the Web. Flybase, like other electronic databases in genomics, is a collection of public data freely available for researchers. The CRM application concerns a currently very popular area, which is also related to Knowledge Management. This domain is also largely dependent on the multilinguality problem.
4 Multilingual Named Entity Recognition

We present in this section a tool developed to analyze multilingual named entities. Resources have already been developed for the following languages: Arabic, Chinese, English, French, German, Japanese, Finnish, Malagasy, Persian, Polish, Russian, Spanish and Swedish.
Fig. 3. Overview of the CRM application (the information base in the foreground and an annotated text in the background)
4.1 Multilingualism Issues

Languages vary a lot in their characteristics, in their writing systems as much as in their grammar. Moreover, language technology is not well developed for most of them. This has a major consequence for named entity recognition: for certain languages, like most European languages, we benefit from already existing lexical resources; for other languages, a lot of work still needs to be done. For example, there is no dictionary available for Malagasy, and even electronic resources and corpora are rare. All the texts and resources we will describe are encoded using the Unicode standard (Unicode Little-Endian). This strategy allows most of the encoding problems to be solved, even if some bugs still remain from time to time for a given language (for example, writing direction problems in Arabic, when characters appear from left to right while it should be the contrary, etc.).

4.2 Overall System Architecture

In spite of differences in their implementation, the systems elaborated for the different languages share approximately the same architecture. The text is first analyzed by a classical rule-based system. This analysis is then completed by dynamic acquisition mechanisms (theory learning) and revision capabilities (see Fig. 4).
Fig. 4. Architecture of the system (the rule-based system performs (1) lexical analysis against a dictionary and (2) grammar application, complemented by (3) dynamic acquisition from the text and (4) revision mechanisms, producing the annotated text)
We detail below these four main knowledge sources:

Gazetteers
Their role has been disputed since the appearance of ML techniques allowing previously unknown named entities to be acquired from tagged corpora. However, most of the time it is simply not realistic to tag large amounts of corpus [5]. Moreover, tagging great amounts of data can be compared to the elaboration of dictionaries1.

Grammar
Its aim is to group together elements pertaining to the same entity. A grammar rule is generally made of a trigger word, some tagged words and occasionally unknown words. These words can be accurately tagged given an appropriate context (especially if a trigger word disambiguates the sequence).
1
If one analyzes a text to tag person names, it is then easy to write a simple program that will automatically extract the sequences previously tagged to generate a dictionary. In this sense, tagging is not that different from elaborating a dictionary!
Learning Capabilities
We include in this section the ML algorithms used to tag unknown named entities. Most ML techniques have been used, including maximum entropy, inductive logic programming, decision tree learning, hidden Markov models and others [6, 7, 8, 9]. We use a kind of theory learning to extend the set of expressions identified by the rule-based system: the lexicon and the grammar are exploited as a domain theory to dynamically find new entities [10].

Revision Capabilities
We implemented revision capabilities in the system so that it can revise tags in a certain context. For example, in an English text, isolated occurrences of Washington can be considered as location names. If one finds a context that potentially suggests another category for the named entity (for example, Mrs. Washington), the system will revise the initial tag and put the new category on the concerned word (isolated occurrences of Washington will then be tagged as person names). A minimal illustrative sketch of the trigger-word and revision mechanisms is given after the evaluation results below.

4.3 Implementation

Rule-based systems have been developed for English and French using the Intex/Unitex finite-state toolbox [4]. The resulting system has been described in [11]. Resources are currently being defined and adapted to other languages like Russian (Cyrillic alphabet) or Arabic and Persian (Arabic writing system). For Asian languages like Japanese, which makes use of four different writing systems (hiragana, katakana, kanji and romaji), Intex/Unitex was not efficient. Thus, Japanese is first processed by the Chasen morphological analyzer [16]. Perl scripts are then applied on top of the Chasen analysis to produce a tagged text with highlighted named entities. Even if the Chasen analyser uses the JIS format, the final output is encoded using the Unicode standard. Once the system is adapted, the same strategy is applied to the different languages: a set of trigger words is defined, along with a proper-name dictionary and a named entity grammar for the concerned language. The dynamic name acquisition mechanisms implemented are classical and have been described in detail in [11].

4.4 Resource Sharing

While developing the system for different Indo-European languages, we saw that resources could be shared by different languages. For example, proper name dictionaries for French and English are very similar. One has just to remove entries from the English dictionary that would be too ambiguous in
French. A large part of the grammar can also be re-used, provided that the grammar rules are carefully checked and appropriate modifications are made (list of trigger words, etc.). Of course, these resources must be completed to properly cover the new language and/or the new domain. The same approach seems to be valid for other Romance languages (Italian, Spanish). For Germanic and Slavic languages, dictionaries must be modified to take into account inflectional forms. A large amount of work is then needed to modify and adapt dictionaries first developed for English (adding an inflectional code to each word; this code is language-dependent). The approach has not been investigated for non-Indo-European languages.

4.5 Evaluation

The system is under implementation. A complete evaluation is therefore impossible, but we present in this section some first results.

Overall Performances
For the moment, only the English and the French systems have been intensively tested. Their performance is comparable to that of systems having participated in the MUC conferences (P&R is the combined value of precision and recall [12]).

Table 1. Performances on the MUC-6 corpus [12]

                 Recall   Precision   P&R
  BBN             .98       .98       .98
  SRA             .97       .99       .98
  NYU             .94       .99       .96
  U. Sheffield    .84       .96       .90
  Our system      .86       .95       .90
Their performance has also been tested on different corpora, and it appears that these hybrid systems are less sensitive to corpus or domain changes than classical rule-based systems [11].

Other Experiments
The developed systems are systematically tested on the Monde Diplomatique corpus (when available!), a multilingual international journal published in 10 languages on the web. We hope to achieve, for most of the other languages under implementation, results similar to or better than the ones obtained for French and English. This multilingual named entity recogniser is already used in a wider project concerning corpus alignment. The idea is to use cognates and named entities as cues for sentence alignment.
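The interplay of the gazetteer, the trigger words and the revision mechanism of Sect. 4.2 can be illustrated with the small sketch below. It is a deliberate simplification (the real systems are finite-state grammars built with Intex/Unitex plus ML-based acquisition); only the Washington example and the idea of trigger-word disambiguation are taken from the text, the rest is assumed for illustration.

GAZETTEER = {"Washington": "LOCATION", "Paris": "LOCATION"}
PERSON_TRIGGERS = {"Mr.", "Mrs.", "Dr."}

def tag(tokens):
    """First pass: gazetteer lookup, possibly corrected by a trigger word on the left."""
    tagged = []
    for i, tok in enumerate(tokens):
        label = GAZETTEER.get(tok)
        if label and i > 0 and tokens[i - 1] in PERSON_TRIGGERS:
            label = "PERSON"          # the trigger word disambiguates the sequence
        tagged.append((tok, label))
    return tagged

def revise(tagged):
    """Second pass: if a name was seen as a PERSON anywhere, revise its other occurrences."""
    persons = {tok for tok, label in tagged if label == "PERSON"}
    return [(tok, "PERSON" if tok in persons else label) for tok, label in tagged]

tokens = "Mrs. Washington arrived . Washington said the decision was final .".split()
for token, label in revise(tag(tokens)):
    if label:
        print(token, label)
# both occurrences of Washington end up tagged as PERSON, as described in Sect. 4.2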
Result Overview
This section presents some screenshots of the output obtained for different languages, such as English, Arabic and Russian (screenshots not reproduced here).
5 Semantic Annotations and Other Outputs

The system currently produces various kinds of output, for example:
• XML (/HTML) tagged texts for named entity highlighting in texts;
• an event database for the analysis of the AFP newswire (a new template is produced for each new event);
• a dynamic knowledge base for gene interaction (a query-able knowledge base made of Prolog-like terms):

  activation(1.28,Dfd)
  activation(5-HT1A,C)
  activation(ac,E)
  activation(Ac13E,G)
  activation(Dfd,1.28)
  interaction(2R-F,mys)
  interaction(2R-L,mys)

Fig. 5. A part of the knowledge base generated from the analysis of Flybase
The range of performance is generally located between 60 and 80 P&R2. However, it is possible to semi-automatically adapt the system so that precision is very high. This point is crucial to produce high quality data for subsequent processing. The system then has a lower recall than classical IE tools (recall between .30 and .50; in the genomics domain, we often see a precision above .95 with a recall of .15, which is not a problem as such since genomics databases are highly redundant; of course, an effort is made to produce data with the same precision but with a higher recall).
2 P&R is the harmonic mean of recall and precision. This metric is classical for measuring the performance of filtering and extraction systems.
6 Deriving Ontologies from Texts

Once the IE system is adapted to produce high quality annotations, it is possible to directly generate semantic data for the semantic web in RDF: it is only necessary to slightly change the output format. The following figure shows an excerpt of an RDF schema coding information concerning gene interaction:
  <subissant rdfs:Resources="#trx" />
  <article rdfs:Resources="#http://www.flybase.indiana.edu"/>

Fig. 6. Part of the RDF resource produced from the analysis of Flybase
Even if IE systems are not yet directly able to process large amounts of text from the Internet, it is demonstrated that they can easily be applied to large domains on the Web. Two strategies are possible:
• Batch processing of large amounts of text. This approach is possible with data such as Flybase. The whole database can be downloaded and processed to automatically derive and generate new kinds of structured data.
• Processing of reduced texts on the fly. This strategy is applied in production services. For example, a news agency like AFP uses Information Extraction to automatically tag texts before putting them on the Web. This task was previously manual and subjective. IE annotation tools make it possible to produce more structured and more systematic databases.
IE thus appears very appropriate for producing RDF and other annotations for the semantic web. These RDF schemas can be considered as pieces of structured information. Gathered on a large scale, they produce a partial ontology of the domain [13]. This experiment has been partially carried out on Flybase, for knowledge annotation, normalization and structuring. The quality of an automatic extraction process is not sufficient for direct use (precision and recall are generally located between 0.3 and 0.5). It is thus necessary to manually validate these results in order to produce usable data.
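Changing the output format into RDF, as described above, essentially amounts to serialising the extracted facts with a fixed schema. The sketch below turns Prolog-like gene-interaction facts of the kind shown in Fig. 5 into RDF/XML fragments in the spirit of Fig. 6; the property and namespace names (bio:, rdf:Description) are illustrative assumptions, not the schema actually used for Flybase.

from xml.sax.saxutils import quoteattr

FACTS = [("activation", "Dfd", "1.28"), ("interaction", "2R-F", "mys")]
SOURCE = "http://www.flybase.indiana.edu"

def fact_to_rdf(predicate, agent, target):
    """Serialise one extracted fact as a small RDF/XML description (illustrative schema)."""
    return "\n".join([
        '<rdf:Description rdf:about=%s>' % quoteattr("#%s-%s-%s" % (predicate, agent, target)),
        '  <bio:predicate>%s</bio:predicate>' % predicate,
        '  <bio:agent rdf:resource=%s />' % quoteattr("#" + agent),
        '  <bio:target rdf:resource=%s />' % quoteattr("#" + target),
        '  <bio:source rdf:resource=%s />' % quoteattr(SOURCE),
        '</rdf:Description>',
    ])

for fact in FACTS:
    print(fact_to_rdf(*fact))

A manual validation step, as noted above, would still be applied before publishing such fragments.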
7 Related Work

Many authors have shown how it is possible to automatically derive knowledge from texts using syntactic regularities. For example, Grefenstette [14] proposed to automatically acquire semantic classes from texts with an unsupervised model. Reference [3] presents an experiment in learning concepts from texts: Riloff uses lists of nouns representing general concepts (seeds) and a method based on co-occurrence detection to augment these lists into concepts. The augmented lists are checked by the expert, who only retains nouns representing actual concepts. Our approach, like that of Riloff, includes a phase of manual validation: this is the only realistic approach to obtain high quality information extraction and usable ontologies from information processing.
8 Conclusion

In this paper we have shown that a versatile IE system is very appropriate for automatically analyzing unstructured texts from the web and producing semantic annotations. Some research still needs to be done to produce more robust IE tools that will be able to deal with various kinds of texts. In particular, it is necessary to mix the NLP approach with wrappers to make good use of semi-structured texts [15]. These systems should somewhat change the face of the Web. Given that more and more structured and semantically annotated data will be available, Question Answering systems should give more accurate answers to user requests, for example. In this sense, IE systems allow the semantics of the Web to be really extracted and structured.
References
1. W3C. 1999. Resource Description Framework (RDF) Model and Syntax, W3C Recommendation, 22 Feb. 1999.
2. Pazienza M.T. (ed.) 1997. Information extraction (a multidisciplinary approach to an emerging information technology), Springer Verlag (Lecture Notes in Computer Science), Heidelberg, Germany.
3. Riloff E. 1996. “Automatically generating extraction patterns from untagged text”. In Proceedings of the 13th International Conference on Artificial Intelligence (AAAI’96), Portland, pp. 1044–1049.
4. Silberztein M. 1993. Dictionnaires électroniques. Masson, Paris.
5. Introduction to Information Extraction Technology. 1999. Tutorial, International Joint Conference on Artificial Intelligence’99. Stockholm, Sweden (available at: http://www.ai.sri.com/~appelt/ie-tutorial/)
6. Bikel D., Miller S., Schwartz R. and Weischedel R. 1997. Nymble: a high performance learning name-finder. In Proceedings of the 5th ANLP Conference, Washington, USA.
7. Collins M. and Singer Y. 1999. Unsupervised models for named entity classification. In Proceedings of EMNLP/WVLC, 1999, MA, pp. 189–196.
8. Bechet F., Nasr A., Genet F. 2000. Tagging Unknown Proper Names Using Decision Trees. In Proceedings of the 38th ACL Conference, Hong-Kong, pp. 77–84.
9. Mikheev A., Moens M. and Grover C. (1999) Named Entity recognition without gazetteers. In Proceedings of the Annual Meeting of the European Association for Computational Linguistics EACL’99, Bergen, Norway, pp. 1–8.
10. Mooney R. 1993. Induction over the unexplained: using overly general domain theories to aid concept learning, Machine Learning, 10:79.
11. Poibeau T. and Kosseim L. 2001. Proper-name Extraction from Non-Journalistic Texts. Proceedings of the 11th Conference Computational Linguistics in the Netherlands, Tilburg, Netherlands, Rodopi.
12. MUC-6. 1995. Proceedings of the Sixth Message Understanding Conference (DARPA), Morgan Kaufmann Publishers, San Francisco.
13. Poibeau T., Arcouteil A. and Grouin C. 2002. “Recycling an Information Extraction system to automatically produce Semantic Annotations for the Semantic Web”. Proceedings of the semantic annotation workshop, during the European Conference on Artificial Intelligence (ECAI 2002), Lyon, France.
14. Grefenstette G. “Sextant: Exploring unexplored contexts for semantic extraction from syntactic analysis”. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics (ACL’92), Newark, 1992, pp. 324–326.
15. Muslea I. 1999. Extraction patterns for Information Extraction tasks: a survey, AAAI’99 (cf. http://www.isi.edu/~muslea/RISE/ML4IE/)
16. Asahara M., Matsumoto M. (2000) “Extended Models and Tools for High-performance Part-of-Speech Tagger”. In Proceedings of Coling’2000, Saarbrücken, Germany, pp. 21–27.
An Open Platform for Collecting Domain Specific Web Pages and Extracting Information from Them Vangelis Karkaletsis and Constantine D. Spyropoulos NCSR Demokritos, Institute of Informatics & Telecommunications, 15310 Aghia Paraskevi Attikis, Athens, Greece {vangelis, costass}@iit.demokritos.gr iit.demokritos.gr/skel Abstract. The paper presents a platform that facilitates the use of tools for collecting domain specific web pages as well as for extracting information from them. It also supports the configuration of such tools to new domains and languages. The platform provides a user friendly interface through which the user can specify the domain specific resources (ontology, lexica, corpora for the training and testing of the tools), train the collection and extraction tools using these resources, and test the tools with various configurations. The platform design is based on the methodology proposed for web information retrieval and extraction in the context of the R&D project CROSSMARC.
1 Introduction

The growing volume of web content in various languages and formats, along with the lack of structured information and the diversity of information, has made information and knowledge management a real challenge in the effort to support the information society. Enabling large-scale information extraction (IE) from the Web is a crucial issue for the future of the Internet. The traditional approach to Web IE is to create wrappers, i.e. sets of extraction rules, either manually or automatically. At run time, wrappers extract information from unseen collections of Web pages of known layout and fill the slots of a predefined template. The manual creation of wrappers presents many shortcomings due to the overhead of writing and maintaining them. On the other hand, the automatic creation of wrappers (wrapper induction) also presents problems, since re-training of the wrappers is necessary when changes occur in the formatting of the targeted Web site or when pages from a “similar” Web site are to be analyzed. Training an effective site-independent wrapper is an attractive solution in terms of scalability, since any
domain-specific page could be processed without relying heavily on the hypertext structure. The collection of the application-specific web pages which will be processed by the wrappers is also a crucial issue. A collection mechanism is necessary for locating the application-specific web sites and identifying the interesting pages within them. The design and development of web page collection and extraction systems needs to consider requirements such as enabling adaptation to new domains and languages, facilitating maintenance for an existing domain, providing strategies for effective site navigation, ensuring personalized access, and handling structured, semi-structured or unstructured data. The implementation of a web page collection and extraction mechanism that effectively addresses these important issues was the motivation for the R&D project CROSSMARC1, which was partially funded by the EC. CROSSMARC work resulted in a system for web information retrieval and extraction which can be trained for new applications and languages, and a customization infrastructure that supports configuration of the system to new domains and languages. Based on the methodology proposed in CROSSMARC, we started the development of a new platform to facilitate the use of collection and extraction tools as well as their customization. The platform provides a user-friendly interface through which the user can specify the domain-specific resources (ontology, lexica, corpora for the training and testing of the tools), train the collection and extraction tools using these resources, and test the tools with various configurations. The current version of the platform incorporates mainly CROSSMARC tools for the case studies in which it is being tested. However, it also enables the incorporation of new tools due to its open architecture design. The paper first outlines the CROSSMARC work in relation to other work in the area. It then presents the first version of the platform, as well as some initial results from its use in case studies.
2 Related Work

The collection of domain-specific web pages involves the use of focused web crawling and spidering technologies. The motivation for focused web crawling comes from the poor performance of general-purpose search engines, which depend on the results of generic Web crawlers. The aim is to adapt the behavior of the search engine to the user requirements. The term “focused crawling” was introduced in [1], where the system presented starts with a set of representative pages and a topic hierarchy and tries to find more instances of interesting topics in the hierarchy by following the links in the seed pages. Another interesting approach to focused crawling is adopted by the InfoSpiders system [4], a multi-agent focused crawler which uses as starting points a set of keywords and a set of root pages.
1
http://www.iit.demokritos.gr/skel/crossmarc
The crawler implemented in CROSSMARC involves three different crawlers, which exploit topic hierarchies, keywords from domain ontologies and lexica, and a set of representative pages [8]. While in focused crawling the aim is to adapt the behavior of the search engine to the requirements of a user, in site-specific spidering the spider navigates in a Web site, following best-scored-first links. Each Web page visited is evaluated, in order to decide whether it is really relevant to the topic, and its hyperlinks are scored, in order to decide whether they are likely to lead to useful pages. Therefore, site-specific spidering involves two decision functions: one which classifies Web pages as being interesting (e.g. laptop offers) or not, and one that scores hyperlinks according to their potential usefulness (a minimal sketch of these two functions is given at the end of this section). The input to the first decision function is a Web page visited by the spider and its output is a binary decision. This is a typical text classification task. Various machine learning methods have been used for constructing such text classifiers; an up-to-date survey of such approaches is provided in [6]. In CROSSMARC we examined a large number of classification approaches in order to find the most appropriate one for each domain and language. The second decision function in site-specific spidering is a regression function, i.e., the input to the function is the hyperlink, together with its anchor and possibly surrounding text, and the output is a score corresponding to the probability of reaching a product page quickly through this link. Like classification, a variety of machine learning methods is available for learning regression functions. However, in contrast to text classification, the task of hyperlink scoring has not been studied extensively in the literature. Most of the work on scoring and ordering of links refers to Web-wide crawlers, rather than site-specific spiders, and is based on the popularity of the pages pointed to by the links that are being examined. This approach is inappropriate for the spider implemented in CROSSMARC. The only really relevant work identified in the literature is [5], which uses a type of simplified reinforcement learning in order to score the hyperlinks met by a Web spider. A reinforcement learning link scoring methodology was also examined in CROSSMARC and was compared against a rule-based methodology. Concerning information extraction from web pages, a number of systems have been developed to extract structured data from web pages. A recent survey of existing web extraction tools is found in [3], where a classification of these tools is proposed based on the technologies used for wrapper creation or induction. According to [3], tools can be classified in the following categories:
• Languages for wrapper development: these are languages designed to assist the manual creation of wrappers.
• HTML-aware tools: these tools convert a web page into a tree representation that reflects the HTML tag hierarchy. Extraction rules are then applied to the tree representation.
Fig. 1. Classification of web extraction tools (taken from [3]) and CROSSMARC position
• Wrapper Induction tools: these tools generate delimiter-based rules relying on page formatting features and not on linguistic ones. They present similarities with the HTML-aware tools.
• NLP-based tools: these tools employ natural language processing (NLP) techniques, such as part-of-speech tagging and phrase chunking, to learn extraction rules.
• Ontology-based tools: these tools employ a domain-specific ontology to locate ontology instances in the web page, which are then used to fill the template slots.
CROSSMARC employs most of the categories of web extraction tools presented in [3] (see Fig. 1). It uses:
• Wrapper Induction (WI) techniques in order to exploit the formatting features of the web pages.
• NLP techniques to exploit linguistic features of the web pages, enabling the processing of domain-specific web pages in different sites and in different languages (multilingual, site-independent).
• Ontology engineering to enable the creation and maintenance of ontologies, language-specific lexica as well as other application-specific resources.
Details on the CROSSMARC extraction tools are presented in [2]. Other relevant publications can be found at the project's web site.
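The two decision functions of site-specific spidering discussed above can be sketched as a page classifier and a link scorer driving a best-scored-first traversal. This is not the CROSSMARC spider: the keyword-based classifier and the scoring heuristic are placeholders for the machine-learning models actually trained in the project, fetch() is left abstract, and the tiny in-memory site replaces real HTTP access.

import heapq

DOMAIN_KEYWORDS = {"laptop", "notebook", "processor", "price"}

def classify_page(text):
    """Decision function 1: is the visited page interesting (e.g. a laptop offer)?"""
    return len(set(text.lower().split()) & DOMAIN_KEYWORDS) >= 2

def score_link(anchor_text):
    """Decision function 2: how likely is this link to lead quickly to a useful page?"""
    words = set(anchor_text.lower().split())
    return len(words & DOMAIN_KEYWORDS) / (1 + len(words))

def spider(start_url, fetch, max_pages=50, threshold=0.0):
    """Best-scored-first traversal of a single site.
    fetch(url) is assumed to return (page_text, [(link_url, anchor_text), ...])."""
    frontier = [(-1.0, start_url)]                 # max-heap via negated scores
    visited, interesting = set(), []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        text, links = fetch(url)
        if classify_page(text):
            interesting.append(url)
        for link_url, anchor in links:
            score = score_link(anchor)
            if link_url not in visited and score >= threshold:
                heapq.heappush(frontier, (-score, link_url))
    return interesting

SITE = {
    "/": ("welcome to our shop", [("/laptops", "laptop offers"), ("/about", "about us")]),
    "/laptops": ("laptop notebook price list with processor details", [("/about", "company")]),
    "/about": ("company history", []),
}
print(spider("/", lambda url: SITE[url]))          # ['/laptops']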
3 The Platform
Fig. 2. System’s agent based architecture
CROSSMARC work resulted in a core system for web information retrieval and extraction which can be trained for new applications and languages, and a customization infrastructure that supports configuration of the system to new domains and languages. The core system implements a distributed, multi-agent, open and multilingual architecture, which is depicted in Fig. 2. It involves components for the identification of interesting web sites (focused crawling), the location of domain-specific web pages within these sites (spidering), the extraction of information about product/offer descriptions from the collected web pages, and the storage and presentation of the extracted information to the end-user according to his/her preferences. The infrastructure for configuring to new domains and languages involves: an ontology management system for the creation and maintenance of the ontology, the lexicons and other ontology-related resources; a methodology and a tool for the formation of the corpus necessary for the training and testing of the modules in the spidering component; and a methodology and a tool for the collection and annotation of the corpus necessary for the training and testing of the information extraction components.
Based on this work, we started the development of a platform that will enable the integration, training and testing of collection and extraction tools (such as the ones developed in CROSSMARC) under a common interface. The experience of building three different applications using CROSSMARC tools significantly assisted the platform design. These applications concerned the extraction of information from:
• laptop offers in e-retailers' web sites (in four languages),
• job offers in IT companies' web sites (in four languages),
• holiday packages in the sites of travel agencies (in two languages).
According to the CROSSMARC methodology, the building of an application involves two main stages.
Fig. 3. Ontology tab: invoking the ontology management system
The first stage concerns the creation of the application-specific resources using the customization infrastructure, whereas the second stage concerns the training of the integrated system using the application-specific resources, and the system configuration.
The first stage is realized, in our platform, by the "Ontology" and "Corpora" tabs. Through the "Ontology" tab (see Fig. 3), the user can invoke an ontology management system in order to create or update the domain-specific ontology, the lexicons under the domain ontology, the important entities and fact types for the domain, and the user stereotypes' definitions according to the ontology. In the current version, the ontology management system of CROSSMARC is used. The "Ontology" tab also enables the user to specify the location of the ontology-related resources he/she wants to use in the next steps of the application building (see Fig. 4).
Through the "Corpora" tab the user can perform several tasks. The user can invoke the Corpus Formation Tool (CFT), which helps users build a corpus of positive and negative pages with respect to a given domain (see Fig. 5). This corpus is then used for the training and testing of the "Page Filtering" component of the spidering tool. In addition, the user can specify the folder(s) where the corpora for the training and testing of the "Information Extraction" components are stored, and also invoke the annotation tool.
Fig. 4. Ontology tab: specifying the ontology related resources
The current version of the platform employs an annotation tool provided by the Ellogon language engineering platform2 of our laboratory. It also supports the use of the CROSSMARC Web annotation tool [7].
The second processing stage is realized by the "Training" and "Extraction" tabs. Through the "Training" tab (see Fig. 6), the user can invoke the machine-learning-based training tools for the "Page Filtering", "Link Scoring" and "Information Extraction" components (an illustrative sketch of such a training step is given at the end of this section). In particular, in the case of "Information Extraction", training involves two separate modules, the "Named entity recognition & classification – NERC" module and the "Fact extraction – FE" module. The current version of the platform employs the Ellogon-based NERC and FE modules developed by our laboratory. The platform can also support the use of the other NERC and FE training tools developed in the context of CROSSMARC, since they all share common I/O specifications. Through the "Extraction" tab (see Fig. 7), the user can configure and test the "Crawling", "Spidering" and "Information Extraction" components. In the case of "Crawling", the user can set the starting points for the crawler by editing the corresponding configuration file. In a similar way, a different crawler can be incorporated and configured according to the specific domains. A new crawler is currently under development and will be tested through the platform in a future case study. In the case of "Spidering", the user can select the model for page filtering and link scoring (machine-learning-based or heuristics-based), edit the heuristics-based model, set a threshold for link scoring, and adjust several more advanced options.
http://www.ellogon.org
Fig. 5. Corpora tab: invoking the Corpus Formation Tool
The user can test the components with various configurations, view the results and decide on the preferred configuration. Concerning "Information Extraction", the user can test the NERC and FE components separately, and configure the demarcation components. In the current version, the platform supports only the NERC component. It must be noted that the outcome of the platform use is not necessarily a complete web content collection and extraction system. As shown in the case studies section, the platform user can build a crawler for a new domain, a collection system (crawler and spider), a named entity recognition system, or an information extraction system. The outcome depends on the specific task needs and the domain.
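As a stand-in for the machine-learning training invoked from the "Training" tab, the sketch below fits a page filtering classifier from a corpus of positive and negative pages such as the one produced with the Corpus Formation Tool. It is not the CROSSMARC training tool: the directory layout (corpus/positive, corpus/negative), the choice of TF-IDF with logistic regression and all parameters are illustrative assumptions.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Assumed layout: corpus/positive/*.html (relevant pages), corpus/negative/*.html (irrelevant ones).
corpus = load_files("corpus", encoding="utf-8", decode_error="ignore")
X_train, X_test, y_train, y_test = train_test_split(
    corpus.data, corpus.target, test_size=0.3, random_state=0)

model = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test), target_names=corpus.target_names))

Any classifier exposing the same fit/predict interface could be swapped in, which mirrors the platform's goal of testing several configurations and keeping the best one.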
4 Case Studies

The current version of the platform was used to build several applications. Some of these applications are presented below, grouped according to the different tasks.
The first group of applications involves the development of crawlers for an information filtering task. More specifically, the task was to develop crawlers for specific topics (English and Greek were covered) that would return lists of web sites for these topics. These lists are then used to train an information filtering system. Examples of topics include web sites that provide a service to communicate (chat) with other users in real time, web sites that provide e-mail services (send/receive e-mail messages), sites with job offers, etc. In these cases, the "Extraction" tab of the platform was used to configure the starting points of the crawler, test it and find the best configuration for each topic.
Another group of applications concerns the development of systems collecting web pages for specific domains and languages. An example domain is personal web pages of academic staff in university departments (Greek pages were covered). Such applications involve the training of both the crawling and the spidering components using the platform functionalities. More specifically, the "Ontology" tab is used for creating the domain-specific ontology and lexica, the "Corpora" tab for creating the corpus for the training of page filtering, the "Training" tab for training the page filtering and link scoring components, and the "Extraction" tab for configuring and testing the crawling and spidering components.
A third group of applications concerns the development of named entity recognition systems for specific domains and languages, which requires the collection and annotation of the necessary corpus and the training and testing of the system. In a similar way, information extraction systems can be developed.
The final group of applications integrates the collection and extraction mechanisms, as was the case for the CROSSMARC domains. The platform, in its current status, does not support the development of such integrated applications.

Fig. 6. Training tab: invoking the training tool for page filtering

Fig. 7. Extraction tab: configuring the spidering component (advanced options)
5 Concluding Remarks

The CROSSMARC project implemented a distributed, multi-agent, open and multilingual architecture for web retrieval and extraction, which integrates several components based on state-of-the-art AI technologies and commercial tools. Based on this work, we are developing a platform that enables the integration, training and testing of collection and extraction tools, such as the ones developed in CROSSMARC. A first version of this platform is currently being tested in several case studies for the development of focused crawlers, spiders, and information extraction systems. The current version employs mainly CROSSMARC tools. However, due to its open design, other tools have also been employed and more will be integrated and tested in the near future.
References
1. Chakrabarti S., van den Berg M.H., Dom B.E.: Focused Crawling: a new approach to topic-specific Web resource discovery. Proceedings of the 8th International World Wide Web Conference, Toronto, Canada (1999)
2. Karkaletsis V., Spyropoulos C.D., Grover C., Pazienza M.T., Coch J., Souflis D.: A Platform for Cross-lingual, Domain and User Adaptive Web Information Extraction. Proceedings of the European Conference on Artificial Intelligence (ECAI), Valencia, Spain (2004) 725–729
3. Laender A., Ribeiro-Neto B., da Silva A., Teixeira J.: A Brief Survey of Web Data Extraction Tools, ACM SIGMOD Record, vol. 31(2) (2002)
4. Menczer F., Belew R.K.: Adaptive Retrieval Agents: Internalizing Local Context and Scaling up to the Web. Machine Learning, 39(2/3) (2000) 203–242
5. Rennie J., McCallum A.: Efficient Web Spidering with Reinforcement Learning. Proceedings of the 16th International Conference on Machine Learning (ICML99) (1999)
6. Sebastiani F.: Machine learning in automated text categorization. ACM Computing Surveys, 34(1) (2002)
7. Sigletos G., Farmakiotou D., Stamatakis K., Paliouras G., Karkaletsis V.: Annotating Web pages for the needs of Web Information Extraction applications. Proceedings of the 12th International WWW Conference (Poster Session), Budapest, Hungary (2003)
8. Stamatakis K., Karkaletsis V., Paliouras G., Horlock J., Grover C., Curran J.R., Dingare S.: Domain-Specific Web Site Identification: The CROSSMARC Focused Web Crawler. Proceedings of the 2nd International Workshop on Web Document Analysis (WDA 2003), Edinburgh, UK (2003)
Extraction of the Useful Words from a Decisional Corpus. Contribution of Correspondence Analysis
Mónica Bécue-Bertaut1, Martin Rajman2, Ludovic Lebart3, and Eric Gaussier4
1 Universitat Politècnica de Catalunya, UPC, Barcelona, Spain
[email protected]
2 École Polytechnique Fédérale de Lausanne, EPFL, Lausanne, Switzerland
[email protected]
3 École Nationale Supérieure des Télécommunications, ENST, Paris, France
[email protected]
4 Xerox Research Centre Europe, Grenoble, France
[email protected]
Abstract. In the framework of the JuriSent case study, carried out within the European NEMIS thematic network, we analyze the contribution of text mining techniques to improving the consultation of jurisprudence textual databases. We mainly focus on correspondence analysis (CA) techniques, but also provide some insights on similar visualization techniques, such as self-organizing maps (Kohonen maps), and review the potential impact of various Natural Language pre-processing techniques. CA is described in more detail, as is its use in all the steps of the analysis. A concrete example is provided to illustrate the value of the results obtained with CA techniques for an enhanced access to the studied jurisprudence corpus.
1 Objectives of the JuriSent Case Study Following a long tradition, the documentation service of the Spanish judiciary system is currently building up an electronic textual data base which collects all the sentences of the collegiate courts. This base will serve multiple purposes, in particular allowing a more efficient consultation by the lawyers who need to access jurisprudence and/or the diverse interpretations of the laws. In this context, text mining techniques face an interesting challenge. In the framework of the JuriSent case study carried out within the European NEMIS thematic network, our goal was to study the potential contribution of correspondence analysis (CA) and some related methods to improving the existing tools for legal data base consultation. In particular, we aimed at:
• testing the contribution of advanced text mining techniques (in our case correspondence analysis techniques and self-organizing maps) to automatically extract knowledge from large collections of judicial sentences; • evaluating the usability of the extracted knowledge to improve the access to information stored in jurisprudential data bases; • validating the selected text mining techniques on a real corpus of legal sentences relative to prostitution. The general idea behind the use of CA techniques for an improved access to legal data relies on the fact that any document collection (and, in particular, a legal data base) can be seen as a double-entry table crossing documents (in our case, legal sentences) and words. By identifying the favored associations between specific sentences and specific words, CA then allows for the symmetric extraction of: • the groups of words used to express a certain underlying legal reasoning; • the sentences that are most related to a given legal reasoning. These extracted associations are then used to make jurisprudential search easier. The structure of this contribution is the following: Sect. 2 describes the data used for the case study. Sect. 3 explains the goals of the lawyers when consulting jurisprudence data bases, goals which are characteristic of the current approach used with professional legal data bases. Sect. 4 presents CA and its properties in the textual data analysis context, and Sect. 5 considers various pre-processing aspects. The results obtained are presented in Sect. 6. Finally, the conclusion summarizes the contribution of the proposed methodology and mentions some further improvements.
2 The Data Used for the Case Study The data used for the case study consists of a corpus of 430 legal sentences issued by the Spanish Tribunal Supremo (Supreme Court) and relative to prostitution offences, from 1979 to 1996. In the following, by the term sentence, we refer to the document published by the court at the end of a trial. This document contains the verdict, as well as the other parts concerning the motivation of the verdict. The indicated time range (1979–1996) has been chosen because of its relation to the predictable stability of the legal norm: the new Spanish constitution was voted in December 1978, and a very important reform of the criminal code took place at the end of 1996. The corpus used was extracted from the professional sentence data bases published on electronic support by the Aranzadi editorial. The corpus was carefully normalized, with the goal of guaranteeing correct spellings and also of standardizing the wordings. For example, a unique form should be used for entities originally referred to by different expressions or abbreviations.
Fig. 1. The structure of a sentence: head (summary: prostitution; used for the analysis), legal support (not used for the analysis), judge-redactor (used for the analysis), factual bases (not used for the analysis), verdict (not directly used for the analysis), legal bases (used for the analysis)
As shown in Fig. 1 above, legal sentences consist of several distinct parts: • a head, entered by the data base manager to summarize the sentence and ease its retrieval; • the factual bases, which describe, in a neutral way, the facts that have been observed at every step of the instruction; • the verdict, providing the precise wording of the condemnation; in the case of second-instance courts, the verdict can be the non-acceptance of the appeal or, in the case of acceptance, a new condemnation; • the legal bases, which justify the verdict by providing clear links between the laws and the facts.
3 Consulting Jurisprudence Databases: Goals and Current Approaches 3.1 A Sentence as a Decision Process The function of the judge is to make a decision. This decision is taken within the existing legal framework, but the judge of course keeps an interpretation margin. However, to exploit this margin, he/she has to develop a decision process that relates the facts to the decision, following a normative model: contradictory analysis of the arguments, references to the legal texts, and justification of the choices. The reasoning goes from the facts to the offence qualification (and the resulting verdict). The verdict must therefore be motivated, legally justified, and expressed in such a way that the links between laws and facts are clearly established. Moreover, at the Supreme Court level, the decision can be especially complex, as it has to consider several, possibly contradictory options, sometimes only supported by incomplete information. The arguments (called legal standards, or simply standards, in the rest of this contribution) used by the judge to develop his/her reasoning constitute
the visible marks of the process used for analyzing the law and making the legal decisions. The legal standards are therefore tools which allow the judge to manage and organize information, knowledge and references within the sentence. They play a fundamental functional role and are a constitutive part of the justificatory reasoning, very different from a syllogistic reasoning, used in this context. For example, expressions such as corrupción de menores (corruption of minors), presunción de inocencia (presumption of innocence), convicción (conviction) or establecimiento de la prueba (adduction of evidence), but also more complex combinations of this kind of expression, such as corrupción de menores versus abusos deshonestos (corruption of minors versus dishonest abuse), can be considered as standards. Notice however that a legal standard has no clear definition. It is a rhetorical tool, a part of a reasoning process that aims at relating the analysis of the facts to a legal decision. In other words, it is a part of the legal method used to link all the data concerning the problem to be solved [4]. In our work, we consider the legal standards as complex textual elements, providing essential support to the argumentation. 3.2 What are the Lawyers Looking for When Consulting the Jurisprudence Data Bases? When consulting the jurisprudence data bases, a lawyer searches for previous sentences supporting her/his reasoning, investigates the possible consequences of a certain type of reasoning (for example, always a cassation), or looks for alternative reasoning models. In this perspective, his/her queries can concern not only an offence, but also the different arguments that can be associated with it. Some queries can even be transversal to different offences, related only by a similar reasoning model. 3.3 The Current Approach The current approach to jurisprudence data access is conditioned by the access functionalities provided by the existing professional tools. Usually these tools provide an access relying on (see Fig. 2 below): • a structured, possibly partial, description of the targeted document(s); • the use of a hierarchical thesaurus which can be completed, at the last level, by free text search. The second functionality leads to very good results when the arguments are expressed with the terms used in the thesaurus, and when the user knows exactly the words he/she has to use for the free text search. However, only the sentences corresponding to an exact match will be retrieved. In other words, one of the main limitations of the current, thesaurus-based approach is that it relies on an a priori selected vocabulary.
Fig. 2. Access to jurisprudence data bases with the thesaurus approach: a query combining a subject (criminal), an offence (prostitution), thesaurus terms (e.g. contradiction in proved facts, impossible offence) and a free-text term (pistola, gun), together with an excerpt of a retrieved sentence
One of the main contributions of the approach based on the CA techniques described in this work is that the relevant vocabulary is identified a posteriori, as the one most explanatory of the implicit structure induced on the sentences by the underlying legal reasoning. In the following section, we describe the main characteristics of this approach.
4 Consulting Jurisprudence Databases: An Approach Based on Correspondence Analysis 4.1 Correspondence Analysis as a Method for Statistical Analysis of Texts Correspondence Analysis (CA) is a flexible technique for describing large contingency tables. In the context of textual data analysis, it can be applied to specific contingency tables, the (documents × words) tables [1, 2, 3, 9, 10], indicating in a tabular manner which words appear in which document, and with which frequency. In our case, the goal of the method is to allow for the comparison of the lexical profiles of the sentences (i.e. the profiles consisting of the relative occurrence frequencies of the words appearing in a given sentence), and of the document profiles of the words (i.e. the relative frequency distributions of the words over the different sentences). In technical terms, performing a CA on a contingency table is equivalent to performing a singular value decomposition (SVD) [5] on it, using two specific metrics.
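Before describing these properties further, the following minimal sketch illustrates the kind of input CA operates on, namely the (documents × words) contingency table. It is written in Python with numpy; the toy corpus, the tokenizer and the frequency threshold are illustrative assumptions and do not reproduce the exact pre-processing used in the case study.

import re
from collections import Counter

import numpy as np

# Hypothetical corpus: in the case study, each entry would hold the normalized
# text of one legal sentence (its "legal bases" part).
corpus = [
    "presuncion de inocencia y valoracion de la prueba de cargo",
    "delito de corrupcion de menores y agresion sexual",
    "vulneracion de la presuncion de inocencia del acusado",
]

def tokenize(text):
    """Lowercase a text and split it into word tokens (accented letters kept)."""
    return re.findall(r"[a-záéíóúüñ]+", text.lower())

# Keep only sufficiently frequent words (the study uses thresholds of 10 to 90
# occurrences; the threshold is lowered here because the toy corpus is tiny).
counts = Counter(w for doc in corpus for w in tokenize(doc))
vocab = sorted(w for w, c in counts.items() if c >= 2)
col = {w: j for j, w in enumerate(vocab)}

# N[i, j] = number of occurrences of word j in document (sentence) i.
N = np.zeros((len(corpus), len(vocab)), dtype=float)
for i, doc in enumerate(corpus):
    for w in tokenize(doc):
        if w in col:
            N[i, col[w]] += 1

The table N and the list vocab are reused in the later sketches of this chapter.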
The important point is that CA induces exploitable distances between documents (i.e. sentences) and between terms, as well as weight systems on documents and terms. In the document space, the distance between two documents is the chi-square distance between their lexical profiles. The chi-square distance is a weighted Euclidean distance which gives every word a weight equal to the inverse of its relative frequency in the whole corpus. Similarly, in the word space, the distance between two words is the chi-square distance between their document profiles, i.e. a weighted Euclidean distance which again gives every document a weight equal to the inverse of its relative frequency. Another important aspect of CA is that it offers the possibility to simultaneously visualize the distances between documents and between terms in two-dimensional maps (usually called principal planes). In these maps, two documents are close if they contain words which are close to one another. Notice that the contained words do not need to be identical (exact match), but only close. This fundamental property of CA is in accordance with linguistic properties established in the field of distributional linguistics [7]. Two words are close if they are frequently used in the same documents and/or if they are frequently associated with the same terms. This last property is important because it allows CA to detect not only frequent co-occurrences, but also synonymy relations, and to take advantage of them. In the graphical displays produced by CA, one can superimpose the document and the word maps. This simultaneous representation is valid because of the so-called transition relations that link the coordinates of a document to all the word coordinates, and, symmetrically, the transition relations that link the coordinates of a word to all the document coordinates. This simultaneous representation is helpful for the interpretation of the proximities: the proximity between two sentences can be explained by their use of specific words; the proximity between two words can be explained by their similar distribution over the set of sentences. Notice however that, in general, it is not straightforward to interpret the proximity between a sentence and a word, because they do not belong to the same initial representation space. Nevertheless, for the words and sentences lying in extreme positions with respect to the representation axes, a strong association can still be assumed. This property will be of great importance in the present study.
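To make the above description concrete, here is a hedged numpy sketch of CA computed as an SVD with chi-square metrics; it is an illustrative implementation, not the DTM software actually used for the study, and the function name and the returned quantities are our own choices. Euclidean distances between the returned document (resp. word) principal coordinates reproduce the chi-square distances between the corresponding profiles when all axes are kept, and the shared axes support the simultaneous representation discussed above.

import numpy as np

def correspondence_analysis(N, n_axes=2, eps=1e-12):
    """Correspondence analysis of a contingency table N via SVD with chi-square metrics.

    Returns the row (document) and column (word) principal coordinates on the first
    n_axes axes, together with the eigenvalues and their share of the total inertia.
    """
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                    # correspondence matrix
    r = P.sum(axis=1)                  # row masses (document weights)
    c = P.sum(axis=0)                  # column masses (word weights)

    # Matrix of standardized residuals: this is where the chi-square metric enters.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c) + eps)
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)

    # Principal coordinates: Euclidean distances between rows (resp. columns) in this
    # space reproduce the chi-square distances between their profiles (exactly so
    # when all axes are kept).
    F = (U * sv) / np.sqrt(r + eps)[:, None]      # document coordinates
    G = (Vt.T * sv) / np.sqrt(c + eps)[:, None]   # word coordinates

    eig = sv ** 2
    share = eig / eig.sum()
    return F[:, :n_axes], G[:, :n_axes], eig[:n_axes], share[:n_axes]

# Example with the toy table N built in the previous sketch:
# F, G, eig, share = correspondence_analysis(N, n_axes=2)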
By applying correspondence analysis to the (sentences × words) tables, we will therefore detect the main standards, as described by the most relevant words and sequences of words used to express them. Furthermore, we will be able to identity the sentences which are the most associated to every standard. Strategy for the extraction of groups of associated words and of related sentences The results of CA are produced in the following way: • For each of the axes produced by the analysis, the words lying on extreme positions with respect to each of the axes are identified. When distinguishing the extreme positive and negative coordinates, we therefore obtain two groups of words for each of the axes. The number of axes to be considered depends on the study. Notice that, as the meaning of a word is context dependent, a word can belong to several groups. • Each of the identified groups of words is a candidate for being a standard. The validation of the obtained standard candidates is done by examining the coherence among the words it contains, by looking for their context of use, and by also taking into account the analyst’s knowledge of the legal field. The validated standard candidates might be further reformulated by the analyst in the form of short, synthetic paragraphs expressing their meaning. • For each validated standard, the sentences with extreme positions on the corresponding most associated axis extremities are identified. As mentioned earlier, due to the transition relations on coordinates, these sentences are the most related to the selected standards, and are therefore considered as the most relevant for it. In short, the outcome of CA is the list of the detected standards, with, for each of them, the most relevant words and expressions used to refer to it, and the list of the sentences the most associated to it.
5 Pre-Processing Aspects Important aspects to consider before applying the CA techniques to a collection of documents are the following: • the normalization of the texts: before performing an automatic analysis of a corpus, it is necessary to normalize the texts. In the case of a legal corpus, various specific choices have to be made. These choices mainly concern: – the dates, which have to be expressed in a unique, consistent way; – the abbreviations relative to legal texts and laws; – the mentioned institutions, as the same institution can be referred to by many different expressions. • the influence of the chronology on vocabulary and reasoning (see Sect. 5.1 below);
• the internal heterogeneity of the sentences, as the sentence collection might need to be further divided into more homogeneous parts; • the potential need for linguistic pre-processing of the texts (see Sect. 5.2 below). In the next two sections, we consider in more detail the study of the influence of the chronology and the potential impact of linguistic pre-processing. 5.1 Influence of the Chronology on Vocabulary and Reasoning A priori, we have chosen a stable time period ranging from the vote of the new Spanish constitution (December 1978) to the change of the criminal law (end of 1996). Nevertheless, it remains necessary to check for the existence of a potential lexical shift during this long period and, if necessary, to divide it into more homogeneous sub-periods, i.e. sub-periods sharing a common vocabulary. For that purpose, we used the following methodology: • A (years × words) contingency table is built by merging all the sentences produced in the same year, and a standard CA is applied to this new table, allowing the lexical profiles of the different years to be compared. • As the representation on the principal plane can suffer from various distortions resulting from the projection of high-dimensional data onto a representation plane of low (2) dimensionality, a hierarchical clustering is performed on the set of years, as described by their coordinates on the axes produced by CA. In fact, only some of the first axes might be taken into account, and, in this case, CA acts as a filter on the noise contained in the data ([10], p. 86). Notice also that the set of produced clusters depends on the strength of the clustering (as defined by the level of the cut in the hierarchical tree of clusters produced by the clustering algorithm). A standard method to cut the hierarchical tree is to use a distance index (in our case the Ward index) and to cut the tree at the level where the index undergoes an important variation. • Once the hierarchical clustering is performed, only the produced clusters that exclusively contain consecutive years are considered as potential candidates for interesting sub-periods. The identified sub-period candidates then need to be validated on the basis of the associated most relevant words and the analyst’s available external knowledge. The results obtained with this method are presented in more detail in Sect. 6. 5.2 Potential Impact of Linguistic Pre-Processing In order to assess the importance of Natural Language pre-processing tools in the context of our work, we review in this section the different linguistic components that might be considered. To do so, we follow the standard linguistic division into three consecutive processing levels, respectively dealing with morphology, syntax, and semantics.
Morphology The general goal of morphological analysis is to normalize the different forms (often called morphological variants) of the same “concept” into a single form used as a representative for the different variants. The first step of morphological analysis consists in looking up possible word candidates in available electronic dictionaries, so as to segment the text into elementary units. The second step assigns to each word its possible parts-of-speech, as well as information on number, gender, declension and conjugation, if any. Again, this process usually relies on the availability of such information in electronic form. These first two steps are followed by part-of-speech tagging, which consists in selecting, according to the context of each word, its correct part-of-speech and associated information. For example, the word rate can be a noun or a verb, but in the sequence “The hepatocyte respiratory rate . . . ”, it most certainly is a noun. Once part-of-speech tagging has been performed, it is possible to produce a first normalization of the words by selecting their lemma. For Spanish, this would be the singular form for nouns, the masculine singular form for adjectives, and the infinitive form for verbs. This process (called lemmatization) is used in many textual applications, and has been shown to be useful in Information Retrieval (IR), where the goal is to match different textual units (queries and documents). In such cases, lemmatization allows more accurate statistics to be obtained for the different words, and, since the underlying vocabulary has been reduced, it usually moves similar documents closer to each other. In our work, we want to identify “standards” through the extraction of the main concepts they rely on. In this respect, nouns play a privileged role, and lemmatization would here only normalize singular and plural forms. Notice however that it has frequently been argued that the meaning of nouns can differ when they are used in the singular or plural form. For example, prostitución (singular) would denote the abstract, general concept of prostitution, whereas prostituciones would denote the different forms prostitution can take, such as occasional or regular prostitution, prostitution in bars or in streets, etc. Nevertheless, even though different, the singular and plural forms are closely related, and lemmatization should lead to a first, rough set of potential concepts that may be refined in later stages. In many practical situations, one focuses on a subset of the vocabulary, namely the most frequent words. If the frequency threshold used does not filter out different morphological variants, lemmatization will not change the obtained final vocabulary. It might however change the relations between words or groups of words and the documents they appear in. Based on the above considerations, one possible methodology would be to use lemmatization in a first step, aiming at identifying the main concepts of a collection. Once a first rough set of concepts is identified, the different morphological variants can be considered for a refined analysis. However, as mentioned before, it is not certain that lemmatization will bring additional insights into the collection at hand. Depending on the style and vocabulary
used (quite specific in the legal domain), the benefits of lemmatization might vary and should always be assessed through experimentation. An even more radical approach to word normalization is stemming, which consists in substituting a possible stem for each word form, i.e. a part of the word considered as its morphological root. For example, both prostitución and prostitutas might be conflated into the unique stem prostitu. Stemming can be advantageous when the different words of the same derivational family do bear similar meanings (usually the one conveyed by the stem). However, stemming sometimes hides important semantic differences. With the aforementioned example, documents dealing with the condition or life of prostitutes would be similar to the ones dealing with prostitution in general. Since most documents in the present context are related to prostitution or prostitutes, such a normalization would not allow us to identify fine-grained distinctions between the involved concepts. The fact that stemming usually serves, as for example in IR, as a recall enhancement procedure [6] would probably prevent us from identifying the more precise concepts we need for our analysis. Syntax Another possibility to derive an enhanced representation of documents is syntactic analysis. This analysis consists in identifying the different grammatical relations that exist between words in a given sentence. Even though complete syntactic analyzers are difficult to build and do not exist for many specialized domains, term extractors could certainly be applied to the legal data. In this section, we call terms textual units that can consist of one or more lexical words (note 1), denoting unambiguous concepts of a domain. Examples of terms in our corpus are: presunción de inocencia, juez de instrucción or delito de corrupción de menores. Candidate terms can be identified through morphosyntactic patterns. For languages relying on a composition of Romance type (as Spanish does), such patterns for terms of length 2 (i.e. containing two lexical words) usually have the general form “Noun Adjective” or “Noun Preposition (Determiner) Noun”, where the determiner is optional in the second pattern (candidates without determiners usually display a higher degree of termhood than candidates with determiners). Terms containing more than two lexical words are formed through composition of terms of length 1 and 2 (delito de corrupción de menores results from the composition of delito with corrupción de menores). Using such patterns, it is possible to extract, from a given document, all the candidate terms, and the corresponding computational cost is linear in the length of the document. The obtained term candidates can then be used as indexes for the documents.
Note 1: A lexical word is a word that is either a noun, a verb, an adjective or an adverb. Lexical words are the most important for the expression of the meaning, and are usually opposed to grammatical words (prepositions, determiners, etc.), which essentially play a grammatical role.
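As an illustration of such pattern-based extraction, the following sketch operates on POS-tagged and lemmatized input; the (form, lemma, pos) representation, the tag names and the closed lists of prepositions and determiners are assumptions made for the example, and any Spanish tagger could supply this input.

def extract_term_candidates(tagged_sentence):
    """Extract length-2 candidate terms matching 'Noun Adjective' or 'Noun Prep (Det) Noun'.

    tagged_sentence is a list of (form, lemma, pos) triples; the lemmas are used to
    normalize singular/plural variants of the same candidate term.
    """
    prepositions = {"de", "del", "en", "para", "por"}
    determiners = {"el", "la", "los", "las", "un", "una"}
    tokens = [(form.lower(), lemma.lower(), pos) for form, lemma, pos in tagged_sentence]
    candidates = []
    for i, (_, lemma, pos) in enumerate(tokens):
        if pos != "NOUN":
            continue
        # Pattern 1: Noun Adjective (e.g. "agresion sexual")
        if i + 1 < len(tokens) and tokens[i + 1][2] == "ADJ":
            candidates.append(f"{lemma} {tokens[i + 1][1]}")
        # Pattern 2: Noun Preposition (Determiner) Noun (e.g. "presuncion de inocencia")
        if i + 2 < len(tokens) and tokens[i + 1][0] in prepositions:
            j = i + 3 if tokens[i + 2][0] in determiners and i + 3 < len(tokens) else i + 2
            if tokens[j][2] == "NOUN":
                candidates.append(" ".join(t[1] for t in tokens[i:j + 1]))
    return candidates

# Example with a hypothetical tagging:
# extract_term_candidates([("presuncion", "presuncion", "NOUN"),
#                          ("de", "de", "ADP"),
#                          ("inocencia", "inocencia", "NOUN")])
# -> ["presuncion de inocencia"]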
It is often argued that the above process has the advantage of providing a more accurate representation of documents. For example, a document dealing with corrupción de menores would be indexed by this exact term, and not by the two a priori unrelated indexes corrupción and menores. However, this argumentation is valid only if the two terms corrupción and menores can appear in the same document without denoting the complete concept corrupción de menores, which, even though possible, is unlikely in a closed domain such as the one under study (note 2). We thus believe that relying on more complex terms would not lead to substantially different results in the present context. It has to be noted however that complex terms could be used to help analysts interpret the results provided by correspondence analysis, since they provide direct access to the concepts of a domain. Semantics Lexical semantics deals with the relations between word meanings. Among these relations, synonymy (two different words have the same meaning), hyperonymy (one word has a meaning encompassing that of another word) and its converse, hyponymy, can be used to relate documents that seem unrelated with respect to their surface form. However, the use of these relations requires semantic resources adapted to the particular domain and collection. Since such resources are not available in the present case, we do not discuss this possibility further, and limit the integration of synonymy to its indirect processing, as implicitly performed by CA.
6 An Example 6.1 Characteristics of the Analyzed Corpus The corpus extracted from the “Aranzadi data base” consists of the “head”, “verdict” and “legal bases” of all the sentences issued by the Spanish Supreme Court between 1979 and 1996 and relative to offences linked to prostitution. The whole corpus sums up to 430 sentences. It was not lemmatized, and had a total length of 507,475 word occurrences, corresponding to 19,376 distinct words. 3,494 words were repeated at least 10 times and covered 93% of the corpus.
Note 2: An additional problem raised by the consideration of complex terms lies in the notion of similarity between terms. Should corrupción de menores and corrupción be considered as similar or not? We have proposed in [6] an approach to indexing and document similarity comparison based on the use of both simple and complex terms. It is not clear however how such an approach would perform in the framework of correspondence analysis.
6.2 Study of the Influence of the Chronology: the Choice of the Homogeneous Time Sub-Periods The High Court is a collegiate court, with 3 judges; nonetheless, only one of them, the judge-rapporteur, is in charge of delivering the opinion of the court. Altogether, 36 judge-rapporteurs were involved in the elaboration of the sentences present in our corpus. Figure 3 shows the temporal distribution of the activity of the 24 judges who wrote at least 10 sentences.
Fig. 3. Temporal distribution of the activity of the judge-rapporteurs between 1979 and 1996 (considering only the judges who wrote at least 10 sentences)
To identify potential sub-periods, we used the methodology proposed in Sect. 5.1. Figure 4 presents the relative positions obtained for the years by applying CA to the (years × words) contingency table. As usual for representations in the principal plane, these positions correspond to the coordinates on the first two axes produced by CA, and the quality of the representation can be measured by the percentage of preserved total inertia. In our case, 32% of the total inertia was preserved.
Fig. 4. The years represented on the principal plane issued from CA (first axis: eigenvalue 0.063, 23% of the inertia; second axis: eigenvalue 0.024, 8.9%)
Notice that, given the high number of words involved, we did not represent them in the plane. The graphical representation provides very interesting information about the proximities between years and offers a global visualization of the temporal evolution. As is usual for CA-based representations of chronological data, the years are approximately located along a parabola, which corresponds to the case of a standard progressive vocabulary renewal. To reach more reliable conclusions about potential temporal vocabulary shifts, we performed a clustering of the years, based on their coordinates on the first 4 axes. The obtained hierarchical tree is represented as a dendrogram in Fig. 5, and indicates that there is a clear main cut in 1988, therefore leading to two main sub-periods: before 1988 and since 1988. The years in the since-1988 sub-period could further be divided into 1988–1992 and 1993–1996. However, after studying the homogeneity of the most relevant words for each of the identified sub-periods, we finally decided to keep only two clusters: 1979–1987 and 1988–1996.
Fig. 5. Hierarchical tree (dendrogram) built up from the distances between years, calculated from the coordinates on the first 4 principal axes; its leaves group the years 1979–1987 on one side and 1988–1996 on the other
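A minimal sketch of this sub-period detection step is given below. It reuses the correspondence_analysis helper from the earlier sketch, assumes scipy is available, and fixes the number of axes and clusters arbitrarily; it is meant only to illustrate the methodology of Sect. 5.1, not to reproduce the exact computations behind Fig. 5.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def period_clusters(N_years, years, n_axes=4, n_clusters=2):
    """Group years by lexical profile: CA on the (years x words) table, then Ward clustering.

    N_years is the contingency table obtained by merging all sentences of the same year;
    years is the parallel list of year labels. Returns a mapping cluster label -> years.
    """
    F, _, _, _ = correspondence_analysis(N_years, n_axes=n_axes)  # from the earlier sketch
    Z = linkage(F, method="ward")                 # hierarchical tree (dendrogram)
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return {int(lab): [y for y, l in zip(years, labels) if l == lab] for lab in set(labels)}

# Only clusters made of consecutive years are kept as sub-period candidates; applied to
# the case-study corpus, this step corresponds to the 1979-1987 / 1988-1996 split reported above.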
Notice that, in the proposed clustering method, the years are distributed in the different clusters only on the basis of their lexical profiles, without taking into account any information about their actual chronology. Therefore, the fact that the method produces clusters containing only contiguous years means that the vocabulary indeed brings strong information about the period in which the sentences were written. In order to provide some external justification of the resulting clustering, we studied the relation between the main temporal cut identified in the corpus and the periods of activity of the different judge-rapporteurs. As illustrated in Fig. 3, we can see that: • between 1985 and 1988, 6 judges withdrew; • in the same period, 4 new judges were appointed; • between 1989 and 1992, 7 new judges were appointed; • 6 judges were active during the major part of the studied period.
These observations led to the hypothesis that the identified lexical shifts might be related to a generational change between 1985 and 1988. Furthermore, other auxiliary methods also provided additional confirmation that the main temporal cut in the corpus is indeed in 1988, although the first signs of change can already be detected in 1985. Finally, Fig. 6 shows the most relevant words of each of the sub-periods, grouped with respect to their syntactic category. 6.3 The Obtained Results The selection of the parts of the sentences to be actually analyzed depends on the quality of the obtained representations. For the results presented in this section, we only took into account the “Legal Bases” part of the sentences produced between 1988 and 1996 (the second sub-period).
Before 1988: over-represented words Nouns: camareras, prostitución, considerandos, hombres, bar, procesada, tercería, procesado, tráfico, favorecimiento, decreto, auxilio, precepto, número, corrupción, precio, modalidades, escándalo, texto, mujeres, entrega, resumen, ponente, clientes, señor, jurisdicción, prostitutas, cooperación, casación, consistencia, omisiones, facilitación, inciso, lacra, resolución, reservado, proxenetismo, huella, delito, improcedencia, corriente, estrago, sujetos, causa, reservado, camarera, lucro, mérito, índole, urgencia, empleo, cantidades, pudor, peligrosidad, matrimonio, comercio, consumiciones, trato, procesados, origen, grado, yacimiento, pisos, moralidad, apartados, vicio, mitad Adjectives: carnal, seguida, criminal, dictada, inmoral, interpuesto, siendo, moral, mentado, legal, empleadas, refundido, marginal, excelentísimo, impugnada, señalada, explotado, relativos, carnales, fáctica, citado, locativa, lícita, impúdicas, delictiva, ética, incardinada, ajenas, perniciosos, venal, mismos, encaminadas, activos, colectiva, organizada, buenas, prostituidas, general, inferiores, viciosa, social, defensiva, acuartelada, terminante Verbs: considerando, resultando, cohabitar, desestimar, comprende, careciendo, revisado, tipifica, solicitaban, explotar, declara, facilitar, quedan, anula, Adverbs: ordinariamente, anteriormente, indiscriminadamente, explicitamente Pronouns: cuyo Abbreviations and references to legal texts or sentences: RJ, 1º, 1963/759, 691/1963, TS, dis_estudiadas, CP, 16, 1981/1517, 3096/1973, 1956/438, 1982/3544, 438, 1973\2255, 1975\2330, 1981\2093, 900, 1009 28, 17, VII, 1981\147, 1981\143, DL Since 1988: over-represented words Nouns: prueba, derecho, declaraciones, inocencia, acusado, presunción, juicio, vulneración, fundamentos, pruebas, cargo, motivo, error, hecho, acusados, registro, fundamento, documentos, valoración, constitución, principio, folio, instrucción, libertad, folios, apreciación, defensa, acta, violación, vía, víctima, fiscal, juzgado, credibilidad, testigos, entrada, denuncia, apoyo, ministerio, testimonio, convicción, testigo, juez, agresión, detención, inmediación, acusada, garantías, existencia, informe, autor, manifestaciones, declaración, documento, proceso, respuesta, base, jurisprudencia, policía, diligencias, relación, inadmisión, intervención, contradicción, secretario, experiencia, letrado, pub, igualdad, impugnación, violencia, atestado, validez, sección, médico, inhabilitación, asistencia Adjectives: oral, probatoria, judicial, constitucional, fundamental, casacional, desestimado, documental, prestadas, plenario, provincial, efectiva, procesal, acusatorio, correlativo, especial, Verbs: ha, véanse, confrontar, dijo, existe, debe, aduce, dice, podido, cabe Adverbs: no, aquí, ni, Pronouns: nos, ello, ti Abbreviations and references to legal texts or sentences: APNDL, CE, RTC, 24, LOPJ, 1978\2836, 2875, 1985\1578, 8375, 2635, 849, 885, LECR, TC, dieciocho, 3/1989, 85, 1989\1352 Fig. 6. List of the characteristic words of the two sub-periods “Before 1988” and “Since 1988”
Therefore, in this step, we analyzed a reduced corpus of 251 sentences, with a total length of 313,523 word occurrences, corresponding to 15,519 distinct words. After the pre-processing steps, the objective was to identify the main standards used in the corpus and to relate them to their most relevant sentences. For this purpose, we applied the methodology presented in Sect. 4.2 to the corpus restricted to the words used at least 20 times (1,398 distinct words, corresponding to 85% of the corpus). The analysis of the histogram of the eigenvalues resulting from the CA suggested that we consider the first 10 axes, i.e. 20 possible groups of words (the extreme left and right side of each axis) and the associated sentences. In order to identify the standards corresponding to each of the groups, we analyzed how the identified words were used in the sentences. To illustrate this type of analysis, we provide here the example of two groups of words: the group lying on the positive end of axis 1 and the group lying on the negative end of axis 3. These two groups refer to the same offence (“corruption of minors”) but concern very different aspects of it, as illustrated by the excerpts given in Fig. 7. In the first case (positive extremity of axis 1), the problem to be solved is the correct qualification of the offence and/or of the associated offences: the judge has to decide whether the offence is “corruption of minors” or “sexual aggression” or, even, in some cases, whether any of these two offences was in fact committed. In the second case (negative extremity of axis 3), the problem only consists in determining whether the offence “corruption of minors” was indeed committed. An additional aid to analyze the results produced by CA is to project the characteristic words on the principal plane.
Axis 1, positive extremity: over-represented words. Standard 1: correct qualification of the offence, mainly discussion between corruption of minors, sexual aggression or dishonest abuses. Words: corrupción, menores, menor, sexual, personalidad, años, actos, abusos, delito, agresión, deshonestos, formación, edad, libertad, tocamientos, persistencia, intensidad, corruptora, agresiones, perversión, sexuales, acciones, conducta, moral (plus further references to law articles).
Axis 3, negative extremity: over-represented words. Standard 6: discussion about the actual commission of the offence “corruption of minors”, mentioning evidence and/or presumed innocence. Words: menor, prueba, niña, alegado, personalidad, convicción, acusado, pruebas, sentencias, juicio, menores, cargo, inocencia, inadmisión, testimonio, tribunal, presunción, probatoria, formación, corrupción, declaraciones, judicial, armas
Fig. 7. Two examples of legal standards, as reformulated by the analyst, and associated with the corresponding most relevant words
Fig. 8. Two dimensional display of words as produced by CA (frequency threshold: 90)
However, in order to deal with a manageable number of words, a much higher frequency threshold had to be considered. In our case, only the words appearing at least 90 times in the corpus were kept, leading to a total of 387 distinct words, representing 223,499 word occurrences corresponding to 73% of the corpus. Figure 8 illustrates the projection of the selected 387 words in the principal plane. Notice that the central area, in which the majority of words are jammed together, would require some zooming to be actually exploited. However, one can clearly see, for example on the right-hand side of Fig. 8, the words menores, agresión, and personalidad used in the above-mentioned analysis. Notice also that the 251 sentences might have been displayed on the same representation. In this case, the analyst could iterate from a word or a group of words to a specific sentence, and vice-versa. Some search engines use similar procedures to perform query expansion, but in the CA case, this does not correspond to a black-box methodology, and can be quite well grounded in a solid underlying theory. Finally, as an example of additional tools that might be used in conjunction with the CA techniques, Fig. 9 presents a Kohonen map, a variety of self-organizing maps [8] derived from the same data set.
Fig. 9. Two dimensional display of words as produced by Kohonen maps on the same data as in Fig. 8 (frequency threshold: 90)
In this figure, each cell of the (10 × 10) grid corresponds to a cluster of words. The topology of the grid is representative of the profile similarities between the words. Words belonging to the same cell have similar profiles, and, to a lesser degree, similar profiles to the words of the neighboring cells. Some of the obtained clusters are similar to the ones produced by CA. However, as the Kohonen map is a non-linear technique, it may take into account more than two axes, which is an undeniable advantage over the displays produced by CA. On the other hand, the simultaneous representation of words and sentences is not at all straightforward with Kohonen maps, which is a clear drawback in our context. Two other drawbacks can be mentioned for the Kohonen maps in our perspective: first, as their underlying algorithms are closely related to the well-known k-means algorithms, they converge towards local minima, generally depending on some initial conditions, therefore not guaranteeing the best possible representations. Furthermore, unlike CA, Kohonen maps do not provide any reliable validation techniques, nor statistical tests.
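For readers who wish to experiment with this kind of display, the following deliberately small, self-contained Kohonen map (SOM) sketch in numpy can be applied to the word profiles or to their coordinates on the leading CA axes; it is not the software used to produce Fig. 9, and the grid size, decay schedule and random initialization are arbitrary assumptions.

import numpy as np

def train_som(X, grid=(10, 10), n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Train a minimal Kohonen self-organizing map on the row vectors of X.

    Returns the (grid_rows, grid_cols, n_features) codebook; each word can then be
    assigned to the cell whose codebook vector is closest to its profile.
    """
    rng = np.random.default_rng(seed)
    rows, cols = grid
    codebook = rng.normal(scale=0.1, size=(rows, cols, X.shape[1]))
    grid_coords = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"))
    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        # Best-matching unit: the cell whose codebook vector is closest to x.
        dists = np.linalg.norm(codebook - x, axis=2)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Linearly decaying learning rate and neighbourhood radius.
        frac = t / n_iter
        lr = lr0 * (1 - frac)
        sigma = sigma0 * (1 - frac) + 0.5
        # Gaussian neighbourhood around the BMU on the grid.
        d2 = ((grid_coords - np.array(bmu)) ** 2).sum(axis=2)
        h = np.exp(-d2 / (2 * sigma ** 2))
        codebook += lr * h[:, :, None] * (x - codebook)
    return codebook

def assign_cells(X, codebook):
    """Map each row of X to its best-matching cell (row, col) on the grid."""
    flat = codebook.reshape(-1, codebook.shape[-1])
    idx = np.linalg.norm(X[:, None, :] - flat[None, :, :], axis=2).argmin(axis=1)
    return [np.unravel_index(i, codebook.shape[:2]) for i in idx]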
6.4 Application to an Enhanced Access to Jurisprudence Data Bases The results presented in Sect. 6.3 show that the use of CA can benefit an electronic jurisprudence consultation system. For example, a user facing the problem of determining whether the offence of corruption of minors has actually been committed will obtain a more useful answer if CA is incorporated into such a system. Indeed, with the current traditional approach (see Sect. 3.3 and Fig. 2), the user first needs to indicate the subject “criminal” and the term “prostitution”. Then, he has to choose one, and only one, of the suggested terms, for example “corruption of minors”, in the list of the proposed expressions. As a result, he obtains all the sentences concerned with “corruption of minors”, i.e. a huge collection of sentences among which he will have to find those related to his actual problem. He may, of course, also refine his request by using the free text field but, to do so, he would need to have a very precise idea of the relevant complementary terms to use, which is neither easy, nor, most often, sufficient for a precise selection of the concerned sentences. The CA approach, on the contrary, provides more accurate results. After the selection of the concerned period and offence (or group of offences), the system would propose to the user the list of the standards used in the chosen collection of sentences (i.e. the list of the selected groups of associated words or, if available, the reformulation of these groups of words produced by the analyst). By selecting the standard(s) corresponding to his concern, the user would then be able to access exclusively the sentences associated with the selected standard(s), ranked by decreasing order of association strength. For example, by choosing the standard 6 given in Fig. 7, the user would obtain the results presented in Fig. 10.
Fig. 10. Excerpt of the two sentences the most associated with the chosen standard
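The retrieval step just described, returning the sentences ranked by decreasing association with a chosen standard, can be sketched as follows; the sentence coordinates F come from the earlier CA sketch, and the sign convention and cut-off are assumptions.

import numpy as np

def sentences_for_standard(F, sentence_ids, axis, sign, top_k=10):
    """Rank sentences by decreasing association with a chosen standard.

    A standard is identified here by a CA axis (0-based index) and the sign of its
    extremity; F holds the sentence coordinates and sentence_ids the parallel labels.
    """
    scores = sign * F[:, axis]
    order = np.argsort(scores)[::-1][:top_k]
    return [(sentence_ids[i], float(scores[i])) for i in order]

# Example: the sentences most associated with standard 6 of Fig. 7,
# i.e. the negative extremity of axis 3 (index 2).
# sentences_for_standard(F, sentence_ids, axis=2, sign=-1)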
7 Conclusions CA provides an efficient aid for organizing the studied sentence database by taking into account proximities related to similar underlying standards, as expressed by lexical similarities. The identified standards are useful for a more efficient jurisprudence search, as they allow the exploitation of their association with the corresponding relevant terms and sentences provided by CA. Moreover, CA is able to take advantage of implicit synonymy relations based on the distributional properties of the words in the document collection. More generally, multidimensional methods, such as CA or Kohonen maps, contribute to defining the legal standards through their discursive function. The standards do not intervene in an isolated way: they cooperate to inter-connect facts and evidence. They allow the user to reach better conclusions on how to retrieve meaningful and useful sentences, or to understand why such retrieval is not feasible within the available data base. The study of the links between CA and Kohonen maps, as well as the analysis of the possibilities for their combined use and mutual enrichment, is currently an important field of research.
Software Note For this study, we used the academic software DTM (Data and Text Mining) developed by Ludovic Lebart and collaborators. For this purpose, new functionalities were added to the package, such as ordered lists of the words and sentences characterizing an axis according to different criteria. The DTM software can be freely downloaded from the Web site http://www.lebart.org.
References 1. Bécue, M., Lebart, L.: “Analyse statistique de réponses ouvertes. Application à des enquêtes auprès de lycéens” in Analyse des Correspondances et Techniques connexes. Approches nouvelles pour l’analyse statistique des données, Moreau J., Doudin P.A., Cazes P. (Eds), Springer-Verlag, 1999. 2. Benzécri, J.P.: La taxinomie, T. I; L’analyse des correspondances, T. II, Paris, Dunod, 1973. 3. Benzécri, J.P. et al.: Pratique de l’analyse des données, T. III, Linguistique & Lexicologie, Paris, Dunod, 1981. 4. Bourcier D.: “Une analyse lexicométrique de la décision juridique. Règles, standards et argumentation” In: Instrumentos metodológicos para el estudio de las instituciones, Bécue M. (Ed.), GRES-UAB, 2000, 57–69. 5. Eckart, C., Young, G.: The approximation of one matrix by another of lower rank. Psychometrika, 1, 1936, 118–121. 6. Gaussier E., Grefenstette G., Hull D., Roux C.: Recherche d’information en français et traitement automatique des langues. Traitement Automatique des Langues, Vol. 41, no. 2, 2000.
7. Harris, Z.S.: Distributional structure. Word, 10, 1954, 146–162. 8. Kohonen, T.: Self-Organization and Associative Memory, Berlin, Springer-Verlag, 1989. 9. Lebart, L., Salem, A., Berry, E.: “Recent developments in the statistical processing of textual data”, Applied Stochastic Models and Data Analysis, 7, 1991. 10. Lebart, L., Salem, A., Berry, E.: Exploring Textual Data, Dordrecht, Kluwer, 1998.
Collective SME Approach to Technology Watch and Competitive Intelligence: The Role of Intermediate Centers

Jorge (Gorka) Izquierdo (1) and Sergio Larreina (2)

(1) Fundación para la Innovación y el Desarrollo Tecnológico – FUNDITEC, Mare de Déu dels Àngels 195, 08221 Terrassa, Barcelona, Spain, [email protected] funditec.es
(2) Fundación LEIA CDT, Parque Tecnológico de Álava, Leonardo Da Vinci s/n, 01510 Miñano, Álava, Spain, [email protected] leia.es
Abstract. It has been demonstrated that Technology Watch (TW) and Competitive Intelligence (CI) are important tools for the development of R&D activities and the enhancement of competitiveness in enterprises. TW activities are able to detect opportunities and threats at an early stage and provide the information needed to decide on and carry out the appropriate strategies. The basis of TW is the process of search, recovery, storage and treatment of information. The development of Text Mining solutions opens a new scenario for the development of TW activities. Up to now, the enterprises and organizations using Text Mining techniques in their TW and information management activities are a small minority. Only a few large industrial groups have integrated Text Mining solutions in their structure in order to build up their information management systems and develop TW and CI activities. The situation concerning smaller companies (especially SMEs) is obviously worse with respect to the application of Text Mining techniques. This paper focuses on possible ways to introduce Text Mining solutions into SMEs, describing methodological and operative solutions that could let them profit from the advantages of Text Mining in their TW and CI activities without charging them the high costs of individual Text Mining solutions. The model presented is centered on the collective use of advanced Data Mining and Text Mining techniques in SMEs through industrial and R&D Intermediate Centers.
1 Introduction It has been demonstrated that Technology Watch (TW) and Competitive Intelligence (CI) are important tools for the development of R&D activities and the enhancement of competitiveness in enterprises.
The basis of Technology Watch (TW) is the transformation of scientific, technical and technological information into technical knowledge, allowing the enterprises and organizations to achieve competitive advantage. Technology Watch activities are able to detect opportunities and threats at an early stage and provide the information needed to decide on and carry out the appropriate strategies. TW involves all the processes of search, recovery, analysis, dissemination, and exploitation of the useful technical information needed to obtain value-added information from raw data coming from very different data sources. When this kind of activity is extended to wider areas of information, dealing with information coming from the economy, markets, trade and others, it is known as Business Intelligence or Competitive Intelligence (CI).
2 Development of TW and CI Activities at the Enterprises The activity of enterprises involves many different relevant areas of information, and thus all these areas have to be considered when developing TW and CI activities. In a very approximate way, the useful information for the enterprise can be divided into four differentiated groups (Fig. 1).
Fig. 1. Different types of areas of information valuable for enterprises and useful for the development of TW and CI activities: technology and R&D; regulations and legislation; commercial information (products, competitors, suppliers, markets); and other ambits (projects, news, events, opportunities/threats)
A first group corresponds to all the technical and technological information for the development of R&D activities. A second group is related to all the information about legislation and regulations, which is very frequently crucial for developing different activities in the enterprises and industries. Another area of information corresponds to all the commercial information about products, suppliers, customers, competitors and markets in general. The other group
corresponds to different relevant ambits of information about news, events, projects and other situations that could be related to opportunities or threats for the activity of the enterprises. As a general rule, it can be said that those enterprises that perform TW and CI activities experience a well-defined evolution in the articulation of these kinds of activities (Fig. 2). In a first phase, the enterprises begin by carrying out Technology Watch activities; afterwards they consider other types of information related to legislation and regulations, and in the following phase they consider commercial information. This evolution is based on two aspects: the fact that the R&D departments and teams within the enterprises are more concerned about the need for information analysis to improve the development of R&D projects and activities; and the fact that the search and treatment of economic information is, in principle, more complex than that of technical and technological information.
Fig. 2. Usual evolution of the development of TW and CI activities in industrial enterprises. The first phase involves mainly TW activities, while in a second and third phase information about regulations and commercial information is taken into account for the CI activities
The development of TW and CI activities needs a basic methodology. Even though there are many different kinds of TW activities, a basic methodology divided into five steps can be defined (Fig. 3): (i)
Definition of the working areas: The preliminary step for the development of TW activities is the definition of the priority working areas. Within this phase, the strategic lines for TW in the enterprise have to be established, as well as the specific objectives.
Fig. 3. Five-step methodology for the development of TW and CI activities: 1. definition of the areas of information; 2. formation of the corpus of information; 3. storage and processing of information; 4. information analysis; 5. exploitation of the results
(ii) Formation of the corpus of information: The corpus of information is basically all the information obtained from all the relevant sources useful for the development of TW and CI activities. This step involves the process of selecting the information sources and determining their characteristics (quality, availability, accessibility, structure, contents, etc.). Once the information sources are selected, the useful information is searched for and downloaded. (iii) Storage of information: The recovered information is stored in the organization’s information management system following a previously designed methodology and structure, which will allow the treatment and analysis of the information in all the selected ambits. (iv) Treatment and analysis of information: The information is processed, classified and analysed following the different objectives of the TW planning. In this step, specialized tools for the classification and analysis of information are frequently used in order to facilitate and optimize the process. (v) Exploitation of the results: Once the information analysis is carried out, the obtained value-added information is distributed in different formats to the final users within the enterprise. Besides the methodological aspects, the development of Watch and Intelligence activities involves three basic elements: information sources, specialized tools for the search and capture of information, and specialized tools for information analysis. The variety of useful information sources for the development of Watching activities is very wide. The huge volume of information accessible via the web, as
well as the high cost of processing other, more classical information sources, has made the internet the most used information source for the development of TW and CI activities. The internet is not only a huge and growing store of information but also the tool for accessing much more easily most of the specialized databases covering, almost completely, many of the information areas. The growing relevance of the internet has fostered the development of new specialized tools for the search and capture of information on the internet, which have quickly been incorporated as basic elements in TW and CI activities. The other key element for the development of TW and CI is the software tools for the treatment and analysis of information. The information analysis needed in most TW and CI activities involves huge amounts of information that would be very difficult to manage without the help of this specialized software. Among the expert tools for information analysis, two different groups of solutions can be considered: Data Mining solutions and Text Mining solutions. Grosso modo, Text Mining tools are much more sophisticated and include semantic modules that allow them to “understand” language and extract knowledge from raw data, so they are able to work with both structured and non-structured information. Data Mining solutions are useful only with structured information (information in which the contents are organized following a defined pattern). Even though Text Mining solutions potentially offer more advantages than Data Mining ones, they are much less used, for different reasons: the high degree of novelty of these solutions, so that the end users do not feel totally confident in their deployment; the lack of standardization of these solutions and the fact that Text Mining tools are usually limited to the analysis of very specific ambits of information; and the prices of these software tools, which are much higher than those of usual advanced Data Mining solutions. The facts mentioned above have limited the degree of penetration of Text Mining solutions in the enterprises, and thus the level of development of, and investment in, this kind of software tool specialized in information analysis. It can be said that at the present time only the main industrial groups have considered the use of this type of Text Mining solution, while the small and medium enterprises (SMEs) have a very limited knowledge of these tools and their possibilities. Nevertheless, the SMEs play a key role in the economy, especially in Europe, where the activity of SMEs is considered one of the main pieces of economic development and innovation in most sectors. From a general point of view, the introduction of Text Mining tools in the information analysis activities of SMEs will have many advantages associated with the improvement of TW and CI activities, mainly for those innovative SMEs working in very dynamic and competitive areas affected by the globalization of the economy and markets. There are many different reasons to explain the low penetration of Text Mining in the enterprise environment. On the one hand, there is a problem of
lack of information about these solutions and their potential advantages, which to SMEs appear useful only for the main industrial and technological groups. Another problem is their high price: enterprises do not feel comfortable investing such amounts of resources in software whose benefits are difficult to evaluate economically in the short and medium term. This problem is associated with the perception within enterprises that Text Mining software is not sufficiently mature and that these tools are profitable only in very limited and specialized areas. These tools are also seen as very difficult to integrate with the administrative and financial software tools that are usually present in enterprises. In enterprises there is a feeling that Data Mining tools are much less expensive, more developed and standardized, and easier to integrate with the other administrative software tools in the enterprise.
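To make the TW/CI workflow outlined earlier (corpus formation, storage, treatment and analysis, and exploitation) more tangible, the following minimal sketch walks through one cycle on purely invented documents and keywords; it illustrates the sequence of steps only and does not represent any particular commercial TW or Text Mining tool.

```python
from collections import Counter

# Hypothetical corpus: in a real TW/CI system these documents would be
# retrieved from the selected web sources, patent and publication databases.
corpus = [
    {"source": "news", "text": "Competitor X files patent on polymer coating"},
    {"source": "journal", "text": "New polymer coating improves textile durability"},
    {"source": "news", "text": "Regulation update on textile chemical safety"},
]

# (iii) Storage: keep the documents under a structure that records their origin.
store = {i: doc for i, doc in enumerate(corpus)}

# (iv) Treatment and analysis: a crude term-frequency count stands in for the
# specialized classification and analysis tools discussed in the text.
terms = Counter()
for doc in store.values():
    terms.update(w.lower() for w in doc["text"].split() if len(w) > 5)

# (v) Exploitation: distribute the most salient terms as a simple alert.
print("Emerging terms:", terms.most_common(3))
```

In a real system, the analysis step would of course rely on the specialized Data Mining or Text Mining tools discussed above rather than on simple term counts.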
3 Collective SME Approach to TW and CI
As pointed out above, at the present time there are some concrete barriers that slow down the development and use of Text Mining solutions by most enterprises. These barriers would be difficult to overcome from the classic view of TW and CI, in which each organization has to carry out the information analysis activities in isolation. In this paper we propose a different approach to the problem, taking as its basis the role of Technology Centers and other Intermediate Centers, which in Europe have a high capability for clustering and integrating enterprises. Technology Centers offer technology services and develop R&D and innovation projects and initiatives in collaboration with a large number of enterprises (most of them SMEs), acting as a common contact point for groups of enterprises in the same sector or coming from very different sectors. The approach proposed here is to use these intermediate organizations as the basis of a centralized TW and CI system offering information analysis services to a group of enterprises. Within this scenario the centralized system could be provided with the most advanced tools, optimizing costs and resources, and with a technical staff specialized in TW and CI activities. Obviously, the information analysis service offered by these organizations could not be extended to all areas of activity, because part of the information is considered confidential and has to belong to each of the enterprises. So, the TW and CI system would be organized in the following way: the Intermediate Center would offer information services in the areas considered common, while TW and CI activities in restricted (confidential) areas would have to be developed individually by each of the enterprises (Fig. 4). Which areas would be considered common, in which the Intermediate Centers would develop the collective TW and CI services? Basically those areas of information with contents of common interest for all the enterprises, such as technology and R&D, regulations and legislation, projects, general
Fig. 4. Collective approach to TW and CI for SMEs. The Intermediate Center periodically delivers information about the different areas to the enterprises
opportunities and threats, etc. The areas in which the enterprises would have to perform TW and CI activities individually would be those more sensitive from a competitive point of view, such as commercial aspects, suppliers, customers, competitors, markets, etc.
Fig. 5. Collective approach to TW and CI for SMEs. The Intermediate Center carries out the TW and CI activities for the different SMEs in the common areas of information (areas enclosed by the blue line), while each of the SMEs has to perform TW and CI activities for its particular areas (areas enclosed by the red line)
Within this approach the Intermediate Center would act as a centralized information management system for all the enterprises in the areas considered common (Fig. 5). This centralized unit would carry out the activities of search, recovery, storage and analysis of information, and would then distribute the value-added information among the different enterprises taking part in the collective TW and CI system (Fig. 6).
Fig. 6. Collective approach to TW and CI for SMEs. The Intermediate Center acts as the common information management system for the different SMEs involved. Once the information is classified and analyzed, it is distributed to the enterprises
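As an illustration of the collective model of Figs. 4–6, the sketch below (with hypothetical area names and SME identifiers) shows how an Intermediate Center might route analysed items: items falling in common areas are delivered to all participating SMEs, while items in confidential areas are left to each enterprise's own TW and CI activities.

```python
# Hypothetical configuration of the collective TW/CI system.
COMMON_AREAS = {"technology", "regulations", "projects"}

sme_subscriptions = {
    "SME I": COMMON_AREAS,
    "SME II": COMMON_AREAS,
}

def distribute(item):
    """Send a classified, analysed item to the subscribed SMEs if it belongs
    to a common area; confidential areas are not handled centrally."""
    if item["area"] not in COMMON_AREAS:
        return []  # e.g. customers, suppliers: handled by each SME itself
    return [sme for sme, areas in sme_subscriptions.items() if item["area"] in areas]

item = {"area": "regulations", "summary": "New EU directive on textile labelling"}
print(distribute(item))  # -> ['SME I', 'SME II']
```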
Our organizations (FUNDITEC and Fundación LEIA) have developed, with the support of Regional Governments in Spain, pilot programmes of TW and CI using this collective SME approach for different groups of enterprises in sectors such as textile and clothing, the metal industry, and manufacturing. The results obtained have been very promising at two levels: enhancing the level of TW and CI activities among these groups of enterprises, and optimizing the use of specialized tools and other resources for the development of these activities.
4 Summary
It has been demonstrated that Technology Watch (TW) and Competitive Intelligence (CI) are an important tool for the development of R&D activities and the enhancement of competitiveness in enterprises. The basis of TW is the process of search, recovery, storage and treatment of information. For the development of TW two main elements are needed: appropriate sources of information, and specialized software tools for the analysis of information. The development of Text Mining solutions opens a new scenario for TW activities. Up to now, the enterprises and organizations using Text Mining techniques in their TW and information management activities are a small minority. The situation concerning smaller companies (especially SMEs) is obviously worse with respect to the application of Text Mining techniques. In this paper we have introduced a possible way to apply Text Mining solutions in SMEs. The model presented is based on the collective use of advanced
TW and CI techniques in SMEs through industrial and R&D Intermediate Centers. This system has been used in pilot actions with groups of SMEs in different sectors and the results have been very promising.
References
1. Escorsa, P., Maspons, R.: "De la Vigilancia Tecnológica a la Inteligencia Competitiva". Financial Times/Prentice Hall, Pearson (2001)
2. Izquierdo, J.: New approaches for Technology Watch and Competitive Intelligence in the enterprises, in preparation
New Challenges and Roles of Metadata in Text/Data Mining in Statistics
Dušan Šoltés
Faculty of Management, Comenius University, Bratislava, Slovakia
Abstract. The paper deals with the new challenges and roles of metadata and metainformation in the area of text/data mining in statistics. In the first part, the paper presents some basic characteristics of contemporary statistical information systems from the point of view of the need to utilize metadata and data/text mining. As is well known, modern statistical systems are characterized by enormous amounts of various statistical data, which also requires specific methods and technologies for their processing. In the second part the mutual relations between metadata and metainformation are analysed, and some conclusions and recommendations for further research and development in these problem areas are presented.
1 Introduction
Traditionally, official statistics has always been known as one of the typical examples of the so-called large information systems, with an enormous amount of data coming and/or to be collected from various sectors and, in most countries, from practically all sectors and areas of overall socio-economic life. All these various data have to be stored for relatively long periods of time. In some cases, as in demographic and/or vital statistics, they have to be stored and processed for decades and even centuries. Finally, all these data, stored in the form of long-term time series, have to be processed in such a way that the processed information provides the necessary long-term comparability and thus also the necessary analytical and informational potential. However, national statistical systems have always been characterized by at least two contradictory features. One is the fact that technological progress in data processing in statistics has always been very fast, so that official statistics has always belonged among the most advanced and progressive areas of the overall informatization of society. On the other hand, that technological progress has produced ever more demands for new data to be collected, stored, processed and prepared for final utilization by
an ever-growing number of final users of statistical information and of various statistical analyses, studies, etc. Thus, statistical systems all over the world have become one of the typical examples of the so-called "information explosion", where the produced amounts of information have generated demands for even more information. It is then no surprise that official statistics has also become the very area where many new, modern and progressive approaches to data processing, such as metadata and metainformation, (very) large-scale databases, data warehouses, as well as text and data mining, among others, have found their pilot introduction and practical utilization. In the following parts of this paper we deal with the issues of data/text mining from the standpoint of metadata and their utilization, also on the basis of our experiences from other similar EU-funded IST projects such as METANET, e-Europe+ and SIBIS – Statistical Indicators for Benchmarking the Information Society, etc.
2 Some Basic Characteristics of the Contemporary Statistical Information Systems from the Point of View of the Needs for Utilization of Metadata and Metainformation for Data/Text Mining
As already mentioned above, statistical systems have, throughout all the stages of the computerization and informatization of society completed so far, permanently been among the front runners in the full utilization of the latest technological advances in information and communication technologies. As a consequence, statistical information systems have always been characterized by extremely large volumes of data to be registered, entered into the computerized system, processed and, mainly, stored for a relatively long period of time. Thus one of the main and dominant features of all statistics has always been the necessity to store statistical data for the longest possible periods of time and in that way to achieve long-term coverage and comparability of the stored statistical data on the particular statistical phenomena and processes. This is one of the specifics of statistical data storage or archiving. While in most other problem areas data archiving is legally limited to a period of 5 or 10 years, in statistics the same periods are more or less only those of short-term or medium-term time series, while the long-term time series are expected to cover time periods of several decades, etc. It is then no surprise that especially in statistical systems really large amounts of statistical data have been stored over time, and for very long periods. These data have been stored in at least the following main storage categories:
• files/databases of current statistical data, i.e. data from the latest statistical surveys, which are mainly used for operative purposes
• files/databases of short-term and/or long-term time series covering a certain period of time and, in the case of some long-term time series, going back for decades and even centuries
• these two ever-growing types of databases of mostly highly structured data finally led to the emergence of the so-called very large databases, which are characterized first of all by enormous amounts of data stored under various DBMSs and their specific requirements for the organization and structuring of data according to particular database models (e.g. linear, hierarchical, network, relational, etc.) that enable their direct access and retrieval, but also their processing and subsequent presentation and dissemination to the wide community of various users
• all various other data, including the original primary statistical data that were directly collected from various reporting units and were used to create the previous main categories of stored statistical data. These primary data also had to be stored for a long period of time, since only through them was it possible to secure any reconstruction, updating, restructuring, etc. of the data stored in the long-term time series; without the detailed original data their reconstruction would be practically impossible.
Thus, over time, and thanks mainly to the ever-growing capacity of modern information and communication technologies and storage media, enormous amounts of statistical data have accumulated in the possession of national statistical agencies, especially in the last category of original data that were once already used but, for various unspecified future reasons and needs, have to be stored on a long-term basis for potential future utilization. So, again, it is no surprise that statistical agencies have been among the first to experience in practice what has now come to be popularly called data warehouses. Currently, these contain mainly the various data needed not for any immediate use but kept available for various ad hoc cases if and when a user and/or methodological need arises, e.g. to update or reconstruct long-term statistical time series, etc. As mentioned above, thanks to the enormous storage capacities of modern computers, these warehouses of statistical data currently contain amounts of data measured not in megabytes or gigabytes, i.e. millions and billions of data items as a decade or so ago, but in terabytes, i.e. trillions, and even petabytes, i.e. quadrillions, of statistical data items. In addition to traditionally structured numerical data, ever larger amounts of statistical data belong to the so-called unstructured textual data, i.e. data consisting of plain unstructured textual strings of various analytical publications, commentaries, explanatory and/or accompanying texts, etc. But their role in official statistics has not at all been unimportant or merely supplementary. It is a well-known fact that statistical agencies have always been famous as some of the largest publishers in the world, regularly publishing various periodical statistical publications, very often with a periodicity of weeks or even days. While previously, due to the
limited capabilities of former computers in text processing, most of these statistical publications were only partially produced by computers – i.e. statistical tables were produced by computers but the textual parts of the same publications were not – nowadays the whole statistical publication system, including ever-growing amounts of graphical data, is produced by computers and thus automatically becomes part of the electronic statistical warehouses. It is quite logical that in many cases these textual data are sometimes much more important than the numerical statistical data themselves, since only these textual data clarify the real content and general context of the particular "main" statistical data. Only these textual data clearly explain the meaning of individual statistical indicators, their context, proper interpretation, development trends, etc. As a rule, only these textual data fully explain to the users the proper interpretation and thus also the utilization of the data. What is necessary to mention in this connection is the fact that all these data belong to the so-called state or official statistics, not to mention further terabytes and petabytes of statistical data coming from sources other than official statistics. With the advance of information and communication technologies and their general accessibility – in contrast to, e.g., the formerly very expensive and complexly operated mainframe computers – nowadays practically any government or other public or private entity collects, processes and utilizes its own statistical data. In this respect it is enough to mention, for example, the ever larger, more active and more influential third sector of various NGOs that for their lobbying and other campaigning activities mostly use their own statistical and other data, which in many cases are totally contradictory to the official statistical data. Just for illustration, the statistical data on unemployment in the Slovak Republic are as different as 13% from the official governmental agencies, 19% from the state statistics, and well over 20% from various other sources, e.g. opposition political parties, trade unions, etc. And what makes the whole situation even more complicated is that all these various data from various sources do not exist only in some centralized or hierarchically organized systems but are really and truly available on the world wide web and thus spread and accessible all over the globe. All that makes the situation of users of statistical data very complicated and difficult, not only regarding the identification of the most reliable sources of data but even more regarding their final interpretation. Hence, contemporary very large statistical databases, and even more so statistical data warehouses, are characterized not only by enormous amounts of structured and unstructured data existing in the form of classical numerical and/or computational data; even more, they are characterized by ever-growing amounts of unstructured textual data. But even the unstructured textual data are not the latest category of stored data. An ever-growing amount and share of all the stored statistical data belongs to graphical, visual, even voice and multimedia data, which all contain a very
high information potential even after their primary utilization, publication, presentation and dissemination. As such they also have to be stored in the contemporary statistical data warehouses, etc. As we have already mentioned above, what complicates the overall situation with statistical data even more is the fact that in the era of the www, statistical data are generally available and directly accessible not only from the so-called official statistical sources but also from various other providers of statistical data. In addition to the traditional problem of differences between statistical data from the official statistical agencies and from various ministries, other central organs and/or other data suppliers, there is currently also the problem of comparability and interpretation of data coming from various official international agencies and organizations. For example, if one takes statistical data on FDI from various sources, e.g. from the IMF, the World Bank, UNCTAD, the OECD or the EU, it is quite possible that the same indicators for the same country and the same period contain completely different amounts of inward or outward FDI, although logically the totals of inward and outward flows have to be the same. But as a rule they are not, as different providers tend to use different methodologies, survey scopes, etc. The same is true of intra-national sources, e.g. between data coming from national statistical agencies and from central banks or other national authorities. Hence, the problem with statistical data from different sources concerns not only the data themselves but also their interpretation. Hence, the second main feature of the development trend regarding statistical data warehouses, and regarding data mining from those very large stores of various, often unspecified, unstructured and not specifically organized statistical data, is not only the problem of how to access and retrieve the warehoused data but especially how to utilize their often not fully discovered and thus unknown and/or hidden information value, their interpretation and, finally, their usage. In this connection, one of the most promising tools and methods not only for accessing and retrieving statistical data – i.e. in fact for mining data coming from different sources and stored in data warehouses – is the utilization of metadata and metainformation. They could serve not only as tools for the initial data mining itself but also for its later proper interpretation and thus also its final utilization. Thus metadata and metainformation fulfil two key functions in connection with data mining:
• the first is to serve as a tool for the mining itself, i.e. discovering and retrieving data from the particular data warehouse
• the second is to assist in the proper interpretation of the already mined data.
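A minimal sketch of these two functions (using an invented catalogue entry, not any real statistical metadata standard) could look as follows: the same catalogue record is used first to expand a mining query and then to annotate the retrieved figure for interpretation.

```python
# Hypothetical entry of a catalogue of statistical indicators (metadata).
catalogue = {
    "unemployment_rate": {
        "synonyms": ["jobless rate", "unemployment"],
        "definition": "Unemployed persons as a share of the labour force",
        "unit": "percent",
        "survey": "Labour Force Survey",
    }
}

def expand_query(term):
    """Function 1: metadata as a mining tool (query expansion)."""
    entry = catalogue.get(term, {})
    return [term] + entry.get("synonyms", [])

def annotate(term, value):
    """Function 2: metadata for interpreting the mined figure."""
    entry = catalogue.get(term, {})
    return f"{value} {entry.get('unit', '')} ({entry.get('definition', 'no definition')})"

print(expand_query("unemployment_rate"))
print(annotate("unemployment_rate", 13.0))
```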
3 New Challenges for Metadata and Metainformation and Their Double Role in Data Mining
In view of the above two functions of metadata and metainformation, one of the main further development trends in statistical data mining has to be oriented towards research on the potential of metadata and metainformation as tools for data mining. After all, the biggest problem of data mining from any large data warehouse has always been to identify and specify which data, in what form, and on what methodological basis the particular data were based. And this has been precisely the main role of metadata and metainformation in general; hence research in this field has to focus mainly on how the generally well-known functions of metadata could be used efficiently for the specific needs of mining data stored in data warehouses – including unstructured, textual and various other data – and thus to make them accessible for further secondary and/or tertiary utilization after their primary, original utilization in the actual databases and/or historical statistical time series. As is well known from the practical application of commercially available data mining systems, the results of their data mining in many cases are not correspondingly accurate and do not correspond to the expectations of users. For example, in connection with the preparation of this paper we used one of the commonly used search (data/text mining) engines on the web, specifying as the search and/or mining criterion the name of this author, i.e. DUSAN SOLTES. The results were more than interesting but at the same time also disappointing. The system reported altogether 111 pages in English on the web that matched the particular selection and/or mining criterion. However, only a few of them fully corresponded to the expected results. The rest did not meet the expected criteria:
• in addition to the proper combination of the above two names, DUSAN was also associated with such family names as M. Berek, Garay, Pavcnik, Keber, Staples, Mramor, Krstanovic, Levicky, Djurik, C. Stulik, Wunder, Kurta, Belak, Rak, Valachovic, B. Kratscch, K. Stankovic, etc.
• in combination with the family name SOLTES, such first names and/or initials as Ladislav, Vincent, Amy, Igor, Peggy, David, Joseph G., Ed, Andrej, Cathy, Ori Z., Juraj, Rudolf, Waren E., Barbora A., etc. were also selected.
Also from this simple example it is clear that data/text mining through some commonly available search and/or mining engines is not yet the most reliable. As this small example also demonstrates, the mining is in principle based on the logical operator OR and not on the more suitable and appropriate AND. It can be expected that with more complex mining criteria the result would be more reliable than in this rather simple illustrative case, but that also requires more complex mining algorithms and procedures and more reliable selection criteria and their specifications. Even when we
stipulated more precisely that we were interested in Mr., i.e. males only, the result was not much more reliable, as it then simply combined Mr. Soltes with other persons titled Mr. from the same "file" of related information, e.g. from various conferences, publications and/or other common objects. From these examples it is evident that the role of metadata and metainformation in the better functioning of the particular mining engine was not properly exploited. In this particular case it would be much more beneficial if the text mining system made greater use of some of the basic functions of metadata, such as thesauri, a data dictionary or a data directory, which would characterize the object of the particular search more completely, with more specifically oriented questions helping to determine the object of the particular data/text mining more precisely. It is quite clear that without such wider and more elaborate utilization of metadata and/or metainformation, the primary function of data/text mining – i.e. to find on the web or in the particular warehouses or databases all the relevant data – is simply almost impossible to fulfil, and this makes the whole data/text mining only partially beneficial from the very start of the process. It is quite evident that all the subsequent and expected benefits of data/text mining, such as editing, storage, processing and presentation of the mined data, are then in jeopardy as well, since their success fully depends upon the success of the initial data mining. However, this primary, or searching, function of metadata and metainformation in the whole data/text mining process is just the first one, although the most crucial and important for the whole mining process. The other functions, in many aspects even more important, are those of metadata and metainformation in all subsequent phases of an extended data mining process, including its final phase, when the mined data have to be processed and prepared for final presentation and disseminated to the final users. In this connection several other important functions of metadata and metainformation come to the fore. First of all, these are the functions that can provide the proper specification and interpretation of the mined data. In this connection, at least in the context of statistical information, the primary functions belong to such metadata instruments as:
• catalogues of statistical indicators, which specify and define individual statistical indicators and their main properties and characteristics, such as the object of measurement, definition, measurement unit, time characteristics, periodicity, etc.
• catalogues of statistical surveys, with specific information on the basic characteristics of the particular statistical surveys, their scope, reporting and observation units, survey time and periodicity frame, etc.
• catalogues of statistical terms, containing definitions of the objects of statistical surveys and measurements, e.g. definitions of such basic concepts as an SME. Just for illustration, there is quite a big difference between the definition of a small business in the USA and, e.g., in the Slovak Republic: a small
enterprise in the former would belong among the large companies in the latter
• catalogues and/or registers of reporting units, containing information, i.e. metainformation, on the composition of the observation and/or reporting units of the particular survey, etc.
• another inevitable and integral part of any data/text mining process, as regards the proper interpretation of the mined data, is the whole set of relevant statistical tools, instruments and methods used for the individual statistical surveys, such as glossaries, thesauri, classifications, nomenclatures and code lists, as they are all an integral part of any statistical information system and of the processing of statistical data; without them it is not possible to achieve any more specific interpretation of the processed statistical data and information.
In general, a better and more elaborate utilization of metadata and metainformation in the process of statistical data/text mining could make the whole mining process more reliable and accurate. But in general it requires an enormous amount of work in connection with the creation of the particular metadata bases for statistical data, similar to what has already been done in some more specific and thus narrower areas of data/text mining. For example, as reported on the web at http://www.drugresearcher.com, in the biological literature scientists have already prepared an elaborate system of catalogues of classes of basic terms, concepts, objects, etc., covering altogether 33 categories of the particular biological terms, phenomena, etc. It is quite clear that the volume of such basic concepts in one specialized area such as biology is not as large as in the case of statistics, which covers all parts of overall socio-economic life, but sooner or later official statistics will also have to proceed in the same way. There is no other way if we want to see evident progress in statistical data/text mining as well.
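The search example in Sect. 3 can be restated in a few lines of code. The sketch below (toy documents, naive whitespace tokenization) contrasts OR-like with AND-like retrieval and shows how even a very small metadata record about the searched object can narrow the result further; it illustrates the argument only and does not describe how any actual web search engine is implemented.

```python
docs = {
    1: "Dusan Soltes Comenius University conference on metadata",
    2: "Dusan Krstanovic and Amy Soltes attend a workshop",
    3: "Dusan Berek publishes on polymer chemistry",
}

def tokens(text):
    return set(text.lower().split())

def search(query, mode="OR"):
    q = tokens(query)
    if mode == "OR":
        return [d for d, t in docs.items() if q & tokens(t)]   # any term matches
    return [d for d, t in docs.items() if q <= tokens(t)]      # AND: all terms present

print(search("Dusan Soltes", "OR"))   # -> [1, 2, 3]: the over-broad result
print(search("Dusan Soltes", "AND"))  # -> [1, 2]: both terms must co-occur

# A small metadata record about the searched object narrows the result further.
person_metadata = {"affiliation": "comenius"}
refined = [d for d in search("Dusan Soltes", "AND")
           if person_metadata["affiliation"] in tokens(docs[d])]
print(refined)  # -> [1]
```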
4 Final Remarks
On the basis of the above results of our research, it is our understanding that further development and research in the area of statistical data/text mining has to be oriented, among other things, towards a better and more consistent utilization of the results of, for example, another Network of Excellence funded by the European Commission, the METANET Network of Excellence, where a whole range of potential benefits as well as techniques and methods for handling statistical metadata and metainformation have been identified, potentially including the direct utilization of metadata and metainformation in the framework of statistical data/text mining, as we have also identified such opportunities in this paper. Hence one of the partial research tasks has to be based on searching for ways and means of utilizing the METANET outputs for the specific needs of statistical data/text
mining. One of the most feasible and easiest ways to achieve wide practical utilization of metadata and metainformation in (statistical) data/text mining, especially for data on the web, is to introduce a kind of international standard that would require all Internet providers to accompany their (statistical) data on the web with a type of metadata/metainformation labelling, i.e. an obligatory label containing all the necessary minimal specifications of the particular data – similarly to what is common in any warehouse, where there are standardized and generally accepted and utilized specifications of the stored merchandise through the particular classifications, nomenclatures, code lists, etc. After all, it is nowadays already quite routine practice that data and information have the same characteristics and "market" values and prices as any other products or merchandise.
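Purely as an illustration of such an obligatory metadata label (the element names below are invented and do not belong to any existing standard), a minimal label accompanying a published figure could be generated as follows:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal metadata label accompanying a published statistical
# data set; the element names are chosen for illustration only.
label = ET.Element("statistical-metadata-label")
ET.SubElement(label, "indicator").text = "Unemployment rate"
ET.SubElement(label, "producer").text = "National statistical agency"
ET.SubElement(label, "methodology").text = "Labour Force Survey, ILO definition"
ET.SubElement(label, "reference-period").text = "2004-Q4"
ET.SubElement(label, "unit").text = "percent"

print(ET.tostring(label, encoding="unicode"))
```

A receiving application could then read such a label to identify the survey, methodology and unit before comparing figures from different providers.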
References
1. METANET, Metainformation Systems in Statistical Offices, MetaNet Work Package 1: Methodology and Tools, EU IST-1999-29093 Project, Brussels 2003
2. SIBIS – Statistical Indicators for Benchmarking the Information Society, Measuring the Information Society in the EU, the EU Accession Countries, Switzerland and the US, Pocket Book 2002/03, Empirica, Bonn 2003
3. Olivia Parr Rud: Data Mining Cookbook: Modeling Data for Marketing, Risk and Customer Relationship Management, John Wiley & Sons International Rights, Computer Press, 2001
Using Text Mining in Official Statistics
Alf Fyhrlund, Bert Fridlund, and Bo Sundgren
Statistics Sweden, Box 24 300, SE-104 51 Stockholm, Sweden
{alf.fyhrlund, bert.fridlund, bo.sundgren}@scb.se
scb.se/indexeng.asp
Abstract. There is a tremendous increase in the number of actors in the statistical arena in terms of producers, distributors, and users due to the new options of the web technology. These actors are not sufficiently informed about the technological progress made in the field of text mining and the ways in which they can benefit from these. The NEMIS project, and especially its Working Group 5, aims to identify possible applications of text mining in the world of production and dissemination of official statistics. Examples of such applications might be advanced querying of document warehouses at websites, analysing, processing and coding the answers to open-ended questions in questionnaire data, sophisticated access to internal and external sources of statistical metainformation, or to “pull” statistical data and metadata from the web sites of sending institutions.
1 What is Text Mining?
Text mining is a methodology (class of methods, tools(1), and work-practices(2)) that enables people to obtain useful information, given a certain purpose, from (typically very large) sets of more or less heterogeneous and unstructured data, e.g. free-text data with or without subsets of more structured data like tables and graphs. The concept of text mining has emanated from the field of data mining. Data mining is a methodology (class of methods) that enables people to obtain useful information, given a certain purpose, from large sets of homogeneous and structured data, usually quantitative and/or categorised data. Data mining includes traditional methods for statistical analysis, but goes beyond those methods. Web mining is text mining (or data mining) performed on data sets that are available via the Internet.
(1) Tools include software products and auxiliary data sets like thesauri.
(2) Work-practices prescribe how methods and tools should be applied to solve problems.
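As a toy illustration of the distinction drawn in this section (invented data; real tools use far richer linguistic processing), data mining operates directly on structured, categorised records, whereas text mining first has to derive such structure from free text:

```python
# Data mining setting: structured, categorised records.
records = [{"sector": "retail", "turnover": 120}, {"sector": "retail", "turnover": 80}]
print(sum(r["turnover"] for r in records) / len(records))  # a simple aggregate

# Text mining setting: structure has to be extracted from the free text first.
answer = "We are a small retail company with a turnover of about 100 kEUR"
extracted = {"sector": "retail" if "retail" in answer else "other",
             "turnover": int("".join(c for c in answer if c.isdigit()))}
print(extracted)  # -> {'sector': 'retail', 'turnover': 100}
```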
2 The Present Challenges for Statistical Information Systems
The Internet is a relatively new channel for the distribution of information. The supply of data has increased dramatically, but its quality is not always obvious or clearly declared, and it is common to speak of information overflow and of problems in searching for knowledge and intelligence. This applies to the global statistical system as well. Previously, it was an affair for a quite narrow group of specialists with good control over standards and distribution tools for statistics. But today official and other statistics are exposed to everyone who can surf the web. Communication tools using the Internet have a strong influence on the production and the dissemination of statistical data, and National Statistical Institutes (NSI) have less control over the flow of statistical data than before. There is a tremendous increase in the number of actors in the statistical arena in terms of producers, distributors, and users due to the new options of web technology. This development is a big challenge for the statistical community and official statistics. The new information society requires knowledge management and business intelligence. Democracy needs valid and reliable statistics in the political process in order to work well. These are examples of areas where statistics have an important role to play. One problem is the changing conditions for production, dissemination, and quality control of statistical data. A huge supply of data from different producers and distributors is no guarantee of accessibility of relevant information for the different needs of users. Another problem is the combination of traditional paper-based distribution channels and new distribution channels using electronic publishing. Statistics are stored in a variety of structured databases and/or displayed as unstructured data in different formats, which makes it even more complicated than before to search for and find data among the growing number of producers and distributors. At the same time the number of users is increasing, with new categories appearing, e.g. in the business sector. Old, experienced users – as well as new, unskilled ones – need understandable and accessible metadata in order to draw the correct conclusions from statistics – often in new contexts, transformed into indicators, and presented and designed in new visualisation formats. The global society demands internationally comparable regional and national official statistics based on harmonised data from national producers. Good accessibility to such official statistics with accompanying metadata is a pre-condition for efficient work in the development of national societies as well as in international co-operation and within the European Union.
3 Stakeholders in Official Statistics The main stakeholders in (systems for) official statistics are: • users of official statistics, e.g. researchers, analysts, actors on finance markets, teachers, students, journalists, politicians, and the public at large • producers of official statistics, including designers, operators, managers • respondents to statistical surveys and other types of statistical systems • owners (funders) of official statistics: government, parliament, tax-payers
4 Considering the Potential for Text Mining in Connection with Official Statistics
The purpose of this paper is to make a systematic investigation of possible applications of text mining in connection with official statistics. We consider
• the processes associated with planning, operating, and evaluation of production and usage of official statistics
• the different categories of stakeholders who have roles to play in connection with these processes
Thus we will investigate which stakeholders may possibly benefit from using text mining methods, tools, and work-practices in performing different processes in connection with official statistics. We will not here discuss the methods, tools, and work-practices as such – that is the task of other Work Groups of the NEMIS project – but we will focus on possible applications in connection with official statistics. Some of these possibilities may be realistic and may ultimately become realised, whereas others may turn out to be too far-fetched and may not materialise in practice, at least not in the near future.
5 How Can Text Mining Be Used in Connection with Official Statistics? There is a potential for using text mining methods in all processes related to the production and use of official statistics, where people need to obtain useful information and more structured data from more heterogeneous and less structured data. Probably there is at least some potential for using text mining methods in almost all processes related to the production and use of official statistics, including • planning of statistical systems and processes • execution and monitoring of statistical systems and processes • evaluation of statistical systems and processes
6 Use Cases for Text Mining in Official Statistics
Here we will indicate a number of more or less realistic use cases for text mining in connection with official statistics. The use cases emanate from a systematic walk-through of the processes indicated in Figs. 1 and 2 in combination with a systematic consideration of the tasks and information needs of the stakeholders involved in official statistics. The use cases need to be further elaborated and discussed in co-operation with the other Work Groups of the NEMIS project.
7 Use Cases Related to the Tasks and Information Needs of Users of Official Statistics 7.1 Find Potentially Relevant Statistical Data Given a Certain Problem or Theme of Study • Assist users to describe their problems and/or information needs, e.g. by means of free text. • Find potentially relevant data (more or less unstructured) given the problem description. 7.2 Help a Certain User to Interpret Retrieved Statistical Data and Metadata • Estimate the frame of reference of the user. • Given the user’s assumed frame of reference, rephrase and augment the given statistical data and metadata. 7.3 Help Users to Compare Similar or Related Data from Different Sources • Are there similar statistical data produced by other countries? Are the data comparable? • Would it be possible to link these data (from this source) with those data (from that source)?
8 Use Cases Related to the Design and Operation of Statistical Systems 8.1 Assist in the Identification of User Needs • Identify user needs by exploring the web: what topics and issues are people discussing, how could these discussions benefit from (better) statistical data?
8.2 Assist in the Identification of Possible Sources of Data • Identify and locate possible data sources, and compile descriptions of these data and their quality. • Can existing data be transformed and (re)used for new purposes? • Would it be possible to combine data from different sources in order to satisfy a certain information need? 8.3 Assist in Different Design and Operation Tasks • Find descriptions of comparable designs or references to such descriptions, and identify and locate the experts behind such designs. • Provide methods and tools for data editing (data cleaning). • Provide methods and tools for automatic coding of free-text data, e.g. responses to open questions in a questionnaire. • Assist in the harmonization of statistics from different countries, in different languages, concerning related topics, etc. 8.4 Assist in the Design of Resources and Components in Statistical Systems • Assist in the development and updating of classifications, thesauri, etc. • Help to alert statisticians about the need for new or revised concepts and classifications. Official statistics are sometimes accused of describing society in terms of categories that were relevant 30 years ago; could that be improved? • Could text mining and thesauri improve each other (both ways)?
9 Use Cases Related to the Needs of Respondents to Statistical Systems 9.1 Help Respondents to Become Motivated and to Do Their Job Properly • Which are the intended usages of the data that I provide regularly to the statistics producer? To what extent are the statistics produced from my data actually used, judging, for example, from references to those statistics? What could be the value of those usages? Are the users satisfied with the statistics they get from my data, or would it be better for them (and maybe even for me) if I reported some other data or in some other way? • Are there other respondents who have similar problems as I have, when I must provide the data to the statistics producer? Could we have some communication about our experiences and maybe even come up with proposals for better concepts and procedures?
10 Use Cases Related to the Needs of Managers and Owners of Statistical Systems 10.1 Help Managers to Evaluate Existing Statistical Systems from Different Points of View • Are others doing more or less the same in a more efficient way (less expensive, with better quality in different dimensions)? Who? How? • Are the statistical data that we are producing giving a relevant and coherent picture of (a certain part or aspect) of society, or do other descriptions that exist (e.g. in books and newspapers) indicate that our data are irrelevant or fragmentary? • Are people using our statistics (judging from references to them that can be found), and who are using them? Are there needs for statistics that are not fulfilled at present? Is there a potential for using existing statistics more extensively and by new categories of users? 10.2 Help Funders Judge Whether They Get Value for Their Money • Identify and describe existing users and usages of the funded statistics. Estimate the total of those usages • Is there a potential for finding co-funders of (part of) the statistics produced? • Is there a potential for producing valuable new statistics at a low marginal cost by using existing data in new ways and in combination with other data?
11 Statistical Information Systems and Text Mining There are many reasons that have guided the NEMIS project to the decision of examining the specific needs of official statistics. To start with, we should keep in mind that official statistics are very much related to the efficient production and use of information. One of the main tasks is the collection of raw data (primitive information) and the transformation of these data into useful information, which is quite similar to what text mining does. Certainly, the input and output of statistical information systems are to a large extent quantitative data that are not the typical domain of text mining. However, it should be clear from the basic review of statistical production given above, that there are important areas where non- or semi-structured textual information is involved and consequently text mining could be applied. If you today put a question to a representative of a National Statistical Institute about the organisation’s use of text mining techniques, the answer will probably be either that they have no needs of text mining, or that they
must take an interest in text mining techniques in the near future but that they do not have any experience. It is clear from our current experience that research results and developed software products relating to text mining have not been sufficiently marketed to the users of statistics, who are not sufficiently informed about the technological progress made in this field, the available technological solutions, and the ways in which they can benefit from these. This has various causes, for example that text mining is a relatively new field and that recent developments have been so fast that there was not enough time to transfer the results to the user audience. Moreover, research activities and technological developments in Europe have been quite uncoordinated so far, and this has hampered successful market penetration. Having identified the above gap, one task of our project, and especially its Working Group 5, is to undertake actions that will enable:
• Clear identification and description of potential users of text mining in connection with statistics. The users will be described and properly categorised into user categories, each one having common needs and expectations.
• Identification and recording of existing (and to a large extent uncovered) user needs, with reference to the previously defined user typologies.
• Identification of the relations between existing technology (research results, commercially available products, prototypes of projects, etc.) and the previously identified user categories and – subsequently – requirements.
• Analysis of the possible (currently available) solutions.
• Definition of the needs that are not yet covered by available solutions.
To accomplish this, we have invited NSIs from many European countries to participate in this work and provide us with their guidance and relevant experience for exploring the potential of text mining tools in connection with statistics. Obviously, these actions are also dependent on the mapping of the current strengths and weaknesses of text mining and web mining methodology and corresponding software applications that is the focus of other activities in the NEMIS project.
12 Possible Uses of Text Mining for Official Statistics Below we offer some tentative examples of tasks in the realm of statistical information systems where text mining methods and applications may be considered. 12.1 Dissemination Text mining applications should have an important role to play in the dissemination of data, to allow for sophisticated querying of document warehouses of statistical reports and other unstructured data at websites presenting official statistics, i.e. natural language querying allowing for complex questions.
Another example of such applications could be a parallel analysis of repeated surveys (or other types of statistical products) and of the textual reports presenting the results.
12.2 Web Automatic Answering
Additionally, text mining could be used for automatic answering services for external questions (and other requests) sent to the NSI web site.
12.3 Open-Ended Questions in Questionnaires
In statistics most answers to questions are predefined and already coded (for ease of processing). However, open-ended questions with "free-text" answers are useful for
• Gathering general opinions that are not classified under the pre-coded questions.
• Suggesting dimensions missing among the coded questions and thus improving the "meta"-knowledge of the domain.
• Providing evidence that individuals do not understand the questions in the same way (which is often less evident from the answers to closed questions).
• Supplying unexpected information, for example the existence of different vocabularies for different categories of respondents, and answers different from the proposed pre-coded items.
Open-ended questions are also used in the framework of "test surveys" for unfamiliar domains, to define the categories that will be used in the definitive surveys. You may also, for example, organise discussion groups on the Internet to collect commentaries and opinions concerning a certain survey. It is obvious that the answers to open-ended questions may contain a great amount of valuable information and provide an important basis for decision-making. However, in many cases this information remains unexploited, since coding and other types of processing of these answers may be difficult and time-consuming. The interpretation of these responses could be supported by the appropriate use of text mining applications, for example to extract a list of keywords (semantically important lemmas) for further analysis. The text mining analysis could also be complemented by traditional data mining techniques to jointly analyse the closed and the open-ended questions.
12.4 Automatic Coding
A related field is automatic coding, that is, automatic classification of non-structured answers according to some existing code list. Examples of such "quasi-closed" questions answered using free terms are birth place, nationality, residence or activity places, and occupation for individuals,
and the activity and activity places of companies and other organisations. In many such cases, the methods of automatic coding employed today do not always give good results, so an element of manual verification is still necessary. More sophisticated techniques based on text mining may bring an improvement.
12.5 Record Linkage
A special form of automatic coding concerns the matching of registers. One may need to recognize that two registers, in the same file or in different files, correspond to the same individual. In cases where not all the objects (e.g. persons) have a non-duplicate identity number, there are matching problems. For example, you may have to use surname and given name to match the individuals, with all the problems that come from the lack of standardization and from spelling errors.
12.6 Data Capture
Up to now, producers of statistics have collected data mainly from surveys or via administrative registers. By exploiting text mining solutions, it might be possible for statistical offices to supplement or supplant traditional forms of data collection (like surveys or administrative registers) with data sources for text extraction and collection which have not been used so far. In some cases they are even obliged to preferably use available information before contemplating new direct surveys. Loads of potentially useful data are "hidden" inside archives, document collections, and other similar sources. Without text mining solutions it has not been possible to use and exploit these sources efficiently, and consequently a lot of data remain unused. For example, in most cases, when statistical data and metadata exchange is concerned, either the data are filled in by the sending institution on a form (e.g. electronic questionnaires) and sent to the receiving institution, or the sending institution prepares and sends a data file to another institution (file transfer). An alternative model, using web-mining techniques, would be for the receiving institution to "search" (with this search being text/metadata based) the web sites of one or more potential sending institutions and to "pull" the data and metadata it needs (probably using XML intermediate technologies). Web-mining tools and applications for discovering updates and for the personalisation of information in statistical databases and in publishing systems are other examples of such solutions.
12.7 Knowledge Management
In addition to this, it should be noted that text mining could offer methods and solutions for assisting knowledge management and competitive intelligence. There is an increasing demand emanating from the latest changes
(globalisation, the New Economy, etc.) to create and support the knowledge-based society; for example, EUROSTAT has expressed its interest in developing and supporting knowledge management systems to be used by the National Statistical Institutes. Statistical offices face a lot of "new" user requirements for better availability, searchability, interpretability, etc., of the statistical data and metadata they disseminate. They are also asked quite often to assist governments in making public data more available, in order to improve the service and openness of information from governments and their agencies to the citizens. We expect the NEMIS project to contribute to the achievement of these objectives.
12.8 Metadata
Text mining technology should be applicable to non- or semi-structured sources of statistical metadata. This is the nature of many of the exogenous sources of statistical metadata referred to above, like books and articles about how to design statistical systems and subsystems, processes, and components of such systems, as well as books and manuals on software tools and metadata holdings supporting statistics production. There are also reports on evaluations and comparisons of statistical systems and other compilations of documented and systematised experiences from different types of statistical systems and processes, etc. – sometimes organized as "current best methods" and "current best practices". Likewise, the evaluation and maintenance of a local statistical system will require documentation of the present system, and other kinds of information, like special evaluation studies. These sources are of special interest to designers and developers of statistical production systems, as well as to evaluators and managers, who would often like to make so-called benchmarking comparisons. Producers need good documentation and quality declarations concerning the statistical data as such, as well as the processes, statistical and administrative, behind the data, in order to run the production systems and to convey knowledge about the systems to new staff members. However, advanced users – like researchers, analysts, and other statistical organisations – may also need this information about the processes and tools behind the outputs, for example the survey questionnaire referred to above, or the rules and procedures applied in the data preparation processes (data entry, coding, editing, etc.) that have decisive effects on the quality of output data. Other categories of users, like students, teachers and interested citizens, are well served by data being accompanied by good explanations and illustrations. Another type of global knowledge is information about the statistical outputs that are available, e.g. in the form of searchable overviews, catalogues, and indexes. Users of statistics are the obvious consumers of this type of metadata. Text mining and web-mining applications could ensure more sophisticated access to these kinds of internal and external sources of metainformation.
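Relating to the automatic-coding and record-linkage use cases in Sects. 12.4–12.5, the sketch below (invented names, a deliberately crude character-level similarity measure and an arbitrary threshold) shows why purely exact matching fails on spelling variants and how an approximate comparison can help; production systems would rely on considerably more elaborate techniques.

```python
from difflib import SequenceMatcher

# Registers from two hypothetical files, without a shared identity number.
file_a = [{"surname": "Johansson", "given": "Karin"}]
file_b = [{"surname": "Johanson", "given": "Karin"},   # spelling variant
          {"surname": "Jonsson", "given": "Karl"}]

def similarity(r1, r2):
    """Average character-level similarity of surname and given name."""
    s = SequenceMatcher(None, r1["surname"].lower(), r2["surname"].lower()).ratio()
    g = SequenceMatcher(None, r1["given"].lower(), r2["given"].lower()).ratio()
    return (s + g) / 2

for a in file_a:
    best = max(file_b, key=lambda b: similarity(a, b))
    if similarity(a, best) > 0.85:   # threshold chosen arbitrarily here
        print("Probable match:", a, "<->", best)
```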
Fig. 1. A major task of the NEMIS project: to match the needs of official statistics (stakeholders, processes, use cases) with the potentials of text mining methods, tools, and work practices
13 Conclusion

Figure 1 summarizes the task of the NEMIS project discussed in this paper: matching the needs of official statistics with the potentials of text mining methods, tools, and work-practices.
Combining Text Mining and Information Retrieval Techniques for Enhanced Access to Statistical Data on the Web: A Preliminary Report

Martin Rajman¹ and Martin Vesely¹,²

¹ École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland {Martin.Rajman, Martin.Vesely}@epfl.ch
² CERN, Conseil Européen pour la Recherche Nucléaire
[email protected]

Abstract. In this contribution, we present the StatSearch prototype, a search engine that enables enhanced access to domain-specific data available on the Web. The StatSearch engine proposes a hybrid search interface combining query-based search with automated navigation through a tree-like hierarchical structure. The goal of such an interface is to allow a more natural and intuitive control over the information access process, thus improving the speed and quality of the access to information. An algorithm for automated navigation is proposed that requires natural language pre-processing of the documents, including language identification, tokenization, Part-of-Speech (PoS) tagging, lemmatization and entity extraction. Structural transformation of the available data collection is also performed to reorganize the nodes in the information space (the Web site) from a graph into a tree-like hierarchical structure. This structural pre-processing (transformation of a graph structure into a tree-like hierarchy) can be done either by document clustering or, alternatively, derived from the existing structure of the document collection by splitting, shifting, or merging nodes where necessary. The clustering approach is more straightforward but requires that the intermediate nodes in the created tree be assigned understandable descriptions, which is a difficult task. Target documents are represented by weighted lexical profiles the components of which correspond to triples of the form (surface form, lemma, PoS). The extracted and normalized terms and entities are weighted using the TF.IDF weighting scheme. Document relevance is computed as the textual similarity between the query and document profiles. Several well-known similarity functions from the field of information retrieval have been tested, including the Cosine and Okapi BM25 similarity measures. In addition to the similarity score, the contributions of all the query terms to the computed document similarities are also provided. The principle of the presented algorithm for automated navigation is to compute a score distribution on the documents (leaves of the tree), and to propagate the obtained scores upwards in the tree. The node scores are then used to guide a
faster, partially automatic, downward navigation in the tree. In particular, user intervention for node selection is only required for nodes with children corresponding to a score distribution in which no clearly good candidate can be identified. Otherwise, the (possibly partial) traversal of the tree is performed automatically. Several approaches are compared for the automation of the navigation. They include decision rules based on relative (resp. absolute) minimum best score differences, as well as on information theoretic measures. The automated navigation algorithm also allows a more reliable document ranking by giving the user the possibility to restrict the search to the set of documents dominated by a specific node or to the documents matching a limited set of document types. The presented hybrid search technique has been implemented in the StatSearch prototype that has been realized in collaboration between EPFL, Statistics Sweden (SCB), and CERN, in the framework of the NEMIS network of excellence. The prototype focuses on the domain of official statistics, and currently uses a database of over 5000 full-text documents, tables and graphs in English accessible at the SCB Web site.
1 Introduction

Information access is understood as a process of identification and presentation of information that corresponds to a user information need expressed by a user query. Since both user queries and document representations are typically based on sentences in natural language, in order to improve the performance of information access we focused on a combination of text mining (TM) and information retrieval (IR) techniques. In order to demonstrate our approach we have carried out a case study on Web access to information in the domain of statistics. We have developed a prototype of an information access tool integrating querying and navigation. The StatSearch prototype has been implemented and tested on statistical documents provided by Statistics Sweden (SCB), amounting to over 5000 items in English. The SCB Web site can be accessed effectively through this parallel interface, which integrates information about the Web site structure that is not necessarily known to users.
2 Access to Domain Specific Data

Domain specific data introduce several issues to be addressed. Among these the most important ones are (i) the use of domain specific vocabulary, (ii) existing specific metadata, (iii) various information presentations and (iv) various user backgrounds that range from inexperienced users to highly experienced ones represented by domain specialists and experts [3]. The domain specific vocabulary of statistics has been the subject of several research initiatives. Several relevant resources might be mentioned as a base for
document processing of statistical data. Namely one should cite the ISI Glossary of statistical terms (http://europa.eu.int/comm/eurostat/research/), Statistical Data and Metadata eXchange (SDMX, http://www.sdmx.org/) and the Metadata Common Vocabulary (MCV). Statistical information is provided to the end user in a document in a predefined form. According to this presentation form we distinguished several document types, most importantly statistical messages (publications), press releases, statistical database forms, domain portals and individual tables or charts. Domain specific information needs can be categorized and the most frequently solved tasks can be identified. These tasks are very often triggered by expected or unexpected events, for example documents published by authorities, elections, political declarations, etc. Typical users of statistical information could also be profiled. In order to reveal the particularities of the domain of statistics we have undertaken a few interviews with the domain specialists at SCB, focusing on the domains of National Accounts, Citizen Influence, Labour Market, and Prices and Consumption. In summary, we have gathered the following information relevant for our case study:
• Information at SCB is usually requested via the Web; other options include requests by e-mail or by phone.
• Typical user profiles correspond to users from academia and research (ca. 40%), users with a business background (ca. 35%), journalists and students.
• The majority of requests suffer from ambiguities and often have to be refined by interaction with the user. The communication often switches to a personal mailbox.
• Requests relevant to SCB are redistributed internally to competent specialists; other requests are re-directed to partner institutions. SCB collaborates with another 25 governmental organizations that are entitled to provide official statistics in Sweden (for the full list see Appendix A). Other organizations may potentially be contacted; these include banks and research institutes.
• 90% of queries/documents relate to Swedish national statistics; the remaining 10% relate to international statistics.
3 StatSearch Prototype

Access to domain specific information is demonstrated on a prototype that we implemented with regard to the mentioned requirements of domain specific data. This prototype can be regarded as an NLP-based search engine focusing on an efficient combination of the querying and navigation features of the information access process. The prototype builds upon the work presented
in [2], focusing on document processing, textual similarity computation, automated navigation and user interface optimization compliant with human-computer interaction principles.

3.1 Document Processing

Documents are pre-processed using several NLP techniques in order to obtain semantically coherent document representations, and the features included in these representations are then indexed. Since we worked with Web-based documents, we focused on information extraction from HTML files. Feature selection is done in compliance with the following criteria: (i) the feature appears in a relevant HTML tag, (ii) the value of the tf.idf weight is sufficiently high and (iii) the feature belongs to a semantically relevant morphological category. Both the original and the lemmatized forms of words are used to build the document profile as the document representation. Canonical forms of features were obtained by morphological normalization, language identification and data cleaning.

Morphological Normalization

In order to filter out features with a semantically irrelevant PoS with regard to the document representation purpose, we kept only words that belong to the morpho-syntactic categories of adjective and substantive (we used the FreeLing tagger from the UPC of Barcelona, http://www.lsi.upc.es/~nlp/freeling/, and the sylex tagger, http://issun17.unige.ch/sylex/intro.html). We also tried to identify word compounds based on their co-occurrences; these are then treated both as one feature and as separate features of the individual words. The extracted vocabulary amounts to 2346 non-canonical content bearing words extracted from titles and an additional 5574 non-canonical content bearing words extracted from the rest of the documents.

Metadata Extraction

According to the principle that content related features should be separated from the ones related to the nature of the documents, selected metadata such as document type or time/space relevance was extracted. Extracted metadata are then used as filtering features during the information access process. Filtering can be tuned so that documents are either excluded from the result set or shifted in the displayed document rank.

Language Identification

At the feature extraction step we encountered the need for language identification at the term level. Indeed, documents presented on Web pages
often contain multi-lingual content, and since both lemmatization and PoS tagging are language dependent tasks, a language identification step was required. Language identification is done for individual terms based on the trigram technique developed previously at Rank Xerox Research Centre (RXRC) France, as described by [4]. Since we identify the language only for individual terms we have only analyzed trigrams composed of alphanumeric characters, i.e. omitting spaces and punctuation characters. The initial language set contained English and Swedish, which were the languages that we needed to distinguish between; the corresponding frequency tables were derived from a text corpus based on the Electronic Text Center Collections. The Electronic Text Center's holdings include approximately 70,000 on- and off-line humanities texts in thirteen languages, with more than 350,000 related images (book illustrations, covers, manuscripts, newspaper pages, page images of Special Collections books, museum objects, etc.). For our purposes we selected and analyzed ca. 3 million English words (http://etext.lib.virginia.edu/). Frequency tables were optimized by selection of characteristic (most frequent) trigrams. As pointed out, in our case we only distinguished between two languages, English and Swedish. The correlation of the most frequent trigrams in the corpus with the word trigrams was computed as Cramér's Phi (V) coefficient based on the chi-square statistic.

3.2 Textual Similarity Computation

Textual similarity computation is based on coefficients that are traditionally used in the field of IR. We considered a variety of similarity measures, namely the Cosine similarity measure as well as the Jaccard, modified Jaccard and Dice coefficients. The Cosine similarity is currently used as the principal measure:

sim(q, D) = Σi (Wiq × WiD) / ( √(Σi Wiq²) × √(Σi WiD²) )

We have used boolean weighting of document and query vectors. In this scheme the weighting is applied as an additional feature selection filter, removing features that do not score well enough in terms of the tf.idf measure:

wiD = (tfi / tfmax) × log(N / di)

where N is the total number of documents and di the number of documents containing feature i. The feature selection can either be done so that the N best features are kept or, since wi is normalized, by setting a tf.idf threshold. We also started to experiment with various weighting schemes that could increase the feature selection performance, such as the Okapi BM25 weighting [5] and its variants.
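To make the weighting and matching step concrete, the following minimal Python sketch (not part of the StatSearch implementation; the threshold value and the set-based boolean profiles are assumptions of the example) filters features by their normalised tf.idf weight and compares the resulting boolean profiles with the Cosine measure.

import math
from collections import Counter

def boolean_profile(tokens, doc_freq, n_docs, threshold=0.1):
    """Keep only features whose normalised tf.idf weight reaches the threshold."""
    if not tokens:
        return set()
    tf = Counter(tokens)
    tf_max = max(tf.values())
    profile = set()
    for term, freq in tf.items():
        w = (freq / tf_max) * math.log(n_docs / doc_freq.get(term, 1))
        if w >= threshold:
            profile.add(term)
    return profile

def cosine(profile_q, profile_d):
    """Cosine similarity between two boolean profiles represented as sets."""
    if not profile_q or not profile_d:
        return 0.0
    return len(profile_q & profile_d) / (math.sqrt(len(profile_q)) * math.sqrt(len(profile_d)))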
Keyword Relevance

Individual keywords in the user query and the document profile have different contributions to the computed query-document similarity. That is, we observe how much the computed query-document similarity changes when a particular keyword is left out of the query. More precisely, we compute the relevance of a query keyword to a document or a document category, where the document category profile may be composed as the sum of all underlying document profiles or set equal to the most representative underlying document profile. The contribution to the similarity may be negative when the query keyword does not appear in the document at all, lowering the computed similarity. (For example, with q = {1, 1, 1} and D = {1, 1, 1} we have sim(q, D) = 1.00, where we refer to the Cosine similarity unless specified otherwise, whereas adding an irrelevant keyword to the query, q = {1, 1, 1, 1} against D = {1, 1, 1, 0}, gives sim(q, D) = 0.87. The contribution of the added keyword k to the similarity is then −0.13, and its relevance KR(k) is therefore 0.) Negative contributions of a keyword to the similarity are understood as zero keyword relevance. Positive contributions of a keyword to the similarity are then normalized. We have experimented with the following approaches to computing the contribution of individual keywords: (i) relative contribution of the keyword to the similarity, (ii) absolute contribution of the keyword to the similarity, (iii) similarity of the keyword to the document. We used the relative contribution of the keyword to the similarity, which has been computed as follows:

KR(k) = 0, if sim(q, D) < sim(q\{k}, D)
KR(k) = 1 − sim(q\{k}, D) / sim(q, D), otherwise

Alternatively, the similarity of the keyword to the document could be used directly, providing a more discriminative measure:

KR(k) = sim({k}, D)

In a simple example, the contributions of each of the relevant keywords in q = {1, 1, 1, 0} to the Cosine similarity with the document represented by D = {1, 1, 1, 1} are the following:

KR                                      KR(k)
Relative keyword relevance              0.23
Similarity of keyword to document       0.58
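A minimal sketch of the relative keyword relevance computation described above, reusing the set-based cosine() helper from the previous listing; representing the query as a set of keywords is an assumption of the example, not of the original system.

def keyword_relevance(query, doc_profile, k):
    """Relative contribution of keyword k to the query-document Cosine similarity."""
    full = cosine(query, doc_profile)
    reduced = cosine(query - {k}, doc_profile)
    if full == 0.0 or full < reduced:
        return 0.0  # negative contributions count as zero relevance
    return 1.0 - reduced / full

def keyword_similarity(doc_profile, k):
    """Alternative, more discriminative measure: similarity of the keyword alone."""
    return cosine({k}, doc_profile)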
3.3 Automated Navigation

Automated Navigation Algorithm

The principle of the automated navigation algorithm is based on the modified Input/Output interpreter described in [6]. It computes a score distribution on the targets in the tree, allowing the most relevant node corresponding to the user query to be identified anywhere in the hierarchical structure. This approach makes it possible to skip several levels of the hierarchy and to navigate faster, based on a given arbitrary threshold. At crucial nodes, user assistance is required to confirm the path to follow interactively. The minimum best score difference and an information theoretic approach based on information entropy have been suggested as criteria for the decision on automated navigation.

Minimum Best Score Difference

The minimum best score difference is a mathematical formulation of the rule that automated selection should be triggered if one of the scores of the nodes in the selection is "good enough" and if this score is "substantially better" than the other ones. Let s1 (resp. s2) be the best (resp. second best) score of a node in the selection and dmin be the minimum best score difference required. The node associated with s1 is automatically selected if:

s1 ≥ smin and 1 − s2/s1 ≥ dmin
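The relative rule above can be written in a few lines of Python; the sketch below is an illustration only, with the smin and dmin defaults taken from the ranges reported later in this section.

def auto_select(scores, s_min=0.2, d_min=0.5):
    """Index of the automatically selected node, or None when the user must decide
    (relative minimum best score difference rule)."""
    if not scores:
        return None
    if len(scores) == 1:
        return 0 if scores[0] >= s_min else None
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    s1, s2 = scores[ranked[0]], scores[ranked[1]]
    if s1 >= s_min and (1.0 - s2 / s1) >= d_min:
        return ranked[0]
    return None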
Alternatively, we have also experimented with the absolute best score difference, which allows a more conservative approach to automated navigation. In particular, for low values of s1 (particularly when s1 < dmin) this approach may be preferred. The rule of the absolute minimum best score difference then requires that s1 − s2 ≥ dmin holds. The values of smin and dmin are selected arbitrarily; we have achieved good results by working with smin ∈ [0.1, 0.25] and dmin ∈ [0.33, 0.67].

Information Theoretic Approach

In this approach not only the two best scores are compared, but the scores of all nodes are taken into consideration for the decision whether automated navigation will take place or not. The information entropy is calculated based on the probabilities derived from the obtained scores. The probabilities pi are currently derived from the similarities si as follows:

pi = si / Σj sj
The entropy is then computed and normalized based on these probabilities:

H = − Σi pi × ln(pi) / ln(N)

where N is the number of nodes in the selection. The automated selection then takes place when the entropy H does not exceed a specified threshold k, i.e. when the uncertainty of making a good automated choice is not too high. We have achieved good results by working with a threshold k ∈ (0.85, 0.95). When si = 0 the node is ignored and does not enter the computation. This rule may be extended to si < smin, as introduced in the previous section.

Creation of Hierarchical Structure

In order to allow automated navigation, the creation of a coherent hierarchical structure is necessary. The hierarchical structure can be created for example by clustering or categorization of the extracted data items; however, in this case human-readable descriptions for the intermediate nodes would have to be created, which is a difficult task. We have therefore opted for mirroring the existing structure of the SCB Web site, transforming the graph-like structure into a tree-like hierarchy. The SCB sitemap (http://www.scb.se/templates/SiteMap 2711.asp) has been analyzed and became the basis for the website structure extraction step. The 23 subject domains have allowed us to produce quite a broad data set with regard to the subject categories covered. As far as content is concerned, domains are clearly identified, whereas sub-domains needed to be extracted from the domain related documents (sub-domains are not explicitly present in the web site structure), and from the output format of the access forms to the publication database, the press archive, and the statistical database. In order to allow consistent graph-to-hierarchy conversion, the identification of a key attribute as a base for the composition of the tree-like hierarchical structure is necessary. Since our target was to address domain specific data, the choice for this attribute was a content-related subject. Alternatively, in case there would be more than one key attribute associated with the data domain, several hierarchical structures (multiple trees) could be created in parallel. Other attributes, such as document types and time/space relevance as described in the previous section, were regarded as metadata allowing the document filtering functionality. In the prototype we also tried to keep documents with the same granularity on the same level. Ideally the tree would be binary and deep. This is unfortunately not realistic with regard to the nature of the data, and even with our attempt to keep the tree as close as possible to this vision we arrived at 5 levels with an average branch factor of 9.05. It is perhaps interesting to also mention the average branch factors for the individual levels (values have been adjusted with respect to leaves occurring at various levels):
Level        Average Branch Factor    Description
1            23                       Root
2            7.17                     Subject domain
3            8.45                     Subject sub-domain
4            9.29                     Portal or Statistical message
5            N/A                      Statistical document or item
all levels   9.05
Backward Navigation

Navigation proceeds from the tree root to the most relevant leaf node or set of leaves. However, there is a possibility that the user will "get lost" through an insufficiently precise formulation of a query or by selecting a wrong node within the navigation process. In such cases it would be useful to be able to automatically go back up in the tree when the system recognizes significant differences in the relevance scores. In order to implement this idea two possibilities were considered: (i) changing the sub-tree and (ii) alternative selections. In order to allow this functionality a two-level similarity computation was applied: first, on the level of the local sub-tree related to the current position in the hierarchical structure; second, the global similarity for all existing nodes outside the current sub-tree. If the similarity of the refined query to some node in the global structure proves to be significantly better than that to the best local node, the navigation will allow a path correction by alternatively considering the other path to be explored in parallel or instead.

Changing the Sub-Tree

A more pro-active approach is to automatically decide, in place of the user, to opt for an alternative suggestion, anticipating significantly better results in the subsequent information access steps. The best score of the whole tree is compared with the best score of the current sub-tree. In case the two scores differ, a change of the current position in the structure is considered. Again, as is the case for the automated selection algorithm described previously, we compare the difference of the two scores with an arbitrarily selected threshold. Note that this option may cause frequent jumps in the tree that may be perceived by users as inappropriate behavior of the system. We considered a usage of this approach in which the search history is included in the computation. That is, the jump is only triggered when the similarity of the global tree with the conjunction of the queries in the search history, appropriately weighted retrospectively, is significantly better than the one computed on the local sub-tree.

Alternative Selections

A more conservative implementation of this functionality is to only suggest the better scoring nodes as an alternative selection. This way the user interface
is more coherent, in that the user is not faced with unexpected changes of position in the structure.
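The paper leaves the exact sub-tree change test open (only an arbitrarily selected threshold is mentioned), so the following sketch is one possible, hypothetical formulation: it compares the best global score with the best score inside the current sub-tree and suggests a jump only when the relative difference is large.

def should_change_subtree(global_scores, local_scores, jump_threshold=0.5):
    """Suggest leaving the current sub-tree when the best node elsewhere in the tree
    scores markedly better than the best node of the local sub-tree."""
    best_global = max(global_scores, default=0.0)
    best_local = max(local_scores, default=0.0)
    if best_global == 0.0:
        return False
    return (best_global - best_local) / best_global >= jump_threshold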
4 Conclusion

The developed StatSearch prototype is currently undergoing a user-based evaluation. The evaluation methodology was jointly set up by Statistics Sweden (SCB), EPFL, and CERN. The concrete evaluation experiments took place at SCB in early 2005 and will be fully reported in [1].
References

1. Ing-Mari Boynton, Bert Fridlund, Alf Fyhrlund, Peter Lundquist, Bo Sundgren, Martin Rajman, Martin Vesely, Helge Thelander, and Martin Wänerskär. Evaluating a system for enhanced access to statistical data on the web: the StatSearch evaluation experiments. To appear in the Proc. of the HCI 2005 international conference, Las Vegas, USA, 2005.
2. Forler E. Intelligent user interface for specialized web sites. Master thesis, École Polytechnique Fédérale de Lausanne, 2000.
3. Fridlund Bert, Fyhrlund Alf and Sundgren Bo. Using text mining in official statistics. COMPSTAT 2004, 16th Symposium of IASC, Prague, August 23–27, 2004.
4. Grefenstette G. Comparing two language identification schemes. JADT'95, 3rd International Conference on Statistical Analysis of Textual Data, Rome, Italy, 1995.
5. Robertson S.E. and Spärck Jones K. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3):129–146, 1976.
6. Guhl U. Entwicklung und Implementierung eines UNIX-Assistenten. PhD thesis, Rheinisch-Westfälische Technische Hochschule Aachen, 1995.
Comparative Study of Text Mining Tools

Antoine Spinakis and Asanoula Chatzimakri

QUANTOS SARL, 154 Sygrou Av., 176 71 Athens, Greece
[email protected]
Abstract. This paper presents the overall process and the basic conclusions of a comparison study of text mining tools carried out in the framework of the NEMIS project. The basic stages of the overall comparison process are described, together with the specified evaluation criteria. Finally, the main conclusions of the study are presented in the last section of the paper.
1 Introduction

It is widely accepted that there is a continuously increasing demand for more and more efficient Text Mining solutions, able to tackle a variety of different situations and problems. Nevertheless, there is an important gap between the needs of the users and what is available in the market. A comparison study, which was conducted within the framework of the NEMIS project, reviewed 14 Text Mining tools with respect to their technical and methodological characteristics. The overall comparison process was implemented in two basic phases, and finally conclusions about further work and development were derived. In the following sections a brief overview of the Text Mining field and the objectives of the comparison study are presented. Subsequently, the evaluation process and the basic conclusions, which are mainly related to missing features and further work within the framework of Text Mining, are presented.
2 Overview of the Text Mining Field & Objectives of the Comparison Study

Text mining is an interdisciplinary field involving information retrieval, text understanding, information extraction, clustering, categorization, visualization, database technology, machine learning and data mining. Text mining is a challenging task since it involves dealing with data that are unstructured and fuzzy. Through text mining there is the possibility to analyse and structure
large sets of documents applying statistical and/or computational linguistics technologies. Although Text Mining and Data Mining are related, as both are mining processes, they differ in the following respects:
1. Text mining deals with unstructured or semi-structured data, such as text found in articles, documents, etc. However, Data Mining is related to structured data from large databases. In addition, another characteristic of text mining is the amount of textual data. The concepts contained in a text are usually rather abstract and can hardly be modelled by using conventional knowledge representation structures.
2. Furthermore, the occurrence of synonyms (different words with the same meaning) or homonyms (words with the same spelling but with distinct meanings) makes it difficult to detect valid relationships between different parts of the text.
Text mining techniques enable the discovery and use of the implicit structure of the texts (e.g. grammar structure) and they usually integrate some specific Natural Language Processing (Corpus Linguistics). Text mining techniques can range from simple ones (e.g. arithmetic averages) through those of intermediate complexity (e.g. linear regression, clustering and decision trees) to highly complicated ones such as neural networks. Among the most important text mining tasks are document clustering and text summarization. The basic idea in clustering is that similar documents are grouped together to form clusters (a minimal clustering sketch is given at the end of this section). The same procedure can be followed for clustering terms instead of documents. Then, terms can be grouped to form classes of co-occurring terms. Co-occurring terms are usually relevant to each other. This grouping of terms is useful in automatic thesaurus construction and in dimensionality reduction. Automatic thesaurus construction is based on statistical criteria and thus it is conceptually identical with the document clustering methods. Several clustering techniques can be adopted, such as hierarchical clustering or k-means cluster analysis. Text summarization usually consists in producing summaries that contain not only sentences that are present in the document but also new automatically constructed phrases that are added to the summary to make it more intelligible. In other cases it is restricted to extracting the most relevant phrases from a document. Text Mining constitutes a process that is necessary in many fields of modern society. Some of those application areas are the following:
• Marketing: Discover distinct groups of potential buyers according to a user text-based profile
• Industry: Identifying groups of competitors' web pages
• Job seeking: Identify parameters in searching for jobs
• Market Analysis: e.g. a marketer gathers statistics on the occurrences of words, phrases, or themes that will be useful for estimating market demographics and demand curves.
• Customer relationship management (CRM): e.g. mining incoming emails for customers' complaints and feedback.
• Human resources (HR): e.g. mining a company's reports and correspondence for activities, status, and problems reported.
• Patent analysis & Technology Watch: e.g. analyzing patent databases for major technology players, trends, and opportunities.
• Information dissemination: e.g. organizing and summarizing trade news and reports for personalized information services.
The basic aim of the review was to provide detailed information about a variety of text mining tools and also to provide information about the types of support that each tool can offer to potential users. An additional objective was to identify areas of further work through the comparison of the selected tools at the level of technical and methodological characteristics.
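As a purely generic illustration of the document clustering task mentioned earlier in this section (none of the reviewed tools is implied to work exactly this way), a tf.idf representation combined with k-means can be sketched in a few lines of Python, assuming scikit-learn is available and that documents is a list of raw texts.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = []  # placeholder: supply at least n_clusters raw document texts

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)        # documents as tf.idf vectors

kmeans = KMeans(n_clusters=5, random_state=0)  # number of clusters chosen arbitrarily
labels = kmeans.fit_predict(X)

# Most characteristic terms per cluster, useful for labelling the groups.
terms = vectorizer.get_feature_names_out()
for c in range(kmeans.n_clusters):
    top = kmeans.cluster_centers_[c].argsort()[::-1][:10]
    print(c, [terms[i] for i in top])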
3 Description of the Evaluation Process

As mentioned in the Introduction, the comparison process was composed of two phases. The basic stages of this comparison study are illustrated in Fig. 1. The 1st phase of the comparison study concerned the preparation of the whole process and was composed of three stages. Firstly, the text mining tools to be compared were selected according to particular criteria. Then a brief description of those tools was presented. The Text Mining tools that were selected for the comparison were the following:
1. ALCESTE
2. ATLAS.ti
3. Hyperbase
4. IBM Intelligent Miner for Text
5. Intex
6. Lexico
7. NUD*IST
8. SAS Text Miner
9. Spad
10. Sphinx Lexica
11. SPSS
12. STING
13. Technology Watch from IBM
14. Temis on Line Miner
The last stage of the 1st Phase of the Comparison Study was related to the construction of the comparison criteria. The comparison of Text Mining tools
Fig. 1. Overview of Comparison Process
was performed on the basis of six criteria. Each one of those criteria is related to particular aspects of Text Mining tools. Moreover, the comparison was conducted at various levels that examine the software not only strictly from a theoretical or methodological point of view but also from a wider perspective, such as its ability to provide a variety of functionalities. Technical characteristics were also taken into consideration so as to have a more complete view of the Text Mining tools at the technical, methodological and functional levels. In more detail, the categories of criteria that were applied in the comparison process are the following:
1. Technical Characteristics
2. Data Management Process
3. Methodologies used for Text Mining
4. Output Model
5. Visualization Methods
6. Automation
3.1 Technical Characteristics

The technical characteristics are related to parameters such as:
• Operational Environment of the system
• Data sources that provide the Text Mining tool with data
• Web Accessibility of the system.
The operational environment constitutes a basic characteristic for all types of software. Hence, the ability of Text Mining tools to run under various operating systems increases their adaptability. In addition, the data sources for each text mining tool can be considered as one of the basic parameters that provide an overall view of the system's identity and area of application. Furthermore, the variety and the type of data sources determine the flexibility of the text mining tools to process various types of textual data. Finally, web accessibility is also considered an important technical criterion. Web access to a text mining tool can be beneficial in terms of data retrieval and collaborative tasks, and also for end users who might have the opportunity to benefit from text mining analysis through the web.

3.2 Data Management Process

Within the area of Text Mining, data management is a process of major importance. The management of textual data so as to make it appropriate for mining is a process applied at various levels. Within the framework of the comparison, the text mining tools are examined against 5 Data Management Process criteria, which are the following:
• Data Types for Import
• Linguistic Processing/Text Coding
• Creation of Vocabulary
• Sorting and Filtering
• Multilingual processing of data
The criterion Data Types for Import is related to the type of textual data that can be imported into the text mining tools. Linguistic Processing constitutes a basic component within Text Mining, since it cleans the data from irrelevant characters, performs lemmatization and identifies the morpho-syntactic categories of the words. The Vocabulary supports the analysis and the mining process on textual data that are pre-selected according to various criteria. In addition, Sorting & Filtering enable the flexible manipulation of data. Finally, another comparison criterion is the ability of the Text Mining tools to support textual data in various languages.

3.3 Methodologies Used for Text Mining

The third category of comparison is related to the methodologies that text-mining tools apply during the textual analysis and mining process. Although a variety of methods can be applied, we have predetermined 5 individual categories of methodologies. Those methods vary from simple statistics to advanced multidimensional techniques. Apart from the predefined methodologies, we recorded additional methods that the selected tools applied so as to have a complete view of their capabilities. The methodologies that form the criteria within the "Methodologies used for text mining" category are the following:
• Word or Segment Analysis
• Descriptive Statistics
• Correspondence Analysis
• Factor Analysis
• Cluster Analysis
• Other (not predefined methodologies)
3.4 Output Model

An additional comparison category refers to the output models that text mining tools produce. The term "output model" refers to the different ways in which results are presented at any stage of the analysis or exploration. The basic levels for the output model are Tables and Reports. However, special features of the tools in relation to the output model have also been recorded and displayed in the list of output model criteria.

3.5 Visualization Methods

The selected Text Mining tools are also compared in relation to the visualization options they provide. The term "Visualization" concerns the construction of a visible presentation of numerical data, particularly a graphical one. This might include anything from a simple X-Y graph of one dependent variable against one independent variable to a virtual reality with animation techniques. There are four basic criteria for visualization, plus a category where special graphs of each text-mining tool are referenced. Within "Visualization methods" exist the following categories: Pie Charts, Line Plots, Histograms, Bar Charts, Other (special graphs of each text mining tool).

3.6 Automation

The last category of comparison criteria is "Automation" and concerns functionalities of the text mining tools that are executed automatically or functions that support the automated execution of particular actions. Predefined criteria for this category have not been determined. For each text-mining tool that fulfilled the characteristic of automation, the functionalities that support or constitute automated processes were recorded. The 2nd phase was related to the actual comparison process. For each comparison category the characteristics of the text mining tools were presented. In addition, the text mining tools were compared according to their characteristics within each one of the six categories.
4 Comparison’s Study Basic Conclusions and Issues for Future Investigation Within the scientific area of Text Mining the users have in their disposal a considerable variety of tools. Each one of these tools supports text mining to a greater or lesser extend, by providing means for managing textual data, exploring the behavior of particular words or segments, applying simple or advanced analysis methods and finally presenting the information in reports, tables or illustrating that in graphs. The fact that there is a considerable large amount of text mining tools might lead to the conclusion that any user interested in text mining has an unlimited number of choices of text mining tools. However, there are still text mining tools, which have to be developed further or methodological approaches that have to be integrated into text mining systems. In general, the reviews that have been conducted so far indicate that a single text mining tool alone supports only some of the operations required for a particular text mining project. Unfortunately, there is still lack in the integration of text mining tools in point of functionalities such as text import, data management, exploration, analysis and presentation of results. The lack is mainly identified in the combination of functionalities through out the whole process of text mining process. The discussion about commonalities or differences among text mining tools is organized according to the following topics: Technical Characteristics, Data Management operations, Methodological approaches, Output models, Visualization techniques and automated methods. 4.1 Technical Characteristics Although one might think that text mining process is mainly related to data management, exploration, analysis and results presentation, there is also a basic component of that process. That component is the technical environment under which software operate. The operational systems that the compared text mining tools require do not differ from tool to tool. The dominant operational system is Windows 95 or 98 and later versions. However, there is no homogenizing in data sources from which text mining tools import textual data. Some tools are specialized in particular data such as Sting that applies patent data analysis and accepts data from ESPACE Databases. Other tools are specialized in analyzing data such as open ended responses, interviews or any other types of texts. Those tools (Spss, SAS Text Miner or Hyperbase) cannot be used for special types of textual data that have a particular structure. It is possible for these tools to manage special forms of data once, various management or transformation operations are firstly applied. In addition, the lack of web based capabilities for the majority of the selected text mining tools is obvious. Only three tools (IBM Intelligent Miner for Text, SAS Text Miner and Temis on Line Miner) exploit the advantages of web capabilities. The
development of web based text mining tools constitutes an issue for further development within the area of text mining.

4.2 Data Management Operations

It is commonly accepted that the data management process is one of the basic stages within any advanced or simple analysis process. The behavior of the compared text mining tools is satisfactory in relation to data management operations, although there are still limitations for some systems. For example, tools that apply patent data analysis do not provide mechanisms for the import of various textual data formats. Technology Watch from IBM and Sting require particular textual data with a specific structure. Other software packages, such as Temis on Line Miner, do not seem to have filtering or sorting capabilities, although they have advanced functionalities at other levels, such as the import of multilingual textual data. Linguistic Processing is applied by the majority of the text mining tools to a satisfactory degree. What differs among the compared systems seems to be the usage of dictionaries. Some tools use dictionaries as support tools in linguistic processing, whereas other tools, such as Spad and Sting, support the generation of vocabularies that are only composed of the words or segments for analysis. None of the listed tools provide both types of dictionaries. Finally, the issue of multilingual operations can be considered as an area of further development, since only 5 tools (Alceste, SAS Text Miner, IBM Intelligent Miner for Text, SPSS and Temis on Line Miner for Text) support the analysis of textual data that come from languages other than English or French.

4.3 Methodological Approaches

In relation to the methodological approaches there are no major variations among the text mining tools. Cluster analysis and simple statistics constitute the methods that are applied by the majority of the tools. Correspondence and Factor Analysis are also applied, but not to the same extent. However, other methodologies such as neural networks or decision trees do not appear as applied methodologies in the selected tools. Tools like Alceste, Sting and Lexico3 have a broader range of methodologies. In more detail, Alceste applies "Analysis Tri-coise" in order to cross the text with several identifiers. Sting uses the Bootstrap as a validation criterion for Correspondence Analysis. On the other hand, Lexico3 applies Textual Time Series Analysis and Spss applies Predictive Modeling. According to the above, one realizes that further study within the domain of methodological approaches in text mining and the search for alternative solutions might be necessary.

4.4 Output Models

Although each one of the selected text mining tools can be used for a different purpose within the area of text mining, their output models do not differ. The
only tools that provide an additional option in reporting are Alceste, Lexico3, Sting and Temis on Line Miner. The additional functionalities that the above mentioned tools provide are related firstly to the option of dynamic insertion of comments within the generated report and secondly to the fact that the users can send an object (table, graph) to the report at any stage of the analysis.

4.5 Visualization Techniques

Visualization constitutes an independent scientific domain that mainly responds to the users' need for support in data exploration. The listed text mining tools mainly support primitive visualization techniques such as bar charts, histograms, line plots and pie charts. In addition, Alceste provides the user with the option of animated graphs. Finally, cluster analysis diagrams and interactive cluster maps are available in the majority of the selected text mining tools. However, as has already been mentioned, Sting and Temis on Line Miner can be considered as the most pluralistic text mining tools in the provision of visualization options.

4.6 Automated Methods

In terms of the automated methods, the basic efforts seem to focus on the creation of automated import/export mechanisms and the creation of outputs mainly in html and excel formats. Atlas, Nudist, Spad, Sphinx, Spss and Sting are the tools that support automated functionalities for import or export and for the generation of output. However, Dictionary Modification, Log information and command line syntax are not supported to the same degree as the above mentioned functionalities. The only tools that provide automated processes for dictionaries are Sphinx and Alceste. History and Log are supported only by Spss, Spad & SAS Text Miner. Finally, opportunities for programming in terms of coding by the user are available in Nudist and Spss. Hence, although the construction of code and the execution of several procedures through programming is a common characteristic of many statistical packages, these capabilities seem to be still at an initial stage of development in text mining tools.
5 Conclusions

In general, the functionalities offered by each of the reviewed software packages may fully cover the needs of a given text analysis project. The implementation of a single software package that is suitable for all types of text mining projects does not emerge as a conclusion of the comparison study, nor do we propose the development of uniform operations for all text-mining
tools. Each system serves specific objectives and has its own identity. However, some development standards, such as for textual data management or export formats, might be useful to apply. In addition, the adoption of standard terminology among the existing text mining tools could be a step of further improvement. The basic aim of this comparison study has been the description of text mining tools in relation to standard technical and methodological functionalities, so as to provide information about the support that each tool can offer to users. In addition, the overall comparison contributed to the identification of areas for further work and development. Although an effort was made to record and compare the characteristics of a vast variety of text mining tools, we know that some additional tools should have been included in this review. Unfortunately, the basic reason for not including some systems in the study was the time limitation on the search for further information about those systems. We hope that omissions of this type will not occur in future reviews and comparison studies, mainly because we lose valuable information about tools that were developed under strong efforts by people with special interest and broad knowledge in the domain of text mining.
Some Industrial Applications of Text Mining

Bernd Drewes

SAS Institute, Heidelberg
[email protected]
Abstract. Three industrial applications of text mining will be presented, requiring different methodologies. The first application used a classification approach in order to filter documents relevant for personal profiles from an underlying document collection. The second application combines cluster analysis with statistical trend analysis in order to detect emerging issues in manufacturing. In the third application, a combination of static term indexing and dynamic singular value computation is used to drive similarity search in a large document collection. All of these approaches require a knowledgeable human to be part of the process; the goal is not automatic knowledge understanding but the use of text mining technology to enhance the productivity of existing business processes.
1 Profiling and Classification of Scientific Documents

Classification is a standard paradigm and can be applied to a document collection in order to match candidate texts to a personal or professional interest profile. A text-mining application for pharma customers has been built for the automated identification and ranking of scientific articles. This so-called "topic scoring engine" [2] is based on SAS Text Miner™. The topic scoring engine identifies documents with similar contents and creates search profiles which comply with the congruencies of the documents. This project supports the work of scientists in gathering information on highly specific research topics and will also allow users on a world-wide basis to access new scientific information much more quickly. In the past, scientists had to comb through many articles with highly technical subject material in order to occasionally find articles of interest. In one setting, on average 15–20 relevant new scientific papers were found per year for each topic and scientist through the existing manual review method. This relatively low number exemplifies the need for automating the classification of documents into categories. It is an increasingly important task: scientific document collections continue to grow at exponential rates, and the task of retrieving and classifying appropriate documents by hand can become unmanageable. In fact
it is becoming impossible now to follow a scientific field by manual methods alone. The topic scoring engine replaces keyword querying of bibliographic databases such as PUBMED with a structured automated process by means of "document based retrieval". This reduces research time while improving the quality of the results. The outstanding feature of the topic scoring engine is that it does not look for pre-defined vocabularies like a search engine. Instead, the tool uses many different types of singular value decompositions with different information measures and resolutions of the concepts underlying the text. From these, topic specific variable sets are generated using iterative applications of correlation analysis, variable clustering, and variable selection. Finally, classification rules are generated using regression analysis. Each of the classification models then effectively defines a search profile for its topic. These profiles are subsequently applied as filters to new publications (Fig. 1). This allows the user to seek publications matching these profiles
Fig. 1. User profiles are defined by providing a set of sample documents, called Motifs. These are used to train classification models which are run against new documents. Users are notified by E-Mail about newly available documents for their profile and may provide corrective feedback, leading to an improved sample set and an improved model. The profile definition process can be performed over the web
without having to submit complex queries. Furthermore, users can receive weekly or even daily updates about the relevant new publications and research topics. Thus scientific literature research is rendered much more convenient. The topic scoring engine also helps to overcome the barrier of false or mismatching keywords. This technology is changing the way new scientific articles of interest are identified. Search results from the Medline database are scored overnight for their likelihood to address one of the topics within the expertise of the research group. Search results are also stored in a content browser which allows easy identification of similar articles, as well as the identification of concepts which distinguish between articles. These results are provided to scientists each morning, substantially enhancing their productivity, while also reducing the costs of ordering full papers. In terms of "knowledge mining", implicit content keys (weighted linear combinations of terms) are used to deliver a targeted document set, rather than to attempt to extract and represent the knowledge itself. This can be applied to many different areas, filtering for a large number of users whatever their interest profile demands. Integrating the documents that belong to any one profile into a coherent knowledge structure, transcending the actual documents, would be a next step in the not so distant future. In summary, the main benefits of the Topic Scoring Engine (TSE) are as follows:
• It filters the thousands of new abstracts per day and retains only what is valuable to the database, i.e. documents falling within the defined topics
• TSE frees human resources from querying and speeds up literature queries
• TSE provides automatic information on rapid literature developments in any of the topic profiles
• The customer's scientific databases will better reflect the actual available literature in Pubmed/Medline, thus giving it a competitive advantage and attracting more scientists around the world
• The topic scoring engine goes far beyond keywords: even relevant abstracts of papers that have been assigned by accident to a wrong or misleading keyword will be found
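The actual engine is built on SAS Text Miner and its exact models are not spelled out in this paper; as a purely generic, hypothetical analogue of the pipeline described above (tf.idf features, dimensionality reduction by singular value decomposition, and a regression-based classification rule), one could sketch a profile classifier in Python as follows, where motif_docs, background_docs and new_abstracts are assumed input lists of texts.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Training data: motif documents define the profile, background documents are counter-examples.
texts = motif_docs + background_docs
labels = [1] * len(motif_docs) + [0] * len(background_docs)

profile_model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=100),      # assumes the vocabulary has more than 100 terms
    LogisticRegression(max_iter=1000),
)
profile_model.fit(texts, labels)

# Overnight scoring of a new batch; high-scoring abstracts would be reported to the scientist.
scores = profile_model.predict_proba(new_abstracts)[:, 1]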
2 Warranty and Early Warning

The timely detection of warranty defects has huge financial implications. Faulty products remaining on the market can cost the manufacturer many millions of dollars in warranty expenses and recall actions. Additional millions if not billions may be at stake through lawsuits, negative publicity and lost brand images. A case in point is the National Highway Traffic Safety Administration in the United States. Here, customer complaints regarding safety-related vehicle defects and crashes are collected and investigated. When a certain number of
complaints on particular issues have been received, the agency needs to investigate whether this is a significant trend due to some underlying product problem. Defects must be repaired free of charge to the customer, and the agency is empowered to ask, and if need be order, the manufacturer to conduct a recall if warranted by the problem. The data gathered are automobile-related information as well as descriptive text information on damages and accidents. Manufacturers are usually motivated to uncover potential problems as soon as possible, making this a candidate for automated problem solving support, in this case the application of data and text mining. SAS has created a warranty solution in which warranty data are integrated with key customer, product, manufacturing and geographic information [1]. This solution allows customers to identify questionable warranty claims (i.e. detect fraud), forecast warranty costs, and detect emerging issues quickly. Text mining can be of help in the latter endeavor and is used with a clustering paradigm. Customer complaints, repair notes, and posted web information are subjected to topical clustering and are statistically monitored for trend analysis. Alarms are generated if incident rates increase beyond a threshold and the increases are statistically significant. Tangible benefits have been produced for some customers, identifying serious issues substantially earlier than through regular business operations. Early warning systems are not just relevant for manufacturing. The analysis process can also be applied to such disparate areas as legislation and financial markets. When introducing new legislation, rulings or policies, areas requiring clarification can be identified from an increased rate of inquiries. An early warning system can pinpoint such areas, and enable a focusing of existing resources on clarification rulings, rather than using resources to settle all individual inquiries separately. Similarly, in financial markets, trends analyzed from business news can be incorporated in financial models, e.g. see [3]. In terms of knowledge mining, this approach is monitoring the quantitative behavior of knowledge in narrowly defined areas, as covered by the different issues in warranty management. The signal triggering business actions is not determined by a content interpretation of the collection of repair notes, but simply by a trend forecasting an area in need of attention. In addition to suitable business action following from attending to this area, new target areas of content may be defined by segmenting the area under consideration, thereby also defining highly business specific knowledge fields to be monitored. This is also an example where text mining has been integrated into an existing industry solution, which is likely to become one of the trends of the future.
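The concrete trend statistics used in the SAS warranty solution are not detailed in this paper; the sketch below only illustrates the general monitoring idea, flagging a topic cluster when the current period's incident count is significantly above its historical baseline, using a simple Poisson-style z-score as a stand-in for the real trend test.

import math

def emerging_issue_alarm(weekly_counts, current_count, z_threshold=3.0):
    """Raise an alarm for a topic cluster whose current incident count is
    significantly above the historical average (simple Poisson z-score)."""
    if not weekly_counts:
        return False
    baseline = sum(weekly_counts) / len(weekly_counts)
    if baseline == 0:
        return current_count > 0
    z = (current_count - baseline) / math.sqrt(baseline)
    return z >= z_threshold

# counts_by_cluster maps each topic cluster to (historical weekly counts, this week's count):
# alarms = {c for c, (hist, now) in counts_by_cluster.items() if emerging_issue_alarm(hist, now)}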
3 Patent Analysis

As a final text mining application, a currently emerging system for locating patents based on similarity searches will be briefly sketched. The main
challenge is the sheer size of the problem: many million documents, each of substantial length, may need to be searched in a patent database by users wishing to investigate patents similar to a given or intended one. The search will be based on query words identified and explicitly listed in each patent document, but will not be restricted to a keyword-based search approach. Instead, the query words will first be used for a clustering of patents, and the resulting clusters will each be processed by a "singular value decomposition", a statistical process of generating linear combinations of words suitable for discriminating between documents. As part of this process, word similarities are learned from common occurrences in the same document. The similarity between two words is computationally expressed by the Euclidean distance between their singular vectors. Likewise, similarity neighborhoods between documents can be computed as Euclidean distances between the singular vectors of the documents. A new patent is first matched to the appropriate cluster based on its query terms and is then "scored" by the text mining model for that particular cluster in order to impute its singular vectors. Using the neighborhood function, the patents most similar to it can be retrieved in ranked order and inspected. Intuitively, the clusters correspond to different industries or application domains of the patents. The clustering approach was chosen in order to overcome the otherwise prohibitively large memory and processing requirements that would be incurred by a patent collection of many million documents. A similar approach can also be applied to other large document collections.

In terms of knowledge mining this is a modest step in the patent area, dealing mostly with managing the complexity introduced by the size of the data collection. The system presupposes a user specifying the kind of knowledge that is supposed to be found (essentially providing a patent description), rather than the system advising the user about its contents, which would be a large task in any case. Identifying areas of promise, i.e. those with a low concentration of patents, could be one non-standard but challenging task for a skilled user.
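A hedged sketch of this kind of similarity search, on a toy collection and with standard scikit-learn components standing in for the unpublished production system, might look as follows. The "patents" and the query are invented, and a single SVD replaces the per-cluster models used at full scale.

```python
# Sketch of SVD-based document similarity search on a toy "patent" collection.
# Library calls are standard scikit-learn; the real system and its data are
# not reproduced here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

patents = [
    "rotor blade assembly for wind turbine power generation",
    "wind turbine generator with variable pitch rotor control",
    "lithium battery electrode coating for electric vehicles",
    "battery cell cooling system for electric vehicle packs",
    "antibody composition for treatment of inflammatory disease",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(patents)
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(X)            # "singular vectors" per document

index = NearestNeighbors(n_neighbors=3, metric="euclidean").fit(doc_vectors)

query = ["offshore wind turbine rotor blade design"]
q_vec = svd.transform(tfidf.transform(query))  # score the new document
distances, ids = index.kneighbors(q_vec)
for dist, i in zip(distances[0], ids[0]):
    print(f"{dist:.3f}  {patents[i]}")
```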
4 Conclusion

There are many other industrial applications that could have been presented in this paper; these include survey analysis, analysis of complaint letters, CRM in call centers, codifications, analysis of adverse drug outcomes, analysis of competitive intelligence, analysis of similarities and differences between political programs, recommendation systems, claims fraud detection, and security analysis. The case studies above were chosen because they exemplified different text mining approaches: classification agents in the case of the pharmaceutical research literature, clustering combined with statistical analysis for early warning systems, and predictive modeling of clustering scores for similarity searching in huge databases. Text Mining has certainly reached the industrial
world and is exploiting knowledge that, due to its sheer size, is sometimes, or even often, beyond human consumption.
References

1. Höne, Reinhard, SAS EMEA, Heidelberg: Personal communication. [email protected]
2. Reincke, Ulrich, SAS Germany, Heidelberg: Personal communication. [email protected]
3. The Intertek Group: Management Report on Leveraging Unstructured Data in Investment Management. 94, rue de Javel, F-75015 Paris, www.theintertekgroup.com, http://www.theintertekgroup.com/Intertek-TM-MngmntRpt.pdf
Using Text Mining Tools for Event Data Analysis

Theoni Stathopoulou

Institute of Political Sociology, National Centre for Social Research, Athens, Greece
Abstract. This paper concerns itself with the analysis of event data with text mining tools. The methodological approaches to event data analysis are presented, and an analysis is performed using SPAD Software and SAS Text Miner. Finally, some conclusions are drawn concerning the use of text mining tools for event data analysis.
1 Introduction

The European Social Survey (ESS) is a biennial survey conducted in over 20 nations. Its aim is twofold: "to monitor and interpret public attitudes and values within Europe and to investigate how they interact with Europe's changing institutions, as well as to advance and consolidate improved methods of cross-national survey measurement in Europe and beyond"1. To this end, an event database was designed as a supplement to the questionnaire; each country had to collect, both before and during fieldwork, important events as recorded by the media (national newspapers, and television and radio broadcasts) that could have exerted influence on responses to certain questions. The basic aim of event data collection is to record and evaluate the impact of historical circumstance, as reported by the media, on the shaping of attitudes. The first round of the ESS produced a rich dataset of events reported by each participating country. This paper deals mainly with the presentation of the methodological approaches for event data analysis and exploration through text mining techniques. In addition, a case study of event data analysis from the database of the European Social Survey is presented. The analysis was performed through the use of SPAD Software and SAS Text Miner. Finally, conclusions about the use of text mining techniques for event data analysis are drawn in the last section of the paper.
1 http://www.europeansocialsurvey.org
2 Modelling Techniques for Event Data Analysis

In modern society various mechanisms of data generation, storage and retrieval dynamically support the explosion of information. The rapid development of technology not only constitutes the basic component of continuous data production but is also used for data exploration and information retrieval in various sectors of socio-economic life. Generally, information can be found in both structured and unstructured forms in a variety of sources, such as newspapers and magazines, scientific articles, documents and Web pages, data repositories on Intranets, and the Internet. This explosive increase of stored information at almost every level of human activity has resulted in the need for new, powerful tools that convert both structured and unstructured forms of data into knowledge. However, the development of applications that support data management, exploration and information extraction is undoubtedly related to sophisticated methodologies and techniques for data and text mining. Moreover, researchers working within various scientific domains, such as machine learning, pattern recognition, statistics, econometrics, data visualization, database development and information systems, contribute to the continuous evolution of data and text mining processes.

The main difference between data mining and text mining is that data mining usually deals with structured data sets, whereas text mining deals with unstructured or semi-structured data, such as text found in articles, documents, etc. Apart from the fact that the lack of structure within texts increases the difficulty of textual analysis, there is also an additional factor of complexity within a text mining process. Each word or phrase does not constitute a unique source of information, since it can be interpreted in various ways according to the content of the document or the paragraph in which it is located. In addition, textual data are written in a specific language and are usually characterized by high variability. Among the various types of textual data the most common are the following:

• Scientific documents
• Legal documents
• Office documents
• News articles
• Patent information
• Reference manuals
• Computer programs/documentation
• Fiction
• Poetry
• Advertising
• Newsgroups/bulletin boards
Although textual data have some common characteristics, they also have unique attributes according to their type. In addition, the basic types of textual data have subtypes. Event data can be considered a subtype of the "News articles" category.

Event data analysis and research was first developed in the United States during the 1960s and 1970s, in an effort to combine classic diplomatic history with the quantitative analysis of foreign policy.2 The first attempts at systematic coding were made by [13] for the World Event/Interaction Survey (WEIS) and by [2] for the Conflict and Peace Data Bank (COPDAB). According to these schemes, events referring to political relationships between countries were coded on the basis of a conflict-cooperation scale over a certain time span. The prediction models produced by event data analysts were used by government agencies and consulting companies. However, the impact of these models on foreign policy remained relatively limited during the 1980s. Over the past decade, interest in event data analysis and research has re-emerged as a result of the rapid progress seen in computer technology. The development of machine-readable news reports and automated coding has dramatically reduced the costs of generating, customizing, and analyzing event data.

Despite increased interest in event data, there is no single universally accepted definition of what constitutes an "event". The early event data projects provided relatively succinct definitions. According to Burgess and Lawton, for example, "event data is the term that has been coined to refer to words and deeds – i.e. verbal and physical actions and reactions – that international actors (such as states, national elites, intergovernmental organizations and NGOs) direct toward their domestic or external environments". Azar and Ben-Dak defined an event as: "Some activity undertaken by an international actor (a nation-state, a major sub-unit of a nation-state, an international organization) at a specific time and which is directed toward another actor for the purposes of conveying interest (even non-interest) in some issue. Hence, according to that definition an event involves (1) an actor, (2) a target, (3) a time period, (4) an activity, and (5) an issue about which the activity revolves". Gerner et al. (1994, p. 95) define an event as: "an interaction, associated with a specific point in time, that can be described in a natural language sentence that has as its subject and object an element of a set of actors and as its verb an element of a set of actions, the contents of which are transitive verbs".

2 See Stathopoulou 2004b.
The key elements of the definitions mentioned above are: time, natural language, actors, and actions. Event data analysis is based on Linguistic Processing and Statistical Analysis; these two procedures are the main components of Text Mining.

Before the Text Mining process can be performed, stochastic variation in event data must be eliminated. The recording and coding of an event depend on its type (political, social, and so on), on the cultural particularities of the countries in which it occurs, and above all on the restrictions of national media institutions [1, 19, 20]. More specifically, this means that events are prone to selection biases, such as event, news agency and issue characteristics, and to description biases, e.g. differences in reporting [8]. It should be stressed that these biases are the cause of the increased amount of noise observed in event data. This noise must be taken into consideration in the process of event selection. Each word or phrase of a text does not constitute a single piece of information, since it can be interpreted in various ways according to the content of the document or paragraph in which it is located. Moreover, textual data are written in different languages; texts are primarily recorded in their initial language and are subsequently translated into English. This process increases the probability of wrongly assigning a particular meaning to a word. As a result, the same event can be recorded in different ways by the various participating countries.

Although the process of event data analysis is of major importance for the collection of information at various levels, correct coding is necessary for the elimination of errors that usually appear due to misinterpretations when events are translated from one language to another. As has been explained above, "event data are prone to various sorts of filtering or 'translations' imposed by media and coders' (national reporters') interpretations" (Stathopoulou 2004b, p. 9).

The transition from human coding to machine coding over the last decade has played a major role in the reduction of such errors of interpretation. Machine coding aims at the determination of general language patterns within the event data. The coding programs use dictionaries to convert event reports into event data. Machine coding is the appropriate method for large datasets because it ensures consistency and removes the need for testing inter-coder reliability [7, 15, 17, 21]. However, human coding is generally preferable for small datasets, and it retains a comparative advantage over machine coding: the actual meaning of a word can be better determined by an experienced coder than by a coding program, since a native speaker has the ability to correctly parse a sentence [16].

Apart from text coding, Linguistic Processing is also important as an initial stage of the text mining process [18]. More specifically, Linguistic Processing is made up of the following stages:
Data cleansing: cleans the input data by removing irrelevant HTML characters and punctuation characters.

Lemmatization: restricts the morphological variation of the textual data by reducing each of the different inflections of a given word form to a unique canonical representation (or lemma).

Part-of-speech tagging: automatically identifies the morpho-syntactic categories (noun, verb, adjective) of words in the documents. Non-significant words can be filtered on the basis of their morpho-syntactic category. Part-of-speech tagging runs on the textual content, i.e. on the titles and short descriptions of the events.

Part-of-speech selection: the words that come from a particular part-of-speech category (nouns, verbs, etc.) can be selected for the analysis. In addition, the vocabulary size can be reduced by the creation of synonyms, so that two or more words can be merged into one.
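As a purely illustrative sketch of these four stages (not the implementation used by the tools in this study), the following code runs a toy event report through cleansing, lemmatization, tagging and selection. The lemma and part-of-speech lexicons are tiny hand-made stand-ins for the dictionaries a real system would rely on.

```python
# Toy walk-through of the four linguistic processing stages listed above.
import re

LEMMAS = {"elections": "election", "held": "hold", "protests": "protest",
          "were": "be", "strikes": "strike"}
POS = {"election": "noun", "protest": "noun", "strike": "noun",
       "national": "adj", "general": "adj", "hold": "verb", "be": "verb",
       "in": "prep", "september": "noun"}

def clean(text):                       # 1. data cleansing
    text = re.sub(r"<[^>]+>", " ", text)          # strip HTML tags
    return re.sub(r"[^\w\s]", " ", text).lower()  # strip punctuation

def lemmatize(tokens):                 # 2. lemmatization
    return [LEMMAS.get(tok, tok) for tok in tokens]

def tag(tokens):                       # 3. part-of-speech tagging
    return [(tok, POS.get(tok, "other")) for tok in tokens]

def select(tagged, keep=("noun", "adj")):   # 4. part-of-speech selection
    return [tok for tok, pos in tagged if pos in keep]

event_report = "<p>National elections were held; general strikes in September.</p>"
tokens = clean(event_report).split()
print(select(tag(lemmatize(tokens))))
# -> ['national', 'election', 'general', 'strike', 'september']
```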
The use of a dictionary and of a grammar is also necessary for Linguistic Processing.

Various statistical techniques can be applied in event data analysis according to the type of exploration or the aim of the analysis. For example, Discriminant Analysis can be used for deductive purposes and Cluster Analysis for inductive purposes (Schrodt and Gerner 2000). Discriminant Analysis is based on predetermined event data categories and is used as a validation technique for the discriminative power of the predetermined categories. The cluster analysis method, on the other hand, supports the dynamic generation of clusters of events and enables the monitoring of the shifting of each cluster over time. In addition, Factor Analysis is used for dimensionality reduction. The words in the various texts (event reports) which refer to particular events each month can be considered as variables that are combined into one factor according to their correlation. Various factors are constructed as linear combinations of words. The physical interpretation of each factor results from the words that compose the linear combination and are most highly correlated with each other. Hence, each factor can represent particular events according to the significantly correlated words (variables) that compose the linear combination. Months or countries can be plotted on a graph where each axis corresponds to a factor. The selection of the appropriate factors is based on the highest amount of variability explained. In general, there are many rules for choosing the optimal number of factors (cf. Chatfield and Collins 1995).

It is clear that event data analysis is a powerful methodological tool in the field of international relations for crisis monitoring. However, its application in the case of the European Social Survey has caused some difficulty, particularly in the combination of the macro and micro levels of analysis.
The modelling of events is a sophisticated process. Event data are related to time and to a particular thematic event category within each country. As a result, the data analysis model should be based on a methodological approach that exploits the information of the parameters that characterize event data. More specifically, these parameters are:

1. The time (defined by months).
2. The texts that describe the events.
3. The thematic category to which an event belongs.
4. The country by which an event is recorded.
In order to overcome the problems of measurement apparent in event data, and of the interface between them and the survey questionnaire, it is proposed that event data be approached as an autonomous tool. This tool is based on the simultaneous analysis of the three main components of the event database mentioned above: time, space (the totality of countries), and events. By means of multi-dimensional analysis, such as Correspondence and Cluster Analysis (Benzécri et al., 1973; Lebart et al., 1997; Lebart L., 1998; Johnson and Wichern, 1998), and by taking into account the serious homogenisation problems, the model aims to highlight:

1. The continuous flow of events by month;
2. The possible impact of the events within each country and across countries diachronically;
3. The shifting of the thematic event categories within countries; and
4. The appearance of new thematic categories.

The independence of this tool empowers and broadens our analytical capabilities, allowing the better evaluation, monitoring and utilisation of the events. This approach would better serve the needs of this particular survey in the long run, and may also be of use to surveys of a similar nature (cf. Stathopoulou 2004b). The concept behind the tool is based on the structuring of the event data in the following way:

• Country-Event ID (a unique key that identifies a unique event by country)
• Date (the month of the event reported)
• Name of the event
• Short description of the event
In the following section the event data are analyzed with SPAD Software and SAS Text Miner. Although both tools apply linguistic processing and multidimensional techniques, it is interesting to explore their performance on the same data set, so as to investigate their differences and similarities in the text mining process as a whole.
3 Event Data Analysis using Text Mining Tools: A Case Study

This section presents a practical example of the event data analysis process. The analysis was performed using SPAD Software and SAS Text Miner. SPAD Software is an advanced statistical analysis tool for Data Analysis and Text Mining. SAS Text Miner is a module of SAS Enterprise Miner which specialises in Textual Analysis. The data for the analysis were extracted from the database of the European Social Survey, and refer to three particular months: September, October and November 2002. The structure of the event data was modified to take the following form:

• Country-Event-Month ID (a unique key that identifies a unique event by country and month)
• Name of the event
• Short description of the event
The main difference between the new structure and the one outlined in Sect. 2 is in the ID code assigned to each event. Because the data from the three months were unified into one data set, the IDs used in the new structure also refer to the month in which the event occurred. This new coding of the textual data reduces the importance of the month in the monitoring of month-to-month event shifting. However, this particular structure does not cause any problems for the demonstration of text mining processes, since the basic aim of this case study is to highlight the similarities and differences between the two tools used. The basic steps in this process are:

1. The import of Event Data into the text mining tools
2. Linguistic Processing (automated & manual)
3. Factor Analysis
4. Cluster Analysis
3.1 Event Data Import into Text Mining Tools

SAS Software can accept various input data set formats. The initial event data were stored in an Excel file, and were converted into an SAS System data set when they were imported into the SAS software. Subsequently, Text Miner was activated, and the SAS data set was imported into an SAS Text Miner project. Various formats of data can also be imported into SPAD Software. The event data were imported as text files (tab delimited).

3.2 Linguistic Processing (Automated & Manual)

The first step for the performance of linguistic processing in Text Miner is the construction of a diagram that illustrates the basic components of the text mining process.
Fig. 1. Text Miner Process Diagram
Text Miner provides a graphical representation of statistical methodologies, which can be dragged and dropped into the main window of Text Miner. Within the framework of event data analysis the initial data set was connected to the Text Miner process. This particular diagram is illustrated in Fig. 1.

When Text Miner is activated, part of the linguistic processing is applied automatically. More specifically, Text Miner performs data cleansing, Lemmatization and Part-of-Speech tagging. Both the selected and the rejected terms can be presented in the main window of Text Miner. In addition, the user can observe the syntactic role of the lemmas. Figure 2 illustrates the main window (interface) of Text Miner, in which the selected terms are displayed. Within this interface the user can also see the dropped terms. Through the "Toggle Keep Status" option the final part of Linguistic Processing (word selection) is applied, and the final event data set, which is used for further analysis, is constructed.

All the stages of Linguistic Processing in SPAD are also presented in a diagram. In general, SPAD software requires three types of criteria: 1) criteria related to the number of texts that will be used for the production of the final lemmas; 2) criteria for the sorting of the lemmas during their manual selection; and 3) selection criteria based on word frequency and segment size. The Text Miner process, on the other hand, is more automated, as the user does not have to specify multiple criteria for the construction of the final vocabulary.

In the final stage of Linguistic Processing, the words and segments for the analysis are selected manually (Fig. 3).
Fig. 2. Main Interface for word and segment selection in Text Miner
Fig. 3. Linguistic Processing Stages in SPAD
Although Text Miner automatically provided 150 words and segments appropriate for analysis, the SPAD software initially provided 8100 lemmas, as no rejection criteria were defined in the previous steps. The words and segments that made up the final event data set for analysis were the same as those used in the SAS software, so that a comparison between the two text mining processes could be made. While the user in SPAD has more freedom to select a variety of words and segments according to subjective criteria, the use of homogeneous datasets for the analysis supports the comparative monitoring of the text mining processes.

3.3 Factor Analysis

In order to represent the meaning of text, it is important to analyse the relationship that exists between the isolated terms by retrieving the key concepts contained in the documents. Text Miner provides two approaches to generate such concepts: Singular Value Decomposition (SVD) and "rolled-up terms". For the dimension reduction the Singular Value Decomposition method was applied, and cluster analysis was applied to the key concepts of the texts that referred to events.

SPAD performs Factor Analysis and produces two-dimensional graphs of factor combinations, which can be modified by the user through the interactive determination of factor combinations. Figure 4 illustrates an example of Factor 1 and Factor 2, and some of the lemmas which contribute to the construction of the factors. According to the diagram in Fig. 4, some lemmas are mainly positively associated with Factor 2 and some others constitute the linear combination that composes Factor 1 (Fig. 5).
Fig. 4. Representation of Factor 1 and Factor 2 as linear combinations of lemmas
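As a hedged illustration of this dimension-reduction step (not the actual SPAD or SAS Text Miner computation), the sketch below fits an SVD to a small term-document matrix of invented event descriptions and prints the lemmas that weigh most heavily on each factor, mirroring the factor plots described above.

```python
# Toy SVD over a term-document matrix of invented event descriptions,
# listing the top-loading terms for each of two factors.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

events = [
    "general strike by transport workers over pension reform",
    "nationwide strike and union protest against pension cuts",
    "parliamentary elections won by opposition coalition",
    "president calls early elections after coalition collapse",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(events)
svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(X)

terms = vec.get_feature_names_out()
for k, axis in enumerate(svd.components_, start=1):
    top = sorted(zip(axis, terms), reverse=True)[:4]
    print(f"Factor {k}:", ", ".join(f"{t} ({w:+.2f})" for w, t in top))
```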
Fig. 5. SAS Text Miner Cluster Settings
In addition, SPAD provides various statistics related to factor analysis. These results are presented in various output formats, such as internal output editors or Excel files.

3.4 Cluster Analysis

Once the vocabulary of the event data was constructed in Text Miner, cluster analysis was performed for the creation of event categories. Text Miner provides two clustering techniques geared specifically toward the analysis of text-based data. One method uses a hierarchical clustering algorithm in which each document is placed within a specific subtree. The other method makes use of "fuzzy clustering", in which each document has a probability of membership in each cluster. The clustering of the event data was based on the hierarchical algorithm. Once cluster analysis has been performed, the user can view the results in the results window, as shown in Fig. 6. The documents of the clusters and the various statistics of the analysis can be saved as a data set, which can be used for further analysis.

Cluster Analysis was performed on the factors that were produced through factor analysis in order to generate event categories in SPAD. In this part of the analysis the SPAD software provides various criteria relating mainly to: 1) the number of factors to which Cluster Analysis is applied; and 2) the results that will be displayed once the Cluster Analysis is completed. In addition, criteria can be specified for the Cluster Analysis algorithm and for the number of clusters that are produced [14].
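The following sketch is a rough stand-in for the hierarchical document clustering step, using SciPy's agglomerative clustering on a tf-idf matrix of invented event descriptions; it is not the algorithm actually implemented in SAS Text Miner or SPAD.

```python
# Illustrative hierarchical clustering of a few invented event descriptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

events = [
    "general strike by transport workers over pension reform",
    "union protest and nationwide strike against pension cuts",
    "parliamentary elections won by the opposition coalition",
    "president calls early elections after coalition collapse",
]

X = TfidfVectorizer(stop_words="english").fit_transform(events).toarray()
Z = linkage(X, method="ward")                     # hierarchical merge tree
labels = fcluster(Z, t=2, criterion="maxclust")   # cut into two clusters
for lab, text in zip(labels, events):
    print(lab, text)
```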
Fig. 6. Cluster Results in SAS Text Miner
Fig. 7. Cluster Analysis criteria in SPAD
In the SPAD software, the classes of the event data, together with the words that contribute to their creation, are represented in a graph where each axis represents one factor. The structure of this graph is illustrated in Fig. 8.

Cluster Analysis performed in SAS and in SPAD produced event categories which contained events per country for three consecutive months (September, October and November). Even though the same data set was imported into both text mining tools, and the same methodological approaches were applied, different final results were produced. The SPAD software generated 15 event categories, while SAS produced 16 clusters. Although the content of some clusters produced by the two tools was similar, in general the final results were not the same. SPAD applies the Valeurs Test to Factor and Cluster analysis in order to reduce the noise in the data.
Fig. 8. Clusters and lemmas plotted against the factors
This means that the final clusters are described by the lemmas with the maximum contribution, and as a result some codes do not appear in the clusters because the lemmas that contained them did not contribute significantly to the analysis [12]. In its final results, SAS Text Miner classifies and displays all the event codes; in contrast, the SPAD clusters contain the events that bear the strongest correlations between them. The SPAD software is more effective in terms of robustness and consistency, but some events are missing and remain unclassified due to the logic of the analysis.
4 Conclusions

Event data analysis constitutes a process that cannot be performed effectively through the use of automated methodological approaches alone. The process requires automation in the first phase of event coding, so that event data sets produced by various countries can be homogenised. Automation is also useful for Linguistic Processing, and for the cleansing and the syntactic categorisation of the data. However, the subsequent stages of the process (the selection of the final words and segments) demand the contribution of human logic and experience. Varying results can be produced from the event data set that is created by lemma selection. Varying results can also be produced by a single text mining tool depending on the criteria selected at each stage.

To conclude, the monitoring of event data diachronically and the exploration of event shifting from month to month can be carried out with the aid of both text mining tools and human reasoning. However, regular checks are necessary, so as to ensure that the data are properly homogenised and
grouped. In this way noise is filtered out, allowing the production of meaningful results. If conducted with care, event data analysis using Text Mining techniques constitutes a dynamic analytical tool that can produce robust and consistent results.
References

1. Aarts, Kees, Semetko, Holli A. (2003). The divided electorate: Media use and political involvement. The Journal of Politics, vol. 65, no. 3, pp. 759–784.
2. Azar, E. (1980). The conflict and peace data bank (COPDAB) project. Journal of Conflict Resolution 24.
3. Azar, Edward E. and Joseph Ben-Dak (1975). Theory and Practice of Events Research. New York: Gordon and Breach.
4. Benzécri, J.-P. et al. (1973). L'Analyse des Données, volume II: L'Analyse des correspondances. Dunod.
5. Burgess, Philip M. and Raymond W. Lawton (1972). Indicators of International Behavior: An Assessment of Events Data Research. Beverly Hills: Sage Publications.
6. Chatfield, C., Collins, A. (1995). Introduction to Multivariate Analysis. Chapman & Hall/CRC, 82–87.
7. Gerner et al. (1994). Machine coding of events data using regional and international sources. International Studies Quarterly, 38, pp. 91–119.
8. Earl, J., Martin, A., McCarthy, J.D., and Soule, S.A. (2004). The Use of Newspaper Data in the Study of Collective Action. Annual Review of Sociology 30, pp. 65–80.
9. Johnson, R. and Wichern, D. (1998). Applied multivariate statistical analysis. Prentice-Hall, Inc.
10. Lebart, L., Morineau, A., Piron, M. (1995). Statistique exploratoire multidimensionnelle, 181–184.
11. Lebart, L., Morineau, A., and Piron, M. (1997). Statistique exploratoire multidimensionnelle. Dunod, 2ème edition.
12. Lebart, L., Salem, A., Berry, L. (1998). Exploring textual data. Kluwer Academic Publishers.
13. McClelland, C.A. (1978). World Event/Interaction Survey (WEIS) project, 1966–1978. Third ICPSR ed. Ann Arbor, MI: Inter-University Consortium for Political and Social Research.
14. Rajman et al. (2002). Evaluation of Scientific and Technological Innovation using Statistical Analysis of Patents. 6es Journées internationales d'Analyse statistique des Données Textuelles (JADT).
15. Schrodt, A.P. (2001). Automated coding of international event data using sparse parsing techniques. Paper presented at the annual meeting of the International Studies Association, Chicago, February 2001.
16. Schrodt, A.P. and Gerner, J.D. (2000). Analyzing international event data: a handbook of computer-based techniques. Draft manuscript for Cambridge University Press. Retrieved March 27, 2004 from http://www.ku.edu/~keds/papers.dir
17. Schrodt, A.P. et al. (2001). Monitoring conflict using automated coding of newswire reports: a comparison of five geographical regions. Paper presented at the PRIO/Uppsala University/DECRG High-Level Scientific Conference on Identifying Wars: Systematic Conflict Research and Its Utility in Conflict Resolution and Prevention, 8–9 June, Uppsala, Sweden.
18. Spinakis, A., Panagopoulou, G., Chatzimakri, A. (2004). Sting: A Text Mining Tool Supporting Business Intelligence.
19. Stathopoulou, T. (2004a). Event data in the European Social Survey: problems of coding and analysis. Paper presented at the NCSR (National Centre for Social Research) Conference on "Politics and Society. Results from the European Social Survey", June 4, Athens, Greece.
20. Stathopoulou, T. (2004b). Modeling events for ESS: toward the creation of an autonomous tool for survey research. Paper presented at the Sixth International Conference on Social Science Methodology, August 16–20, Amsterdam, The Netherlands.
21. Thomas, D.G. (2000). The machine-assisted creation of historical event data sets: a practical guide. Paper presented at the International Studies Association Annual Meeting, March 14–18, Los Angeles, California.
Terminology Extraction: An Analysis of Linguistic and Statistical Approaches

Maria Teresa Pazienza1, Marco Pennacchiotti1, and Fabio Massimo Zanzotto2

1 Artificial Intelligence Research Group, University of Roma Tor Vergata, Italy {pazienza,pennacchiotti}@info.uniroma2.it
2 University of Milano Bicocca, Italy [email protected]

Abstract. Are linguistic properties and behaviors important for recognizing terms? Are statistical measures effective for extracting terms? Is it possible to capture a sort of termhood with computational linguistic techniques? Or are terms too sensitive to exogenous and pragmatic factors to be confined within computational linguistics? All these questions are still open. This study tries to contribute to the search for an answer, in the belief that it can be found only through a careful experimental analysis of real case studies and a study of their correlation with theoretical insights.
1 Introduction

The studies on the definition and implementation of methodologies for extracting terms from texts have assumed since the beginning a central role in the organization and harmonization of the knowledge enclosed in domain corpora, through the use of specific dictionaries and glossaries [34]. Recently, the development of robust computational Natural Language Processing (NLP) approaches to terminology extraction, able to support and speed up the extraction process, has led to an increasing interest in using terminology also to build knowledge base systems by considering information enclosed in textual documents. In fact, both Ontology Learning and Semantic Web technologies often rely on domain knowledge automatically extracted from corpora through the use of tools able to recognize important concepts, and relations among them, in the form of terms and term relations.

1 This research has been developed during his sojourn at the AI Research Group at Roma Tor Vergata University.
While terminology extraction (hereafter intended as the study of NLP-based methodologies to extract terms from textual domain corpora) has found widespread application in Artificial Intelligence systems, the notion itself of term is still not clear, both from a pure linguistic and from a computational point of view. Operatively, it is thus possible to give only a general definition of term, as "a surface representation of a specific domain concept" [24, 30]. The difficulty in finding a deeper definition of term, in defining the properties that characterize a term univocally, and in "translating" these properties operatively into a running system, still plays a central role in research on computational linguistics. Recently, properties characterizing terms, such as termhood and unithood, have been proposed in the literature [27], together with statistical measures and linguistic techniques able to "translate" such properties into computational algorithms.

In this study we present a few commonly agreed statistical and linguistic approaches used in NLP to extract and recognize terms, trying to compare their strengths in an automatic development environment. Moreover, we define a hybrid linguistic-statistical strategy that seems to us to guarantee the extraction of a reliable terminology. Our approach has been implemented in a Terminology Extraction architecture (see later in Sect. 3), whose aim is both to verify the validity of the many statistical measures proposed in the literature and to evaluate existing and new linguistic methods for term recognition; as a test bed, a spacecraft design corpus provided by the European Space Agency has been used. As the issue of the statistical measures adopted for recognizing terms is still a matter of debate, in this study we will also focus on both methodological aspects and follow-up of the measures, trying to relate experimental evidence to their statistical methodological perspectives.

In Sect. 2 we present and classify the different statistical, linguistic and hybrid approaches proposed in the literature, together with associated statistical measures and linguistic filters. In Sect. 3 we describe our Terminology Extraction methodology, carefully comparing measures and linguistic techniques on a common test bed. Finally, in Sect. 4 we try to outline conclusions and open questions.
2 Terminology Extraction Approaches

Current and past research on computational terminology deals with a variety of approaches and strategies to extract and recognise terms, using both supervised and unsupervised techniques. The aim of most research has been to obtain from a domain corpus the most significant set of terms, that is, the set of superficial representations of domain concepts that best represents the domain for a human expert.

In order to better understand and organize the work produced in the field, it is useful to identify two mainstream approaches to the problem. On one side, statistical measures have been proposed to define the degree of termhood of candidate terms, i.e., to find appropriate measures that can help in selecting good terms from a list of candidates. On the other side, computational terminologists have tried to define, identify and recognise terms
looking at pure linguistic properties, using linguistic filtering techniques aimed at identifying specific syntactic term patterns. Finally, hybrid approaches try to use these two views together, taking into account both linguistic and statistical hints to recognise terms.

In this section we present those that we regard as the main approaches adopted in the two mainstream views. Even if historically statistical approaches were introduced before the linguistic ones, we present the latter first, since modern hybrid systems are usually composed of a cascade of a first linguistic analysis followed by statistical filters.

2.1 Linguistic Approaches

Linguistic approaches to term recognition basically try to identify terms by capturing their syntactic properties: in fact, it has been proved (see [8]) that terms usually have characteristic syntactic structures, called synaptic compositions; since the beginning, candidate terms have mostly been identified with noun phrases (e.g., the PHRASE system [17]).

Among the researches that rely solely on linguistic analysis, [9] postulates that syntactic data are sufficient to carry out term recognition. In that study the linguistic analysis is divided into two phases. Firstly, candidate terms are extracted using frontier markers that discard text sequences unlikely to contain terms (such as phrases containing verbs and pronouns). Then, relying on the analysis produced by a shallow syntactic parser, parsing rules are applied to the fragments that survived the first phase in order to select actual terms. Rules are created empirically by looking at experimental data. An example rule (for French) extracts from fragments of the type [noun1 adj prep det noun2 prep noun3] terms like [noun1 adj noun2 prep noun3].

More recent works see the linguistic analysis simply as a set of linguistic filters through which a system is able to retain admissible forms. Among others, [3, 4] describe an approach to term extraction based on linguistic knowledge; moreover, in [19] the basic forms of English terms are [noun, noun] and [adjective, noun], from which more complex syntactic patterns can be derived. In many works (such as [13, 26]) a simple regular expression is supposed to be sufficient to identify the candidate term forms. In this direction a great effort has been made in [14]: in that extended study the authors try to identify the most common syntactic structures that terms assume, as inferred from the analysis of human-produced terminological data banks. The study confirms the widely acknowledged intuition that terms generally appear in the form of short noun phrases, mainly composed of only two main items, that is, only two meaningful words, such as nouns, adjectives (adj) and adverbs. These core terms, consisting of one or two main items, are called base-terms. The study identifies two major syntactic forms of base-terms for English, [adj noun] and [noun noun], and three for French, [adj noun], [noun noun] and [noun prep noun]. From the restricted set of base-terms, more complex and longer terms are formed via morphological or syntactic variations.
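As an illustration of this kind of pattern-based filtering, the sketch below extracts [adj noun] and [noun noun] base-term candidates from a hand-tagged sentence; the example sentence and its tags are invented, and a real system would obtain them from a PoS tagger.

```python
# Base-term candidate extraction over PoS-tagged text, keeping only the
# [adj noun] and [noun noun] patterns discussed above.
tagged = [("the", "det"), ("lunar", "adj"), ("spacecraft", "noun"),
          ("mission", "noun"), ("requires", "verb"), ("a", "det"),
          ("thermal", "adj"), ("control", "noun"), ("subsystem", "noun")]

PATTERNS = {("adj", "noun"), ("noun", "noun")}

def base_term_candidates(tagged_tokens):
    candidates = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in PATTERNS:
            candidates.append(f"{w1} {w2}")
    return candidates

print(base_term_candidates(tagged))
# -> ['lunar spacecraft', 'spacecraft mission', 'thermal control', 'control subsystem']
```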
Since base-terms are thus considered to form the core terminology, most approaches to term recognition [13] focus only on them. In such a view the extraction of candidate terms from a domain corpus is usually carried out as a cascade of two modules:

• A parsing module, able to perform a shallow linguistic analysis. Using Part of Speech (PoS) tagging techniques [2, 10], the module should guarantee the identification of nouns, verbs, adjectives and other parts of speech in the text.
• A simple term recogniser module, which, using regular expressions (or similar languages), extracts from the tagged text only the admissible surface forms, filtering out non-interesting forms.

A debated issue in terminology recognition relates to the identification of term variations [24]. As in [14], a term variant may be defined as "an utterance which is semantically and conceptually related to an original term". For example, the expression lunar spacecraft mission can be seen as a variant of the term spacecraft mission, conveying the meaning of the term augmented with additional specific semantic information. The study of term variants plays a role in term recognition, since particular types of variants can be seen as transformed forms of a term that express exactly the same meaning as the related term (synonymy): for instance, the variant mission of spacecraft is a "meaning-preserving" transformation of the term spacecraft mission. In the case of meaning-preserving variations, in terminology recognition it can be justified to consider the original term and the variation as a single term, "collapsing" the variation into the term. On the other side, non-meaning-preserving variations can be seen as a way to identify complex terms built from base-terms. In [14] term variations are classified according to their characteristics. For term extraction, permutations (permuting a base-term with the of preposition, e.g. [mission of spacecraft]) assume a primary role, being one of the strongest "meaning-preserving" transformations. Other interesting studies on the subject have been carried out by [24], devoted to the identification of particular kinds of variants with a view to the semantic structuring of terminologies.

Within a linguistic approach framework, other techniques can be applied in order to refine the terminology. For example, a list of unwanted words (stop-list) can be used to discard those candidate terms that contain one of them. Usually the approach is to insert in the stop-list function words and generic words, that is, words that are of very common usage in the language (for example "this", "that", "thing"). In most approaches stop-list words are automatically extracted from a generic corpus as those with the highest frequency, and are then validated by human experts. A stop-list can eliminate false terms consisting of generic collocations very common in the language, such as "this thing" or "some day", which, being in the form [adj noun], could otherwise be selected as admissible surface forms.
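A hedged sketch of two of these refinement steps, stop-list filtering and the collapsing of the of-permutation variant, follows; the stop-list and the candidate list are toy examples rather than resources used in the experiments.

```python
# Discard candidates containing a stop-list word, and collapse the
# "of"-permutation variant onto its base-term form
# (e.g. "mission of spacecraft" -> "spacecraft mission").
STOP_LIST = {"this", "that", "thing", "some", "day"}

def passes_stoplist(candidate):
    return not any(word in STOP_LIST for word in candidate.split())

def collapse_of_permutation(candidate):
    words = candidate.split()
    if len(words) == 3 and words[1] == "of":       # [noun1 of noun2] -> [noun2 noun1]
        return f"{words[2]} {words[0]}"
    return candidate

candidates = ["spacecraft mission", "mission of spacecraft", "this thing", "some day"]
terms = {collapse_of_permutation(c) for c in candidates if passes_stoplist(c)}
print(terms)   # -> {'spacecraft mission'}
```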
To sum up, an ideal term recognition process within a linguistic approach should be able to:

• parse the domain corpus, identifying at least PoS;
• identify and extract candidate terms through admissible surface form rules;
• collapse meaning-preserving variations into the original term;
• implement other linguistic filters to refine the terminology.
What is produced at the end of the process is a list of good candidate terms likely to constitute the final terminology. However, a further analysis step is needed. In fact, the linguistic forms contained in the candidate terminology at this stage can be described as filtered admissible surface forms, but not yet as true terms. For example, in a space domain, candidate forms such as sufficient number or maximum size, which are not domain-specific expressions, can easily survive the linguistic filters. What is needed is thus a step to select true terms from the admissible surface forms. In other words, a sort of termhood definition must be implemented in the process, able to discriminate among the surface forms. In pure linguistic approaches this step takes the form of manual validation by a human expert. Unfortunately, manual validation is not as straightforward as it seems (see Sect. 2.4). The development of computational models able to capture the notion of termhood and consequently to identify true terms after the linguistic step is thus clearly needed. Such computational models usually consist in the application of statistical measures to the candidate term list, as described in the next section. The linguistic approach thus becomes a hybrid one.

2.2 Statistical Approaches

Statistical measures applied to terminology are of great help in ranking extracted candidate terms according to a criterion able to distinguish between true and false terms and to give higher emphasis to "better" terms. What an ideal statistical measure is expected to do is to assign higher scores to those candidates supposed to strongly possess a peculiar property characterizing terms. What this property is and what "better" means cannot be clearly stated: once again, an agreed definition of termhood would be helpful [7, 34].

Statistical approaches, like the linguistic ones, only seldom reach truly satisfying results when used alone. While in pure linguistic approaches what is lacking is a sort of "implementation" of termhood, the direct application of statistical measures alone to expressions that have not been linguistically filtered can lead to a terminology rich in unwanted forms. Indeed, only a few methods apply statistical measures directly, without a syntactic-semantic analysis of the corpus. An example of a purely statistical method is presented in [32], where 2-word candidate terms are extracted simply by taking groups of two adjacent words, which are then weighted by the Tf*Idf statistical measure. In [25] sequences of words of length N are extracted, and then evaluated with an empirical measure based on term length and frequency.
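As a rough illustration of such a purely statistical method (not the exact procedure of [32]), the sketch below takes all adjacent word pairs of a toy corpus and weights them with scikit-learn's bigram Tf-Idf.

```python
# All adjacent word pairs of an invented toy corpus, weighted by Tf*Idf;
# no linguistic filtering is applied.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the spacecraft mission requires a thermal control subsystem",
    "thermal control of the spacecraft is verified during the mission review",
    "the mission review board approved the spacecraft design",
]

vec = TfidfVectorizer(ngram_range=(2, 2))    # every pair of adjacent words
X = vec.fit_transform(docs)
weights = np.asarray(X.sum(axis=0)).ravel()  # aggregate weight per bigram
ranked = sorted(zip(weights, vec.get_feature_names_out()), reverse=True)
for w, bigram in ranked[:5]:
    print(f"{w:.2f}  {bigram}")
```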
Table 1. A classification of statistical measures in statistical and linguistic dimensions

                     Statistical Dimension
Linguistic Dim.      Degree of Association   Significance of Association                  Heuristic
Unithood             MI, Dice Factor         z-score, T-score, X², Log Likelihood Ratio   MI2, MI3
Termhood             –                       –                                            Frequency, C-Value, Co-Occurrence
In this section some of the major statistical measures for term recognition are described: our interest is in analyzing their effectiveness in combination with linguistic knowledge in hybrid approaches. Statistical measures can be classified along two distinct dimensions: linguistic and statistical.

A linguistic dimension is proposed in [27]: measures are divided into those that express termhood and those that express unithood:

• Unithood: expresses the strength or stability of syntagmatic collocations.
• Termhood: expresses how much (the degree to which) a linguistic unit is related to domain-specific concepts.

By definition, unithood characterizes complex linguistic units (called collocations) composed of words with a strong association, such as compound words, idiomatic expressions (e.g., day after) and complex terms (e.g., spacecraft mission). Therefore unithood, while capturing an important aspect of terms, is not a property peculiar to them. Moreover, being a measure of association, unithood is significant only for multiword terms, and thus cannot be applied to evaluate single-word terms. On the contrary, termhood is a peculiar characteristic of terms, both single-word and complex.

The statistical dimension is based on statistical principles. Measures are classified according to their methodological approach and underlying assumptions2 into:

• Degree of association measures
• Significance of association measures
• Heuristic measures

A summarizing classification is presented in Table 1, in which both dimensions are depicted.

2 Classification proposed by the Institut für Maschinelle Sprachverarbeitung, University of Stuttgart, at www.collocation.de
While heuristic measures are based on empirical and intuitive assumptions that often lack a theoretical statistical justification, the former two types of measures are usually based on a strong statistical background, as briefly described hereafter.

Association measures refer mainly to methods for estimating unithood. They are thus not used only in terminology, but in general for estimating collocations between two words3 u and v, relying on the statistical evidence of the occurrence of these words in the corpus. This evidence is expressed through a contingency table of observed frequencies, where U and V indicate respectively the first and the second word of the collocation. The co-occurrence of (u, v) is indicated by the frequency O11, while N is the total number of collocation couples in the corpus (N = O11 + O12 + O21 + O22):
          V = v    V ≠ v
U = u     O11      O12
U ≠ u     O21      O22
Moreover, the marginal frequencies are defined as:

R1 = O11 + O12,  R2 = O21 + O22,  C1 = O11 + O21,  C2 = O12 + O22
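As a hedged illustration, the sketch below fills the contingency table and marginal frequencies for one candidate pair from a small invented list of extracted word pairs; it is not part of the architecture described in this study.

```python
# Contingency counts for a candidate pair (u, v) from a list of extracted
# two-word collocations (toy data).
from collections import Counter

pairs = [("spacecraft", "mission"), ("spacecraft", "mission"),
         ("spacecraft", "design"), ("thermal", "control"),
         ("mission", "review"), ("thermal", "mission")]

def contingency(u, v, pairs):
    counts = Counter(pairs)
    N = len(pairs)
    O11 = counts[(u, v)]
    R1 = sum(c for (a, _), c in counts.items() if a == u)   # u as first word
    C1 = sum(c for (_, b), c in counts.items() if b == v)   # v as second word
    O12, O21 = R1 - O11, C1 - O11
    O22 = N - O11 - O12 - O21
    return O11, O12, O21, O22, R1, C1, N

print(contingency("spacecraft", "mission", pairs))
# -> (2, 1, 1, 2, 3, 3, 6)
```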
The aim of association measures is to draw inferences from the frequency table in order to estimate a collocation value. More particularly, a random sample model is used, in order to generalize the observations in the frequency table of a single corpus (the sample) into assumptions valid for the language in general (the population). Consequently, since the measures are estimates, they will be prone to sampling errors. Specifically, what has to be estimated is a contingency table valid for the whole language, where Xij are the frequencies of collocations in the whole language. Assuming independence (occurrences of collocations are mutually independent) and stationarity (the probability of seeing a particular word in the corpus does not vary) of the collocation event [16], the values of Xij can be derived from a Bernoulli distribution, with probability parameters τij representing the probability that in the language the collocation Xij occurs in a single trial. It is then necessary to find an estimate of these values. Two ways can be followed: use a direct estimation of the parameters (as the degree of association measures do), or set some working hypotheses about them (as is the case for significance of association measures).

3 All measures examined in this study refer to the case of two-word terms, since these can be considered the most important and typical terms in a core terminology.

Degree of association measures estimate the probability parameters from corpus evidence using maximum-likelihood estimation (MLE). Given the corpus frequencies, the τij are thus estimated as:
τ11 ≈ O11/N,  τ12 ≈ O12/N,  τ21 ≈ O21/N,  τ22 ≈ O22/N

Moreover, the probabilities of occurrence of the first and the second word in the language (π1 and π2 respectively) can be estimated as:

π1 ≈ R1/N,  π2 ≈ C1/N
Combining the parameter estimates, different kinds of measures can be derived. This approach is obviously prone to estimation errors, which are more likely to emerge when frequencies are low.

To avoid the estimation errors caused by the MLE, significance of association measures try to evaluate collocations using the null hypothesis of independence (HI):

τ11 ≈ π1 · π2

HI states that the probability parameters π1 and π2 are independent. From the point of view of terminology this means that there is no interesting relation between the two words composing the term. Under HI, using the MLE of π1 and π2, it is possible to obtain the expected frequency of collocation E11, as the mean of the binomial distribution:

E11 = τ11 · N = π1 · π2 · N ≈ (R1 · C1)/N
HI is usually used by significance of association measures to compare the joint probability derived from a corpus with the joint probability in the case of independence.

Statistical Measures

Table 2 presents the major statistical measures used in terminology recognition and evaluated in our study.

Frequency does not derive from a theoretical statistical principle, but from the simple assumption that a frequent expression denotes an important concept for the domain under examination and should thus assume a high position in the rank of candidate terms. The most important objection to using frequency as a measure for term recognition concerns the fact that it does not take into consideration the degree of association (unithood) among the words composing multiword terms [6]. Thus, very frequent expressions are considered good candidates while not being terms (e.g. "this day"). In order to capture indirectly the unithood nature of terms while using frequency, it is then necessary to implement linguistic filters able to discard candidates that do not have specific syntactic or morphological properties [26]. Frequency has been proved in several experimental studies (such as [13] and [28]) to be one of the most reliable measures for term recognition.
Table 2. Statistical measures and related formulae

Measure                        Adopted Formula
Frequency                      f = O11/N
Church Mutual Information      MI = log2(O11/E11)
Mutual Information variants    MI2 = log2(O11²/E11),  MI3 = log2(O11³/E11)
Dice Factor                    DF = 2·O11/(R1 + C1)
T-score                        TS = (O11 − E11)/√O11
Log Likelihood Ratio           LLR = −2 log [ L(O11,C1,r)·L(O12,C2,r) / (L(O11,C1,r1)·L(O12,C2,r2)) ]
                               where L(k,n,r) = r^k (1−r)^(n−k),  r = R1/N,  r1 = O11/C1,  r2 = O12/C2
C-value                        CV = (len − 1)·(f − f(t)/|t|)
Co-Occurrence                  CO = (Σ_{n∈N} Σ_{i=1..M} O11i) / (M·|N|)
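The following sketch transcribes the association measures of Table 2, as reconstructed above, into runnable code; the contingency counts are invented, and the transcription reflects our reading of the formulae rather than any reference implementation used in the experiments.

```python
# Association measures from Table 2, computed from observed contingency counts.
from math import log2, log, sqrt

def measures(O11, O12, O21, O22):
    N = O11 + O12 + O21 + O22
    R1, C1, C2 = O11 + O12, O11 + O21, O12 + O22
    E11 = R1 * C1 / N                               # expected frequency under HI

    def L(k, n, r):                                 # binomial likelihood kernel
        return r**k * (1 - r)**(n - k)

    r, r1, r2 = R1 / N, O11 / C1, O12 / C2
    return {
        "frequency": O11 / N,
        "MI":  log2(O11 / E11),
        "MI2": log2(O11**2 / E11),
        "MI3": log2(O11**3 / E11),
        "Dice": 2 * O11 / (R1 + C1),
        "T-score": (O11 - E11) / sqrt(O11),
        "LLR": -2 * log((L(O11, C1, r) * L(O12, C2, r)) /
                        (L(O11, C1, r1) * L(O12, C2, r2))),
    }

for name, value in measures(O11=20, O12=30, O21=40, O22=910).items():
    print(f"{name:10s} {value:8.3f}")
```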
Mutual Information was originally defined in information theory [21], and then applied to linguistic analyses. In order to calculate Mutual Information it is necessary to estimate the probability parameters: in Table 2 we use the MLE, as proposed in [11]. A known problem of MI as presented in [10] is that it does not perform well with low frequencies [13, 16]: in fact, the measure overestimates collocations composed of low-frequency words. A solution to this problem, proposed in [11], is to exclude from the corpus collocations with a frequency lower than a certain threshold. Another and more general solution is to use heuristic variants of the MI formula, such as MI2 and MI3 [13], which try to cope with low frequencies by giving more importance to O11, while lacking a precise theoretical justification.

Dice Factor [33] suffers from the low-frequency problem, as MI does. In fact, DF is conceptually similar to MI, but, while the former theoretically derives from harmonic means, the latter is linked to geometric means.

T-score [12] relies on asymptotic hypothesis tests, as do other measures such as the z-score [15]. The aim of the T-score is to approximate the discrete binomial distribution (that is assumed to model collocations) with a distribution that converges to the continuous normal distribution for large N, relying on the null hypothesis of independence. Being a normal approximation of the binomial distribution, the T-score suffers from the well-known problems of the assumption of normality [16].

Log-Likelihood Ratio [16] tries to solve the estimation problems of the T-score and MI. The idea is to compare the probability of obtaining the contingency table observed in the corpus under the null hypothesis with the probability when there is no independence, estimating the probability parameters τ11, π1 and π2 with MLE and calculating the binomial distribution corresponding to the contingency table (parametric test).
The C-value [22] is a linguistically based measure of termhood for multiword terms that takes into consideration the frequency of the candidate term (f), the number of its main items (len) and information about how other candidates derived from the term are distributed in the corpus (t is the set of these candidates and f(t) their overall occurrences).

Co-Occurrence heuristically tries to capture termhood, relying on the assumption that a characteristic of terms is to co-occur in the same section of text with other terms (N are the corpus paragraphs in which the specific term appears, and O11i the occurrences of the M terms in these paragraphs).

Other measures have been used for term recognition (but are not taken into consideration in our experiments): for example, Tf*Idf [23], Domain Relevance & Domain Consensus [6], and contrastive measures [34]. Many of these measures use a contrastive analysis of the domain corpus against a generic corpus (or many other specific corpora) in order to select terms.

2.3 Hybrid Approaches

Recent terminology extraction systems combine linguistic and statistical techniques in structured hybrid approaches. Linguistic analysis is carried out before the application of statistical measures, in order to select all linguistically admissible candidates over which the numerical tests will be applied. Moreover, the reliability of a statistical measure increases when it is applied over linguistically justified candidates. The statistical step works on the list of candidates selected by the linguistic filters, trying to select and rank them according to a definition of termhood or unithood implemented through a specific measure. One of the first systems using a hybrid approach is presented in [17], where noun phrases are first extracted as term candidates and then selected according to the frequency of their noun elements. In [13] linguistic candidates obtained by the application of syntactic patterns are filtered using different statistical measures, such as LLR, MI and frequency. In [26] a similar approach is followed: regular expressions are used in order to extract linguistic candidates from the corpus, which are then ranked by frequency. A more complex architecture is envisioned in [18], where simple terms are first extracted according to frequency. New and more complex terms are then derived through linguistic heuristics and frequency filters applied to the simple terms retrieved in the first phase. A step further is to deepen the linguistic analysis using semantic and contextual information. In [1] semantic information derived from thesauri, linguistic hints and statistical evidence are mixed together to rank candidate terms. For this purpose the NC-value, a complex heuristic measure, is proposed as a combination of the C-value and of a context factor that takes into consideration the semantic, syntactic and statistical properties of the contexts in which the candidate terms appear. The use of extrinsic information (e.g., contexts) is common also to other approaches. In [6] a shallow syntactic parser is used to select candidate term
patterns; then the Domain Relevance and Domain Consensus measures are applied to rank terms according to their contexts, intended at a wider domain level. In [34] an extensional definition of term is proposed, in order to boost the term recognition process using frequency as a statistical measure, together with lexical and syntactic information about the contexts in which the term appears.

2.4 The Evaluation Issue

The evaluation of a term recognition system, in terms of the quality of the extracted information, is highly relevant (beyond performance evaluation) both to verify the validity of the underlying theoretical assumptions and to evaluate linguistic theories. Unfortunately, even though automatic term extraction and recognition have a long tradition, no golden standards for evaluation have been introduced to clearly evaluate and compare different approaches. The difficulty in outlining a generic and widely acceptable standard stems from the intrinsic nature of terms. Indeed, as outlined in Sect. 1, it is even difficult to give a precise linguistic definition of term. While an operational definition can be postulated, the problem remains for what concerns evaluation: hence the need for a golden standard against which to measure system performances. A golden standard can be provided, directly or through validation, only by a human expert. It is thus prone to the expert's subjective and personal interpretation of terms. This layer of indeterminacy leads to more practical problems at a methodological level, where a method for evaluating an automatically extracted terminology is needed. Mainly two different methods are usually adopted for evaluation purposes: the reference list and validation. In the first case an a priori list of terms is assumed as a golden standard: in most cases the list is an already existing terminology for the specific domain. A reference list can also be constructed by a human expert examining the same corpus used for the automatic extraction. The quality of a system is evaluated in terms of Precision (the percentage of extracted terms that are also in the reference list) and Recall (the percentage of terms in the reference list extracted by the system). The validation method is preferred when a golden standard is not available or when particular characteristics of the extraction process have to be made explicit. In this case the performances are evaluated by a human expert who validates the terms extracted by the system. A Precision score is thus derived as the percentage of extracted candidate terms that have been retained as terms by the expert. Of course, manual validation is a time-consuming activity. In [34] an account of the procedures and the difficulties in carrying out the process is given. In particular, manual validation requires two things. Firstly, the validation has to be done by more than one expert, in order to have the most reliable resource. Secondly, each expert must be introduced to
the notion of what a term is: indeed, since the definition of termhood is rather vague, it is likely that experts produce different validations, based on their own intuition of what a term is. Both methods have pros and cons. In terms of performance measures, the reference list technique is not the most suitable means to calculate Precision. In fact, it can happen that the system extracts true terminological expressions that are not present in the reference list: while being good terms, these candidates are then counted as false ones. On the other side, the validation method is not able to capture Recall, since no other terms exist than those extracted by the system. Moreover, validation is a more system-dependent method, since it must be repeated for each system even when they operate on the same domain. Validation also depends too much on the personal judgement of the expert, who can be influenced in the validation task by external factors and by the list of terms already examined. In the literature the problem of evaluation is still present, and maybe it will never be solved, thus limiting the development of an effective and standard framework in which to develop term-related technologies. In fact, since some systems adopt the reference list (e.g. [13]) and others the validation method (e.g. [6, 20]), it is impossible to clearly compare performances and thus to draw a precise line of evolution in term recognition methodologies.
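For the reference-list method, the Precision and Recall definitions above reduce to a simple set comparison. The sketch below is only an illustration (the term lists are invented, not taken from any corpus used in this study):

```python
def precision_recall(extracted, reference):
    """Reference-list evaluation of an extracted term list against a gold list."""
    extracted, reference = set(extracted), set(reference)
    true_terms = extracted & reference
    precision = len(true_terms) / len(extracted) if extracted else 0.0
    recall = len(true_terms) / len(reference) if reference else 0.0
    return precision, recall

extracted = ["magnetic field", "solar wind", "this day", "launch vehicle"]
reference = ["magnetic field", "solar wind", "launch vehicle", "source packet"]
print(precision_recall(extracted, reference))  # (0.75, 0.75)
```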
3 Term Recognition in Practice: A Hybrid Approach

In order to overcome such an impasse we have carried out an in-depth analysis of the main methods used for term recognition in the literature and cited in the previous sections. In particular we focus, on one side, on establishing a robust linguistic model to extract terminological expressions and, on the other side, on evaluating and comparing different statistical measures when applied over the extracted candidates. A wide debate is in fact active about the statistical validity and the mathematical foundation of many of the previously described measures (in particular those based on heuristic assumptions) [27]; a comparative study can thus be useful in order to understand their weaknesses, strengths, shortcomings and merits. In this view, the overall term recognition process we envision can be classified as a hybrid approach composed of both a linguistic and a statistical step. To evaluate different linguistic and statistical methodologies we tested our recognition process over a specific test bed. The corpus consists of a collection of domain-specific documents related to spacecraft design, provided by the European Space Agency (ESA) in the framework of the Shumi Project [31] jointly conducted by the AI Research Group of Roma Tor Vergata and the ESA/ESTEC-ACT (Advanced Concept Team). The collection comprises 32 ESA reports, tutorials and glossaries, forming 4,2 MB of textual material (about 673.000 words). Once extracted, candidate terms have been validated by a team of ESA experts.
3.1 Linguistic Step

As described in Sect. 2.1, linguistic techniques to extract terms from textual corpora mainly consist of syntactic filters used to retain particular linguistic forms (i.e., syntactic patterns) as candidate terms. Moreover, stop-lists and term variations can be taken into consideration as a further refinement. In order to examine these different techniques and to better understand the nature of terminology, we envision the linguistic step as an incremental process in which the performance of each technique is evaluated. Firstly, we extracted from the corpus those linguistic forms corresponding to specific syntactic patterns (admissible surface forms) (Table 3) considered as good prototypes of candidate terms, which are classified into k-word categories, where k indicates the number of main items contained in the term.

Table 3. Syntactic patterns used to extract k-word candidate terms, represented as regular expressions

  Terms Length   Syntactic Patterns
  1-word         (noun)
  2-word         (adj) (noun);  (noun) (noun);  (noun) (prep) (noun)
  3,4,5-word     (noun){3,5};  (noun) (prep) (noun){2,4};  (adj) (noun){2,4}
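To make the pattern-based extraction concrete, the following minimal sketch matches Table 3-style 2-word patterns over POS-tagged text. It is our own illustration, not the modular parser of [5] or the term extraction module of [31]; the coarse tag set and the example sentence are invented.

```python
import re

# One letter per coarse part of speech: N(oun), A(djective), P(reposition), O(ther).
POS_LETTER = {"NOUN": "N", "PROPN": "N", "ADJ": "A", "ADP": "P"}

# Rough analogues of the 2-word patterns of Table 3.
PATTERNS = ["AN", "NN", "NPN"]

def extract_candidates(tagged_sentence):
    """tagged_sentence: list of (token, coarse_pos) pairs."""
    letters = "".join(POS_LETTER.get(pos, "O") for _, pos in tagged_sentence)
    candidates = []
    for pattern in PATTERNS:
        for m in re.finditer(pattern, letters):
            tokens = [tok for tok, _ in tagged_sentence[m.start():m.end()]]
            candidates.append(" ".join(tokens))
    return candidates

sentence = [("the", "DET"), ("magnetic", "ADJ"), ("field", "NOUN"),
            ("of", "ADP"), ("the", "DET"), ("solar", "ADJ"), ("wind", "NOUN")]
print(extract_candidates(sentence))  # ['magnetic field', 'solar wind']
```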
In order to carry out the term extraction process we previously analyzed the corpus documents using a modular syntactic parser [5] together with a dedicated term extraction module [31]. Out of the 44.619 candidate terms extracted, 6346 have been retained as true terms by the ESA experts, leading to an overall Precision of 14%. Considering only terms which appear in the corpus more than 5 times, Precision increases to 38%, giving a first indication that frequency could be an interesting measure to select terms. Then, all the 44.619 candidate terms have been filtered using a generic stop-list of specific determiners (definite articles, demonstrative and possessive adjectives) and general determiners (indefinite articles and expressions such as few, many, some, etc.). The aim is to discard a priori candidates that, by definition, cannot be considered terms. In fact, determiners are generally defined as "non-descriptive words that have little meaning apart from the nouns they refer to". As terms should be formed only of meaningful words, candidates containing determiners should be discarded. The stop-list (comprising the following words: this, all, some, these, such, any, many, both, those, each, same, own, another, few, several, least, every, more, fewer, much, there, most) has
been automatically derived as the most frequent determiners extracted from a generic human-annotated sub-corpus of the British National Corpus. After the determiner stop-list pass, 2556 candidates are filtered out, increasing Precision from 14% to 15%, while Recall decreases only to 99,3%. A second, adjective stop-list, composed of the 200 validated most frequent adjectives of the sub-corpus, has been applied in order to verify the value of candidate terms containing generic common adjectives; the intuitive hypothesis is that common adjectives such as same, another, industrial, next, available, military are not significant enough to define a term. Results show a slight increase in Precision (18%) while Recall drops to 81%. A complete list of results using stop-lists is reported in Table 4, both for all terms and for the subclass of 2-word terms. As can be noticed, the subclass of 2-word terms has a higher Precision, whose motivations will be discussed later on. In general, the use of stop-lists seems to improve Precision, having as a side effect a decrease in coverage, mostly for terms of more than two words. In the rest of the study the set of terms obtained after the two stop-list filterings will be used for the analysis. It consists of 28.465 terms (among which 5134 validated as true terms) whose characteristics are summarized in Table 5 (excluding 21 spurious terms).

Table 4. Precision and Recall using stop lists
                        All Candidate Terms       2-Word Candidate Terms
                        Precision   Recall        Precision   Recall    F-Measure
  Before stop-lists     14,2%       100%          43,6%       100%      60,7%
  After det stop-list   15%         99,3%         44,2%       100%      61%
  After adj stop-list   18%         80,9%         47,1%       86,5%     61%
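The stop-list filtering itself is straightforward; a minimal sketch is shown below (the candidate list is invented, the stop-list is the one given in the text):

```python
DETERMINER_STOP_LIST = {
    "this", "all", "some", "these", "such", "any", "many", "both", "those",
    "each", "same", "own", "another", "few", "several", "least", "every",
    "more", "fewer", "much", "there", "most",
}

def filter_by_stop_list(candidates, stop_list):
    """Discard every candidate term containing at least one stop-list word."""
    return [c for c in candidates
            if not any(word.lower() in stop_list for word in c.split())]

candidates = ["this day", "magnetic field", "same orbit", "launch vehicle"]
print(filter_by_stop_list(candidates, DETERMINER_STOP_LIST))
# ['magnetic field', 'launch vehicle']
```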
A first conclusion from our study can already be drawn at this point: 2-word terms seem to be the most important and frequent terms (as already outlined in Sect. 2.2 and [14]), as out of the 5134 true terms 3150 (61,4%) are 2-word. This result is in line with the previous analysis carried out in [14], where 56% of the terms contained in a hand-collected terminology bank are 2-word. For the scope of this study we will thus hereafter focus mainly on the 2-word terms retained after the stop-list filtering. An interesting analysis concerns the syntactic structure (i.e. syntactic patterns) of the 2-word terms extracted and validated. In Table 6 the characteristics of 2-word terms classified by syntactic patterns are shown (referring specifically to English, of course; for other languages different values can be expected). The reported statistics take inflectional variation into consideration, that is, singular and plural forms of nouns are collapsed to a unique term (e.g. spacecraft(s) mission(s)). The most common terms are those of the form adjective-noun, followed by the form noun-noun, both in the extracted set
Table 5. Characteristics of terms as obtained after stop-list processing. Precision is intended as the number of correct terms (column 4) over the total number of terms of a certain class (column 2)

  Term Class   n. of Terms   % Over the Total   n. of Correct Terms   % of Correct Over Total Correct   Precision
  1-word       6625          23,3%              1177                  22,9%                             17,8%
  2-word       16369         57,6%              3150                  61,4%                             19,2%
  3-word       4229          14,9%              697                   13,6%                             16,5%
  4-word       978           3,4%               102                   2%                                10,4%
  5-word       243           0,8%               8                     0,1%                              3,3%
Table 6. Characteristics of validated 2-word terms by syntactic patterns. Precision is intended as column 4 over column 2

  Syntactic Pattern   n. of Terms   % Over 2-Word   n. of Correct Terms   % of Correct Over all 2-Word Correct   Precision
  adj noun            7122          43,5%           1363                  43,3%                                  19,1%
  noun noun           4714          28,8%           1206                  38,3%                                  25,6%
  noun prep noun      4022          24,6%           548                   17,4%                                  13,6%
  spurious            511           3,1%            33                    1%                                     6,4%
(column 2) and in the set of true (i.e., validated) terms (column 4). Examples of frequent noun-noun terms are application datum, test level, source packet; frequent adjective-noun terms are magnetic field, solar wind, technical requirement. Fewer terms have the form noun-prep-noun (for instance speed of light, factor of safety, satellite in orbit), most of which have "of" as preposition. Our results are fairly in line with those obtained in [14]. Even though our study has been applied to only one domain, it can be a first indication of the performance of linguistic approaches over different syntactic patterns for English. It is interesting to notice, for example, that noun-noun terms, while constituting only 28,8% of the extracted 2-word terms, are 38,3% of the true 2-word terms, and have the overall highest Precision (25,6%). This seems to point out that noun-noun forms are more promising than the others. For what concerns the issue of term variation already discussed in Sect. 2.1, we decided to leave the problem aside. In our view it is difficult to build an a priori methodology based on a linguistic theory able to justify the collapse of term variants into a base term, even in apparently obvious cases such as the "of" permutation. Collapsing a variant assumes in fact that the variant and the base term convey the same meaning, which is not always true. For example, in the "of" case we found variant-term couples such as list of definition – definition list and field of view – view field, which are not completely meaning preserving. The only exception is for inflectional variants, since singular-plural
variations on nouns can be roughly considered meaning preserving. In the literature some term recognition approaches take variations into consideration (e.g. [13]) while others prefer to leave the problem aside, as we do [26].

3.2 Statistical Step

In our approach the set of terms produced by the linguistic analysis is input to a successive statistical process, aimed at ranking terms according to their termhood or unithood properties. Our statistical analysis is twofold. On one side, a wide debate is still going on about what could be the most suitable measure for ranking and selecting terms. In fact, since it is impossible to define an objective and widely accepted golden standard/benchmark for measuring terminology, it is clearly difficult to establish measure performances and accuracy. As far as we know, only a few studies have tried to compare the most adopted measures on a common test bed (e.g., [13, 28]). In our view it is thus necessary to test the different measures on many different domain corpora in order to clarify and solve this issue. On the other side, we aim to point out the different characteristics of the measures we tested, in order to identify which properties of terms they let emerge from the ranks they produce. As test bed for the cross-evaluation of the measures we use the set of 2-word terms obtained after the linguistic analysis (term extraction and stop-list filtering). In particular, tests will be applied to the 949 terms with a frequency f ≥ 5 (i.e., terms that appear in the corpus 5 or more times). As suggested also in [13] and [20], the choice of using a frequency threshold over 2-word candidates seems to be the best compromise to obtain a functional set of terms for evaluating measures over a clean test bed. In fact, as demonstrated in [28], statistical methods perform badly when applied to very low frequency objects. As golden standard for evaluation we use the set of true terms validated by the ESA experts among the total of 949. True terms are 447, leading to an overall Precision (after the linguistic step) of 47,1%. We evaluate measures in two steps. Firstly, we apply the method used in [13]. Terms are ranked according to a specific measure and then divided into equivalence classes of 50 consecutive elements in the ranking. For each class Precision is calculated as the percentage of correct terms in the class. In this view the best statistical measure should be the one able to clearly separate true terms from false ones: that is, the ideal measure should assign the highest positions in the rank to the 447 true terms, leaving the remaining false terms to the lowest part of the ranking (see Fig. 1). In such a way what is evaluated is the power of each measure in discriminating true and false terms. As a second evaluation we simply use the standard method of plotting the Precision of a given measure at different Recall percentiles. Here, Recall is defined as the percentage of true terms contained in a ranking interval over
Fig. 1. The curve of an ideal measure to rank terms, with the Precision of the measure on the y axis at different equivalence classes (x axis). Equivalence classes are ordered by increasing value of the measure
the total of 447 true terms. Precision is thus the percentage of true terms at a given Recall percentile over the total number of terms at the same percentile. We compared some of the most widely used measures for term recognition, focusing on those that need only information about the specific domain in order to be calculated. That is, we don't take into consideration measures such as Tf*Idf or Domain Relevance that need some sort of corpora comparison. In fact, while comparing the lexical profiles of the relevant domain against a generic domain (or a set of different domains) appears to be useful in term recognition (since the definition of term itself underlines the importance of domain specificity), we want here to restrict our attention to the simplest (and more likely) cases in which only a domain corpus is available. The compared measures are thus: frequency, T-score, MI, MI3, Dice Factor, Log Likelihood Ratio (LLR), C-value and Co-occurrence. Results are summarized in the histograms in Fig. 2, in Fig. 3 and in Table 7.

Table 7. Precision at different Recall percentiles for statistical measures. In grey the best value at each specific percentile
  Recall Perc.   freq    t-score   mi      mi3     df      llr     c-value   co-occ
  0,1            75,0%   70,3%     30,4%   45,9%   39,8%   58,4%   70,3%     33,8%
  0,2            64,3%   63,8%     38,1%   47,1%   43,5%   60,0%   63,8%     38,8%
  0,3            54,7%   54,2%     39,4%   45,8%   45,9%   51,9%   54,4%     41,9%
  0,4            53,1%   54,5%     40,4%   46,0%   46,5%   50,6%   54,1%     45,2%
  0,5            52,4%   51,8%     42,6%   46,1%   46,6%   48,3%   52,4%     47,1%
  0,6            52,7%   51,7%     44,2%   46,0%   45,6%   47,4%   52,2%     47,3%
  0,7            51,5%   50,6%     44,7%   45,3%   46,5%   47,2%   51,0%     47,9%
  0,8            49,7%   49,3%     45,2%   46,6%   46,4%   47,9%   49,0%     48,4%
  0,9            47,4%   46,9%     45,7%   46,0%   45,5%   47,5%   47,3%     47,5%
  1              47,1%   47,1%     47,1%   47,1%   47,1%   47,1%   47,1%     47,1%
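The two evaluation protocols described above can be sketched as follows. This is a simplified re-implementation for illustration only, not the authors' code; ranked is a list of candidates ordered by decreasing value of a measure and gold is the set of validated true terms.

```python
import math

def precision_by_equivalence_class(ranked, gold, class_size=50):
    """Precision inside consecutive blocks of class_size ranked candidates
    (the equivalence-class evaluation described in the text)."""
    gold = set(gold)
    classes = [ranked[i:i + class_size] for i in range(0, len(ranked), class_size)]
    return [sum(t in gold for t in c) / len(c) for c in classes]

def precision_at_recall_percentiles(ranked, gold, steps=10):
    """Precision at the rank where each successive fraction (1/steps)
    of the gold terms has been recovered."""
    gold = set(gold)
    targets = [math.ceil(len(gold) * (i + 1) / steps) for i in range(steps)]
    found, out = 0, []
    for rank, term in enumerate(ranked, start=1):
        found += term in gold
        while targets and found >= targets[0]:
            out.append(found / rank)   # precision at this recall level
            targets.pop(0)
    return out
```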
[Fig. 2 is composed of eight histogram panels, one per measure: Frequency, Dice Factor, MI3, MI, Log Likelihood Ratio, T-Score, C-Value and Co-Occurrence.]
Fig. 2. Precision of different measures (y axis) at different equivalence classes (x axis). Equivalence classes are ordered by increasing value of the measure
[Fig. 3 plots one Precision curve per measure; legend: freq, t-score, mi, mi3, df, llr, c-value, co-occ.]
Fig. 3. Overall Precision of different measures (y axis) at different Recall percentiles (x axis)
Analysis of Results

At a first look at the histograms in Fig. 2 it emerges that a pool of measures seems to have an interesting behaviour, compared to what the ideal measure should look like. In particular, frequency, C-value and T-score have an overall decreasing trend, indicating that for lower values of the measure Precision decreases. This behaviour suggests that these three measures tend to assign higher values to true terms: so, the better a term is ranked by the measure, the higher is the probability of it being a true term. Frequency, C-value and T-score thus seem able to discriminate in some way among terms and to produce a significant rank. Other measures show an approximately flat curve, thus revealing themselves to be poor statistics for recognizing terms. In particular, it is interesting to notice how the degree of association measures (i.e., MI and Dice Factor) are characterized by a curve that grows in the first equivalence classes. That is, these measures tend to behave badly in the higher part of the rank (many of the terms with the highest scores are false terms). The reason for this behaviour lies in the already mentioned (see Sect. 2.2) problem of low frequency that afflicts MI and Dice Factor: these measures give too high a score to rare events (e.g. to terms composed of rare words), which could however be useful for recognizing very rare terms appearing in large document collections. To clarify, Table 8 shows the first 20 terms ranked by MI and by frequency, together with the occurrence values of the term words in the corpus. As can be noticed, MI tends to rank higher terms composed of low-frequency words: those terms, while having a high association score, are usually not interesting, since they are very rare linguistic expressions of the corpus. On the contrary, the most relevant terms according to frequency have been successfully validated by the experts, suggesting that a recurrent expression is in fact a good term.
Table 8. 20 higher ranked 2-word terms by MI (left) and by frequency (right). R1, C1 and O11 are respectively the occurrences of the first word, the second word and the term. Experts validation (True or False terms) is in the last column

  Ranked by MI:                O11   R1   C1   val
  tape recorder                  6    6    1   F
  extension of maximum           6    9    3   F
  additive for processing        6    8    3   F
  scan platform                  6    9    4   F
  circuit board                  6    7    4   T
  adaptive routing               7   10    4   T
  capacity of spur               5   12    7   F
  nic fluctuation                5   14    9   F
  audible noise                  6   12    7   F
  million of dollar              5    6    7   F
  industry association          12   13    3   F
  cleaning agent                 5   10    8   F
  destination identifiers        5    9    8   F
  remote sensing                 5    5   12   T
  statement of effectivity       9   18    9   F
  accordance with subclause     15   19    4   F
  imaginary circle               5   12   10   F
  pound of payload               8   12    9   F
  behavioural view               5   14   11   F
  look-up table                  6   20   14   F

  Ranked by frequency:         O11    R1    C1   val
  application datum            122   581   510   T
  magnetic field               104   246   231   T
  solar wind                   101   119   483   T
  technical requirement         83  1000  1098   T
  test level                    69   355   677   T
  source packets                61   147   173   T
  source datum                  60   581   609   F
  normative document            59   108    53   F
  technical specification       58   156   304   F
  launch vehicle                53   104   140   T
  mechanical part               50   142   187   T
  mission phase                 50   135   267   T
  test requirement              48  1000  1365   T
  performance requirement       47  1000  1014   T
  user manual                   46    65    28   F
  flight operation              43   351   387   T
  propulsion system             42   402   386   T
  gray system                   41   402   500   F
  sub-service provider          40    43    22   F
  engineering process           39   204   264   T
Comparing the histograms of MI and MI3, it can be noticed how the latter measure seems to act successfully in removing the problem of low frequency (as indicated in [13]) for the first equivalence classes; notwithstanding, MI3 doesn't seem to be interesting anyway, being characterized for the rest by the same flat curve as MI. Interestingly, LLR, which has been proved to be a useful measure for term recognition in other studies (e.g., [13, 16]), doesn't seem to give the same indication in our experiment (see its poorly characterized curve in Fig. 2), even though it presents a slightly decreasing trend. Summing up, there isn't a measure that presents a histogram comparable to the ideal curve; however, some of them are able to produce a rank in which the probability of finding correct terms is higher in the higher positions of the rank. Results obtained for the second evaluation are reported in Fig. 3 and Table 7. A first analysis of the Precision curves reveals a neat distinction into two curve classes. Indeed, a group of measures starts with a high Precision (between 60–75% at the first percentile) and then decreases quite substantially. On the contrary, a second group starts with very low Precision (between 30–45%) and then slightly increases. It is interesting to notice that the first group comprises frequency, T-score, LLR and C-value, the second MI, MI3, Dice
Factor and Co-occurrence. The first group is thus composed of measures strongly based on frequency (C-value and frequency itself) and of significance of association measures (T-score and LLR). All these measures outperform the second group at almost all percentiles, indicating that frequency and the statistical null hypothesis of independence are better means to rank and recognize terms than the probability parameter approximation methods used by the degree of association measures. The markedly low values of some of these latter measures at the first percentiles are again evidence of their low-frequency problem. Also from this second evaluation method frequency emerges as the best measure, since its Precision is higher at almost all percentiles, while the worst measure appears to be MI. Moreover, it emerges that theoretically similar measures such as MI, MI3 and Dice Factor have different behaviours. In particular, MI3 performs better at the beginning (thanks to the solved low-frequency problem) and then becomes similar to Dice Factor, while MI remains quite apart at lower Precision values. Considering the results of the two evaluations, it can be noticed how significance of association measures (T-score and LLR) perform better than degree of association measures (MI, MI3 and Dice Factor), while a few of the heuristic measures have good performances (such as frequency). In theory, this could be justified by the different statistical methodologies that degree and significance measures use to calculate the association score: it would thus emerge that it is better to adopt methods that use the null hypothesis of independence rather than those that only try to approximate probability parameters with MLE. For what concerns the other statistical dimension, no final conclusion can be drawn about the statistical behaviours of measures of termhood and unithood, since the measures' curves don't seem to be characterized by these properties. Notwithstanding, an interesting linguistic analysis is to compare the highest ranked terms by the best measures of termhood and unithood, in order to see what terms with high termhood and terms with high unithood look like. In Table 9 the first 20 terms are reported for the best measure of termhood (frequency) and the two best measures of unithood (LLR and T-score). At first glance it can be noticed how the first three terms in the rank are common to the three measures, while, going down in the ranking, agreement decreases, suggesting a certain stability among measures in selecting the highest-ranked terms. Frequency has 17 terms in common with T-score, and only 11 with LLR, while T-score and LLR share 13: this seems to confirm that the classification of measures along the linguistic dimension termhood-unithood has no practical importance. In conclusion, frequency appears to be the best measure (as confirmed in [13] and [20]), followed by T-score and C-value. LLR doesn't show performances as good as in other studies, while behaving better than MI, MI3 and Dice Factor, whose recognition power seems substantially poor. The poor results of Co-occurrence appear to indicate that, at a first analysis, information about term co-occurrence in text is not an interesting property to distinguish true from false
Table 9. Higher ranked 2-word terms by frequency, LLR and T-score

  Freq                        LLR                          T-score
  application datum           magnetic field               application datum
  magnetic field              application datum            magnetic field
  solar wind                  solar wind                   solar wind
  technical requirement       normative document           technical requirement
  test level                  abbreviated term             test level
  source packets              user manual                  source packets
  source datum                sub-service provider         source datum
  normative document          source packets               normative document
  technical specification     launch vehicle               functional test
  launch vehicle              electromagnetic radiation    technical specification
  mechanical part             architectural design         abbreviated term
  mission phase               mechanical part              launch vehicle
  test requirement            technical specification      electromagnetic radiation
  performance requirement     parameter statistic          mechanical part
  user manual                 telecommand packets          mission phase
  flight operation            mission phase                test requirement
  propulsion system           logical address              performance requirement
  gray system                 minimum capability           user manual
  sub-service provider        pressure vessel              flight operation
  engineering process         functional test              propulsion system
terms; that is, terms don't seem to have the property of appearing together, concentrating in specific sections of texts. Taking computational complexity into consideration, the position of frequency gets even stronger, since its computational cost is negligible: frequency can be calculated as the occurrence of terms in the corpus during the linguistic step. The other interesting measures, T-score, C-value and LLR, while being comparable to frequency in terms of recognition performance, are not from a computational point of view. Their good recognition power is thus outweighed by their computational cost.
4 Conclusions

In this paper the problem of automatic term recognition has been widely analyzed, also in view of the continuously growing interest in terminology as a useful hint for ontology learning as well as for supporting the Semantic Web. This required converging on an operational definition of term (to be effective in an extraction system) and agreeing on the need for both linguistic and numerical knowledge in systems with such an ability. The minimal set of needed linguistic processes has been underlined and described in a general architecture for terminology extraction. Then, a large set
of widely adopted statistical measures have been applied and comparatively evaluated in order to determine their role in improving terminology extraction. A real corpus has been used to produce a list of candidate terms, and their evaluation has been possible thanks to a parallel manual validation produced by human experts. The overall system performances have been compared with state-of-the-art results, showing the system's higher reliability. As a last point, we underline that with our approach we are able to recognize (in the processed corpus) linguistic expressions that are real terms while not being validated by the experts, who are interested in a tightly specific application domain (e.g., tape recorder). This is due to the fact that assuming that "the corpus contains only terms related to the application domain" is not totally correct: the jargon of the writers covers, in fact, a wider context than the specific domain of interest.
References

1. Ananiadou, S., Maynard, D.: Identifying contextual information for term extraction. In: Proc. of 5th International Congress on Terminology and Knowledge Engineering (1999)
2. Basili, R., Pazienza, M.T., Velardi, P.: An Empirical Symbolic Approach to Natural Language Processing. Artificial Intelligence, Vol. 85 (1996)
3. Basili, R., De Rossi, G., Pazienza, M.T.: Inducing Terminology for Lexical Acquisition. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2), Brown University, Providence, Rhode Island (1997)
4. Basili, R., Bordoni, L., Pazienza, M.T.: Extracting terminology from corpora. In: Proc. of the 2nd International Conference on Terminology, Standardization and Technology Transfer (1997)
5. Basili, R., Pazienza, M.T., Zanzotto, F.M.: Customizable modular lexicalized parsing. In: Proc. of the 6th International Workshop on Parsing Technology (2000)
6. Basili, R., Missikoff, M., Velardi, P.: Identification of relevant terms to support the construction of Domain Ontologies. ACL workshop on HLT, Toulouse, France (2001)
7. Basili, R., Pazienza, M.T., Zanzotto, F.M.: Decision trees as explicit domain term definition. 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan (2002)
8. Benveniste, E.: Problèmes de linguistique générale. Gallimard (1966)
9. Bourigault, D.: Surface grammatical analysis for the extraction of terminological noun phrases. In: Proc. of the Fifteenth International Conference on Computational Linguistics (1992)
10. Brill, E.: Some advances in transformation-based part-of-speech tagging. In: Proceedings of the 15th International Conference on Computational Linguistics, 1034-1038 (1994)
11. Church, K.W., Hanks, P.: Word Association Norms, Mutual Information and Lexicography. ACL (1989), 76–83
12. Church, K.W., Gale, E., Hanks, P., Hindle, D.: Using statistics in lexical analysis. In: Lexical Acquisition: Using On-line Resources to Build a Lexicon, Lawrence Erlbaum (1991)
13. Daille, B.: Approche mixte pour l'extraction de terminologie: statistique lexicale et filtres linguistiques. PhD Thesis, C2V, TALANA, Université Paris VII (1994)
14. Daille, B., Habert, B., Jacquemin, C., Royauté, J.: Empirical observation of term variations and principles for their description. Terminology, 3(2) (1996) 197–258
15. Dennis, Sally F.: The construction of a thesaurus automatically from a sample of text. In: Proceedings of the Symposium on Statistical Association Methods For Mechanized Documentation, Washington, DC (1965) 61–148
16. Dunning, T.: Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics 19(1) (1994) 61–74
17. Earl, L.L.: Experiments in Automatic Extracting and Indexing. Information Storage and Retrieval 6(X) (1970) 273–288
18. Enguehard, C., Pantera, L.: Automatic Natural Language acquisition of a terminology. Journal of Quantitative Linguistics 2(1) (1994) 27–32
19. Evans, D.A., Zhai, C.: Noun-phrase analysis in unrestricted text for information retrieval. In: Proceedings of the 34th conference of the Association for Computational Linguistics, Santa Cruz, California (1996) 17–24
20. Evert, S., Krenn, B.: Methods for the qualitative evaluation of lexical association measures. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, Toulouse, France (2001) 188–195
21. Fano, R.M.: Transmission of Information: A Statistical Theory of Communications. MIT Press, Cambridge, MA (1961)
22. Frantzi, K.T., Ananiadou, S.: Extracting Nested Collocations. COLING (1996) 41–46
23. Hisamitsu, T., Tsujii, J.: Measuring Term Representativeness. Third Summer Convention on Information Extraction (SCIE 2002), Roma, Italy (2002)
24. Jacquemin, C.: Variation terminologique: Reconnaissance et acquisition automatiques de termes et de leurs variantes en corpus. Mémoire d'Habilitation à Diriger des Recherches en informatique fondamentale, Université de Nantes, France (1997)
25. Jones, L.P., Gassie, E.W., Radhakrishnan, S.: INDEX: The statistical basis for an automatic conceptual phrase-indexing system. Journal of the American Society for Information Science 41(2) (1990) 87–97
26. Justeson, J., Katz, S.: Technical Terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1 (1995) 9–27
27. Kageura, K., Umino, B.: Methods of automatic term recognition. Terminology, 3(2) (1996)
28. Krenn, B.: Empirical Implications on Lexical Association Measures. In: Proceedings of the Ninth EURALEX International Congress, Stuttgart, Germany (2000)
29. Nakagawa, H., Mori, T.: Automatic term recognition based on statistics of compound nouns and their components. Terminology 9(2):201 (2003)
30. Pazienza, M.T.: A domain specific terminology extraction system. International Journal of Terminology, Benjamin Ed., Vol. 5.2 (1999) 183–201
31. Pazienza, M.T., Pennacchiotti, M., Vindigni, M., Zanzotto, F.M.: Shumi, Support To Human Machine Interaction. Technical Report, ESA-ESTEC contract N.18149/04/NL/MV – Natural Language Techniques in Support of Spacecraft Design (2004)
32. Salton, G., Yang, C.S., Yu, C.T.: A Theory of term importance in automatic text analysis. Journal of the American Society for Information Science 26(1) (1975) 33–44
33. Smadja, F.A., McKeown, K., Hatzivassiloglou, V.: Translating collocations for bilingual lexicons: a statistical approach. Computational Linguistics, 22:1 (1996)
34. Zanzotto, F.M.: L'estrazione della terminologia come strumento per la modellazione di domini conoscitivi. PhD Thesis, Università degli Studi di Roma Tor Vergata (2002)
Analysis of Biotechnology Patents

Antoine Spinakis and Asanoula Chatzimakri

QUANTOS SARL, 154 Sygrou Av., 176 71 Athens, Greece
[email protected]
Abstract. In this paper the process and the main conclusions of an analysis of Biotechnology Patents, carried out with the STING software, are presented. The analysis was applied to biotechnology patents that were consolidated during the years 1995–2003.
1 Introduction

Biotechnology is a technological sector that has seen tremendous growth during the past decade. It can be considered as a technology at its peak of development and at the centre of interest of scientists and companies. Consequently, those who are actively involved in the field of biotechnology require an estimation of the existing situation and of the technological innovation in a reliable and scientific way [1, 7, 8, 9, 10]. In order to study the situation in this particular technological sector worldwide, and particularly in Europe, we analysed patents that were certified during the years 1995–2003. The analysis has been done with the use of the STING software, which is specialised in the analysis of patents. The analysis of patents is based on the usage of simple statistics and multidimensional techniques, such as Correspondence Analysis, Factor Analysis and Cluster Analysis. For the particular research 2064 patents have been used.
2 Objectives of the Analysis

The main objectives of the particular case study were the following:
• The exploration of the evolution of Biotechnology Patents Consolidation per Geographical Region.
• The rhythm of Biotechnology Patents consolidation through time.
• The identification of the countries that have an increased activity in the Biotechnology sector.
• Explore the influence that particular sciences have in the development of Biotechnology.
• Creation of homogeneous clusters composed of common methodologies and applications.
3 Description of Applied Methodology

For the specific analysis, concrete factors were selected as fundamental axes for the observation of the developments in the field of biotechnology. More concretely, variables such as: 1) the year of the consolidation of patents, 2) the year of the deposit of applications for the consolidation of an invention, 3) the country where an invention was certified, 4) the country of origin of an application for the consolidation of a license and 5) the scientific frame in which an invention was developed, constituted the basic elements for the derivation of technology indicators in the area of biotechnology during the past 8 years [5]. The analysis was based on Simple Statistics and Multidimensional Techniques. In more detail, the analysis process was performed as follows:

3.1 Linguistic Preprocessing

Once the biotechnology patents were imported into the STING software, the linguistic processing of the data was performed. As far as the preprocessing of the textual data is concerned, lemmatization and part-of-speech assignment were automatically performed. Lemmatization consists in restricting the morphological variation of the textual data by reducing each of the different inflections of a given word form to a unique canonical representation (or lemma). In order to reduce the vocabulary size further, it is possible to restrict the analysis to those specific word categories (as identified by the assigned parts-of-speech) that bear most of the semantic content (nouns, verbs, adjectives and symbols), as opposed to all other grammatical forms. Furthermore, a combination of all these can be used. It is also worth noting that the user can select to include in the analysis either the words existing in titles or in abstracts, or a combination of these. The above-mentioned process is illustrated in Fig. 1.

Fig. 1. Basic Steps of Linguistic Processing through STING

In order to reduce the vocabulary size further, we restricted the analysis to specific words by using as filtering criteria the frequency of the words and their syntactic role. Finally, 1115 lemmas were analyzed.

3.2 Simple Statistics

Once a vocabulary was constructed, Simple Statistics were applied for the construction of various indicators. Those indicators provided information related to the evolution of the Biotechnology sector through time or per geographical region. In addition, the contribution of particular scientific domains to biotechnology was also measured.

3.3 Multidimensional Techniques

Correspondence Analysis: In traditional approaches to patent analysis, knowledge about the content of a patent is usually restricted to (parts of) the IPC codes that have been assigned to that patent during the application process. Although the IPC represents a quite rich hierarchy of codes, not taking into account the textual content of the patent may be considered as a limitation to a fully efficient exploitation of patent data. Therefore, one of the important characteristics of the presented methodology is the integration of textual data analysis techniques for the processing of the textual content of the patents (titles and abstracts). The underlying idea is that the vocabulary that is characteristic of patent classes, built on the basis of various descriptive variables associated with the patents (such as country or date of application), provides additional interesting insights for the analysis of the patents themselves [3].

Cluster Analysis: In the case of textual data, clustering techniques are used for representing proximities between the elements of lexical tables. In the general case, cluster analysis operates on contingency tables to identify relationships between two different nominal variables. In the case of patent data, the aim of the procedure is to identify groups of technologies that share
common vocabulary and groups of patents that share common technologies, in order to derive conclusions about technological trends and innovation [6]. More specifically, we apply cluster analysis to contingency tables cross-tabulating full IPC codes and words, in order to identify homogeneous groupings of, and relevant relationships between, groupings of IPC codes (resp. patents). The information captured in such clusters and inter-cluster relationships can then be directly used for the production of technological indicators, the goal being to identify areas of technology that share common characteristics, as well as innovative areas characterized by isolated clusters. In the last stage of the analysis we applied clustering techniques. A cluster map represented homogeneous groups of sectors in which biotechnology is applied. The particular map also represented the relationships among the generated groups.
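A minimal sketch of this kind of clustering of a contingency table is shown below. It is only an illustration of the general idea, not the STING implementation: the toy counts, the IPC class labels, the choice of row profiles as features and the scikit-learn parameters are all our own assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy contingency table: rows = IPC classes, columns = lemma frequencies.
ipc_classes = ["A61K", "C12N", "C12P", "A01H"]
lemmas = ["vaccine", "fermentation", "enzyme", "plant", "acid"]
counts = np.array([
    [40,  2,  5,  1,  3],   # A61K: preparations for medical purposes
    [ 3, 30, 25,  4, 10],   # C12N: micro-organisms or enzymes
    [ 1, 28, 20,  2, 15],   # C12P: fermentation or enzyme-using processes
    [ 2,  3,  4, 35,  1],   # A01H: new plants
])

# Row profiles (relative lemma distributions), so that class size
# does not dominate the distance computation.
profiles = counts / counts.sum(axis=1, keepdims=True)

clustering = AgglomerativeClustering(n_clusters=3, linkage="average")
labels = clustering.fit_predict(profiles)
for ipc, label in zip(ipc_classes, labels):
    print(ipc, "-> cluster", label)
# With these toy counts, C12N and C12P end up in the same cluster,
# since they share a similar vocabulary profile.
```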
4 Basic Conclusions

Observing the situation in the field of biotechnology, we realize that the development of the Consolidation of Licenses per Geographic Region is remarkable, since it provides a complete picture of the diachronic observation of consolidations both on a European and on an international level. Figure 2 presents the results of the specific analysis. The diagram shows that the percentage of new technologies related to biotechnology is lower in Europe than in the rest of the world. Furthermore, a systematic pattern is noticed concerning the consolidation of patents. More specifically, it appears that in periods where a big percentage of licenses were granted worldwide, the corresponding percentage in Europe was low. In addition, when the number of new technologies in Europe increased, in the rest of the world it decreased. Finally, we observe that during the past two years (2002–2003), biotechnology in Europe has had an ascending course, while in the rest of the world it was declining. Another interesting aspect of the analysis was the study of the rhythm of license consolidation per year. Thus, through the analysis with STING, the following graphic was composed, which presents the year of realization of an invention according to the year of its application for consolidation. Based on these results, it seems that from the moment of the application for the acquisition of a Patent Certificate until the moment that the particular license was officially granted, there was an interval of almost three years. Consequently, in the field of biotechnology, new technologies appeared with a delay of 3 years from the moment of their materialization. However, Fig. 3 shows that since 2001 there has been a tremendous growth in the rhythm of the consolidation of new technologies, so that, in 2003, 56,19% of Patents concerned solutions that had been proposed barely a year earlier. In addition, STING, being a useful tool for operational analysts and investors, was used for the study of the activity of countries, both in Europe
Fig. 2. Evolution of Patent Consolidation per Geographical Region
and worldwide, in the sector of biotechnology. The results of the analysis appear in the figure that follows. From that figure it is obvious that the USA dominated in the fields of research and presentation of technological proposals concerning biotechnology. However, as time passed, a declining course in the activity of the USA in biotechnology was observed, together with a progressive entry of new countries in terms of proposals in the particular technological field. Additionally, Great Britain, France and Japan develop an intense activity in the field of biotechnology, while Greece is still at low levels in relation to other countries. With STING it was possible to study the contribution of concrete scientific fields to the development of new technologies related to biotechnology. These scientific fields are the following:
1. (Recuperation of) Human Needs
2. Chemistry & Metallurgy
3. Constructions
4. Weaving and Paper-industry
5. Physics
The results of the analysis are shown in Fig. 5. In regard to the consolidation of new inventions in Biotechnology within a specific year, both in Europe and worldwide, the percentages of Patents that were developed within the frames of Human Needs and of Chemistry and Metallurgy were particularly high. During the past 8 years, the sectors of Human Needs and of Chemistry and Metallurgy appeared to be the dominant scientific fields in the service of biotechnology. Consequently, licenses that were
Fig. 3. Rhythm of Consolidation of Patent Certificates
Fig. 4. Activity of countries in Biotechnology
developed on the basis of these sectors were more likely to be established as the new technologies in biotechnology. Finally, an additional part of the analysis of patents in biotechnology had to do with the creation of homogeneous clusters that were composed of common methodologies and applications. More concretely, these clusters structure
Fig. 5. Contribution of Scientific fields in Biotechnology
Fig. 6. Map of clusters in Biotechnology
a map of applications in the sector of biotechnology that presents homogeneous regions in respect of the object of research and development within them. According to the analysis of the lexicographical data of 2046 Patents, the following technological map was shaped for the homogeneous clusters. For each structured cluster there is a descriptive title as well as a short presentation in terms of its function. From the map it seems that certain clusters are connected with one another mainly because they develop technologies in common scientific fields of biotechnology. A short description of the shaped clusters follows.
Final Clusters of the Biotechnology Map
• CLUSTER 1: Production of vaccines for the confrontation of transmitted illnesses like chickenpox, herpes and so on, but also of non-transmitted illnesses like peritonitis, Alzheimer's, cancer and allergies.
• CLUSTER 2: Production of pharmaceutical products for the detection and confrontation of diseases.
• CLUSTER 3: Methodologies and procedures to fight diseases like hepatitis, polio, newDarwinism as well as for the confrontation of malignant tumors.
• CLUSTER 4: Methodologies, where plant and animal derivatives are used, for the production of chemical substances against diseases.
• CLUSTER 5: Production of substances for the support of the immune system of humans and plants.
• CLUSTER 6: Methodologies for the production of pharmaceutical substances for the confrontation of diseases related with cholesterol.
• CLUSTER 7: Methods of fermentation for the production of lactic acid and other special acids as well as for the segregation of substances of low molecular structure.
• CLUSTER 8: Methods of production of chemical substances and of confrontation of diseases in Molecular Biology.
• CLUSTER 9: Production of medicines against animal diseases that affect humans as well as medicines against infections.
• CLUSTER 10: Methods of protection of humans and animals against diseases.
• CLUSTER 11: Production of chemical substances that protect cells against destruction, HIV, clogging, etc.
• CLUSTER 12: Use of cereals in the fermentation process and methods of producing them as components of the particular process.
• CLUSTER 13: Production of antifungal substances.
• CLUSTER 14: Growth of biological products for commercial use, i.e. production of chemical substances for the amelioration of fermentation and the confrontation of industrial and urban wastes.
• CLUSTER 15: Development of fermentation methods for the production of alcoholic drinks and pharmaceutical substances as well as methodologies of production of ferments from raw material.

The basic conclusions of the analysis can be summarized as follows:
• The percentage of new technologies related to biotechnology is lower in Europe than in the rest of the world.
• It appears that in periods where a big percentage of licenses were granted worldwide, the corresponding percentage in Europe was low.
• During the past two years (2002-2003), biotechnology development in Europe has had an ascending rate, while in the rest of the world it was declining.
• Since 2001 there has been a tremendous growth in the rhythm of the consolidation of new technologies, so that, in 2003, 56.19% of Patents concerned solutions that had been proposed barely a year earlier.
• A declining course in the activity of the USA in biotechnology through time was observed, together with a progressive entry of new countries in terms of proposals in the particular technological field.
• During the past 8 years, the sectors of Human Needs and of Chemistry and Metallurgy appeared to be the dominant scientific fields in the service of biotechnology.
• The cluster map illustrates 15 groups of biotechnology application areas. Although there are clusters which are related to others, there are also individual clusters, which represent areas of applications not associated with other areas.
References

1. Benzécri J.-P. et al. L'Analyse des Données, volume II: L'Analyse des Correspondances. Dunod (1973)
2. Comanor W.S. and Scherer F.M. Patent statistics as a measure of technical change. Journal of Political Economy, 77(3): 392–398 (1969)
3. Davison A.C. and Hinkley D.V. Bootstrap Methods and their Application. Cambridge University Press (1997)
4. Dou H. Veille technologique et compétitivité - L'intelligence économique au service du développement industriel. Dunod, Paris (1995)
5. Griliches Z., Pakes A. and Hall B.H. "The Value of Patents as Indicators of Inventive Activity," in P. Dasgupta and P. Stoneman (eds), Economic Policy and Technological Performance. Cambridge, England: Cambridge University Press, 97–124 (1987)
6. Guellec D. and van Pottelsberghe B. New indicators from patent data. In Proc. of Joint NEST/TIP/GSS Workshop (1998)
7. Johnson R. and Wichern D. Applied Multivariate Statistical Analysis. Prentice-Hall, Inc. (1998)
8. Lebart L., Morineau A. and Piron M. Statistique exploratoire multidimensionnelle. Dunod, 2nd edition (1997)
9. Lebart L., Salem A., Berry L. Exploring Textual Data, volume 4. Kluwer Academic Publishers (1998)
10. Rajman M., Peristera V., Chappelier J.-C., Seydoux F., Spinakis A. Evaluation of Scientific and Technological Innovation using statistical analysis of patents. In 6es Journées internationales d'analyse statistique des données textuelles (JADT), France (2002)