Lecture Notes in Artificial Intelligence Edited by R. Goebel, J. Siekmann, and W. Wahlster
Subseries of Lecture Notes in Computer Science
6657
Cory Butz Pawan Lingras (Eds.)
Advances in Artificial Intelligence 24th Canadian Conference on Artificial Intelligence, Canadian AI 2011 St. John’s, Canada, May 25-27, 2011 Proceedings
Series Editors Randy Goebel, University of Alberta, Edmonton, Canada Jörg Siekmann, University of Saarland, Saarbrücken, Germany Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany Volume Editors Cory Butz University of Regina, Department of Computer Science 3737 Wascana Parkway, Regina, Saskatchewan, Canada S4S 0A2 E-mail:
[email protected] Pawan Lingras Saint Mary’s University, Department of Mathematics and Computing Science Halifax, Nova Scotia, Canada B3H 3C3 E-mail:
[email protected]
ISSN 0302-9743 e-ISSN 1611-3349 ISBN 978-3-642-21042-6 e-ISBN 978-3-642-21043-3 DOI 10.1007/978-3-642-21043-3 Springer Heidelberg Dordrecht London New York Library of Congress Control Number: 2011926783 CR Subject Classification (1998): I.3, H.3, I.2.7, H.4, F.1, H.5 LNCS Sublibrary: SL 7 – Artificial Intelligence
© Springer-Verlag Berlin Heidelberg 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Preface This volume contains the papers presented at the 24th Canadian Conference on Artificial Intelligence (AI 2011). The conference was held in St. John’s, Newfoundland and Labrador, during May 25–27, 2011, and was collocated with the 37th Graphics Interface Conference (GI 2011), and the 8th Canadian Conference on Computer and Robot Vision (CRV 2011). The Program Committee received 81 submissions for the main conference, AI 2011, from across Canada and around the world. Each submission was reviewed by a minimum of four and up to five reviewers. For the final conference program and for inclusion in these proceedings, 23 regular papers, with allocation of 12 pages each, were selected. Additionally, 22 short papers, with allocation of 6 pages each, were accepted. Finally, 5 papers from the Graduate Student Symposium appear in the proceedings, each of which was allocated 4 pages. The conference program featured three keynote presentations by Corinna Cortes, Head of Google Research, New York, David Poole, University of British Columbia, and Regina Barzilay, Massachusetts Institute of Technology. One pre-conference workshop on text summarization, with its own proceedings, was held on May 24, 2011. This workshop was organized by Anna Kazantseva, Alistair Kennedy, Guy Lapalme, and Stan Szpakowicz. We would like to thank all Program Committee members and external reviewers for their effort in providing high-quality reviews in a timely manner. We thank all the authors of submitted papers and the authors of selected papers for their collaboration in preparation of the final copy. The conference benefited from the practical perspective brought by the participants in the Industry Track session. Many thanks to Svetlana Kiritchenko, Maria Fernanda Caropreso, and Cristina Manfredotti for organizing the Graduate Student Symposium, and chairing the Program Committee of the symposium. The coordinating efforts of General Workshop Chair Sheela Ramanna are much appreciated. We express our gratitude to Jiye Li for her efforts in compiling these proceedings as the Proceedings Chair. We thank Wen Yan (Website Chair), Atefeh Farzindar (Industry Chair) and Dan Wu (Publicity Chair), for their time and effort. We are in debt to Andrei Voronkov for developing the EasyChair conference management system and making it freely available to the academic world. EasyChair is a remarkable system with functionality that saved us a significant amount of time. The conference was sponsored by the Canadian Artificial Intelligence Association (CAIAC), and we thank the CAIAC Executive Committee for the constant support. We would like to express our gratitude to John Barron, the AI/GI/CRV General Chair, Andrew Vardy, the AI/GI/CRV Local Arrangements Chair, and Orland Hoeber, the AI Local Organizing Chair, for making AI/GI/CRV 2011 an enjoyable experience. March 2011
Cory Butz Pawan Lingras
Organization
AI/GI/CRV 2011 General Chair John Barron
University of Western Ontario
AI Program Committee Chairs Cory Butz Pawan Lingras
University of Regina Saint Mary’s University
AI/GI/CRV Local Arrangements Chair Andrew Vardy
Memorial University of Newfoundland
AI Local Organizing Chair Orland Hoeber
Memorial University of Newfoundland
Graduate Student Symposium Chairs Svetlana Kiritchenko Maria Fernanda Caropreso Cristina Manfredotti
National Research Council Defence R&D University of Regina
AI 2011 Program Committee Esma Aimeur Massih Amini Aijun An Xiangdong An Dirk Arnold Salem Benferhat Petra Berenbrink Sabine Bergler Virendra Bhavsar Cory Butz Maria Fernanda Caropreso Colin Cherry David Chiu
Chris Drummond Marek Druzdzel Zied Elouedi Larbi Esmahi Atefeh Farzindar Paola Flocchini Michel Gagnon Qigang Gao Yong Gao Dragan Gasevic Ali Ghorbani Cyril Goutte Kevin Grant
Lyne Da Sylva Joerg Denzinger Orland Hoeber Jimmy Huang Frank Hutter Diana Inkpen Christian Jacob Nathalie Japkowicz Richard Jensen Maneesh Joshi Igor Jurisica Vlado Keselj Svetlana Kiritchenko Ziad Kobti Grzegorz Kondrak Leila Kosseim Adam Krzyzak Philippe Langlais Guy Lapalme Oscar Lin Pawan Lingras Hongyu Liu Jiming Liu Alejandro Lopez-Ortiz Simone Ludwig Alan Mackworth Anders L. Madsen Cristina Manfredotti Yannick Marchand Robert Mercer Evangelos Milios David Mitchell Sushmita Mitra Malek Mouhoub David Nadeau Eric Neufeld Roger Nkambou Sageev Oore Jian Pei Gerald Penn Laurent Perrussel Fred Popowich Bhanu Prasad Doina Precup
Howard Hamilton Robert Hilderman Sheela Ramanna Robert Reynolds Denis Riordan Samira Sadaoui Eugene Santos Anoop Sarkar Jonathan Schaeffer Oliver Schulte Mahdi Shafiei Mohak Shah Weiming Shen Mike Shepherd Daniel L. Silver Shyamala Sivakumar Dominik Slezak Marina Sokolova Luis Enrique Sucar Marcin Szczuka Stan Szpakowicz Ahmed Tawfik Choh Man Teng Eugenia Ternovska Thomas Tran Thomas Trappenberg Andre Trudel Peter van Beek Paolo Viappiani Hai Wang Harris Wang Xin Wang Dunwei Wen Rene Witte Dan Wu Yang Xiang Jingtao Yao Yiyu Yao Jia-Huai You Haiyi Zhang Harry Zhang Xiaokun Zhang Sandra Zilles Nur Zincir-Heywood
External Reviewers Connie Adsett Ameeta Agrawal Aditya Bhargava Solimul Chowdhury Elnaz Delpisheh Alban Grastien Franklin Hanshar Hua He Michael Horsch Yeming Hu Ilya Ioshikhes Michael Janzen Hassan Khosravi Marek Lipczak Stephen Makonin
Yannick Marchand Marie-Jean Meurs Felicitas Mokom Majid Razmara Fatemeh Riahi Maxim Roy Shahab Tasharrofi Milan Tofiloski Baijie Wang Jacek Wolkowicz Xiongnan Wu Safa Yahi Qian Yang Martin Zinkevich
Graduate Student Symposium Program Committee Ebrahim Bagheri Julien Bourdaillet Scott Buffet Maria Fernanda Caropreso Kevin Cohen Diana Inkpen Nathalie Japkowicz Svetlana Kiritchenko Guy Lapalme Bradley Malin
Cristina Manfredotti Stan Matwin Fred Popowich Mohak Shah Marina Sokolova Bruce Spencer Stan Szpakowicz Jo-Anne Ting Paolo Viappiani
Sponsoring Institutions and Companies Canadian Artificial Intelligence Association/Association pour l’intelligence artificielle au Canada (CAIAC) http://www.caiac.ca Memorial University http://www.mun.ca/ Compusult (Gold sponsor) http://www.compusult.net/
Palomino System Innovations Inc. http://www.palominosys.com University of Regina http://www.uregina.ca/ Saint Mary’s University http://www.smu.ca/ NLP Technologies Inc. http://nlptechnologies.ca Springer http://www.springer.com/
Table of Contents
Dynamic Obstacle Representations for Robot and Virtual Agent Navigation (Eric Aaron and Juan Pablo Mendoza) ..... 1
Grounding Formulas with Complex Terms (Amir Aavani, Xiongnan (Newman) Wu, Eugenia Ternovska, and David Mitchell) ..... 13
Moving Object Modelling Approach for Lowering Uncertainty in Location Tracking Systems (Wegdan Abdelsalam, David Chiu, Siu-Cheung Chau, Yasser Ebrahim, and Maher Ahmed) ..... 26
Unsupervised Relation Extraction Using Dependency Trees for Automatic Generation of Multiple-Choice Questions (Naveed Afzal, Ruslan Mitkov, and Atefeh Farzindar) ..... 32
An Improved Satisfiable SAT Generator Based on Random Subgraph Isomorphism (Călin Anton) ..... 44
Utility Estimation in Large Preference Graphs Using A* Search (Henry Bediako-Asare, Scott Buffett, and Michael W. Fleming) ..... 50
A Learning Method for Developing PROAFTN Classifiers and a Comparative Study with Decision Trees (Nabil Belacel and Feras Al-Obeidat) ..... 56
Using a Heterogeneous Dataset for Emotion Analysis in Text (Soumaya Chaffar and Diana Inkpen) ..... 62
Using Semantic Information to Answer Complex Questions (Yllias Chali, Sadid A. Hasan, and Kaisar Imam) ..... 68
Automatic Semantic Web Annotation of Named Entities (Eric Charton, Michel Gagnon, and Benoit Ozell) ..... 74
Learning Dialogue POMDP Models from Data (Hamid R. Chinaei and Brahim Chaib-draa) ..... 86
Characterizing a Brain-Based Value-Function Approximator (Patrick Connor and Thomas Trappenberg) ..... 92
Answer Set Programming for Stream Reasoning (Thang M. Do, Seng W. Loke, and Fei Liu) ..... 104
A Markov Decision Process Model for Strategic Decision Making in Sailboat Racing (Daniel S. Ferguson and Pantelis Elinas) ..... 110
Exploiting Conversational Features to Detect High-Quality Blog Comments (Nicholas FitzGerald, Giuseppe Carenini, Gabriel Murray, and Shafiq Joty) ..... 122
Consolidation Using Context-Sensitive Multiple Task Learning (Ben Fowler and Daniel L. Silver) ..... 128
Extracting Relations between Diseases, Treatments, and Tests from Clinical Data (Oana Frunza and Diana Inkpen) ..... 140
Compact Features for Sentiment Analysis (Lisa Gaudette and Nathalie Japkowicz) ..... 146
Instance Selection in Semi-supervised Learning (Yuanyuan Guo, Harry Zhang, and Xiaobo Liu) ..... 158
Determining an Optimal Seismic Network Configuration Using Self-Organizing Maps (Machel Higgins, Christopher Ward, and Silvio De Angelis) ..... 170
Comparison of Learned versus Engineered Features for Classification of Mine Like Objects from Raw Sonar Images (Paul Hollesen, Warren A. Connors, and Thomas Trappenberg) ..... 174
Learning Probability Distributions over Permutations by Means of Fourier Coefficients (Ekhine Irurozki, Borja Calvo, and Jose A. Lozano) ..... 186
Correcting Different Types of Errors in Texts (Aminul Islam and Diana Inkpen) ..... 192
Simulating the Effect of Emotional Stress on Task Performance Using OCC (Dreama Jain and Ziad Kobti) ..... 204
Base Station Controlled Intelligent Clustering Routing in Wireless Sensor Networks (Yifei Jiang and Haiyi Zhang) ..... 210
Comparison of Semantic Similarity for Different Languages Using the Google n-gram Corpus and Second-Order Co-occurrence Measures (Colette Joubarne and Diana Inkpen) ..... 216
A Supervised Method of Feature Weighting for Measuring Semantic Relatedness (Alistair Kennedy and Stan Szpakowicz) ..... 222
Anomaly-Based Network Intrusion Detection Using Outlier Subspace Analysis: A Case Study (David Kershaw, Qigang Gao, and Hai Wang) ..... 234
Evaluation and Application of Scenario Based Design on Thunderbird (Bushra Khawaja and Lisa Fan) ..... 240
Improving Phenotype Name Recognition (Maryam Khordad, Robert E. Mercer, and Peter Rogan) ..... 246
Classifying Severely Imbalanced Data (William Klement, Szymon Wilk, Wojtek Michalowski, and Stan Matwin) ..... 258
Simulating Cognitive Phenomena with a Symbolic Dynamical System (Othalia Larue) ..... 265
Finding Small Backdoors in SAT Instances (Zijie Li and Peter van Beek) ..... 269
Normal Distribution Re-Weighting for Personalized Web Search (Hanze Liu and Orland Hoeber) ..... 281
Granular State Space Search (Jigang Luo and Yiyu Yao) ..... 285
Comparing Humans and Automatic Speech Recognition Systems in Recognizing Dysarthric Speech (Kinfe Tadesse Mengistu and Frank Rudzicz) ..... 291
A Context-Aware Reputation-Based Model of Trust for Open Multi-agent Environments (Ehsan Mokhtari, Zeinab Noorian, Behrouz Tork Ladani, and Mohammad Ali Nematbakhsh) ..... 301
Pazesh: A Graph-Based Approach to Increase Readability of Automatic Text Summaries (Nasrin Mostafazadeh, Seyed Abolghassem Mirroshandel, Gholamreza Ghassem-Sani, and Omid Bakhshandeh Babarsad) ..... 313
Textual and Graphical Presentation of Environmental Information (Mohamed Mouine) ..... 319
Comparing Distributional and Mirror Translation Similarities for Extracting Synonyms (Philippe Muller and Philippe Langlais) ..... 323
Generic Solution Construction in Valuation-Based Systems (Marc Pouly) ..... 335
Cross-Lingual Word Sense Disambiguation for Languages with Scarce Resources (Bahareh Sarrafzadeh, Nikolay Yakovets, Nick Cercone, and Aijun An) ..... 347
COSINE: A Vertical Group Difference Approach to Contrast Set Mining (Mondelle Simeon and Robert Hilderman) ..... 359
Hybrid Reasoning for Ontology Classification (Weihong Song, Bruce Spencer, and Weichang Du) ..... 372
Subspace Mapping of Noisy Text Documents (Axel J. Soto, Marc Strickert, Gustavo E. Vazquez, and Evangelos Milios) ..... 377
Extending AdaBoost to Iteratively Vary Its Base Classifiers (Érico N. de Souza and Stan Matwin) ..... 384
Parallelizing a Convergent Approximate Inference Method (Ming Su and Elizabeth Thompson) ..... 390
Reducing Position-Sensitive Subset Ranking to Classification (Zhengya Sun, Wei Jin, and Jue Wang) ..... 396
Intelligent Software Development Environments: Integrating Natural Language Processing with the Eclipse Platform (René Witte, Bahar Sateli, Ninus Khamis, and Juergen Rilling) ..... 408
Partial Evaluation for Planning in Multiagent Expedition (Y. Xiang and F. Hanshar) ..... 420
Author Index ..... 433
Dynamic Obstacle Representations for Robot and Virtual Agent Navigation Eric Aaron and Juan Pablo Mendoza Department of Mathematics and Computer Science Wesleyan University Middletown, CT 06459
Abstract. This paper describes a reactive navigation method for autonomous agents such as robots or actors in virtual worlds, based on novel dynamic tangent obstacle representations, resulting in exceptionally successful, geometrically sensitive navigation. The method employs three levels of abstraction, treating each obstacle entity as an obstacle-valued function; this treatment enables extraordinary flexibility without pre-computation or deliberation, applying to all obstacles regardless of shape, including non-convex, polygonal, or arc-shaped obstacles in dynamic environments. The unconventional levels of abstraction and the geometric details of dynamic tangent representations are the primary contributions of this work, supporting smooth navigation even in scenarios with curved shapes, such as circular and figure-eight shaped tracks, or in environments requiring complex, winding paths.
1 Introduction For autonomous agents such as robots or actors in virtual worlds, navigation based on potential fields or other reactive methods (e.g., [3,4,6,9,10]) can be conceptually elegant, robust, and adaptive in dynamic or incompletely known environments. In some methods, however, straightforward geometric representations can result in ineffective obstacle avoidance or other navigation difficulties. In this paper, we introduce reactive navigation intelligence based on dynamic tangent obstacle representations and repellers, which are locally sensitive to relevant obstacle geometry, enabling effective navigation in a wide range of environments. In general, reactive navigation is fast and responsive in dynamic environments, but it can be undesirably insensitive to some geometric information in complicated navigation spaces. In some potential-based or force-based approaches, for instance, a circular obstacle would be straightforwardly treated as exerting a repulsive force on agents around it, deterring collisions; as an example, Figure 1 illustrates an angular repeller form employed in [5,7,8], in which a circle-shaped obstacle obsi repels circle-shaped agent A by steering A’s heading angle away from all colliding paths. (See Section 2 for additional information on this kind of angular repeller.) Straightforwardly, the repeller representation of obstacle obsi is based on the entire shape of obsi . Such a straightforward connection between the entire shape of an obstacle entity and the repeller representation of that entity, however, is not always so successful. Some common obstacle entities, for example, have shapes inconsistent with otherwise-effective navigation methods; for C. Butz and P. Lingras (Eds.): Canadian AI 2011, LNAI 6657, pp. 1–12, 2011. c Springer-Verlag Berlin Heidelberg 2011
Fig. 1. Obstacle avoidance, with agent A, obstacle obsi , and other elements as labeled. When heading angle φ is inside the angular range delimited by the dotted lines—i.e., when some point of A is not headed outside of obsi —A is steered outside of that angular range, avoiding collision.
example, navigation methods requiring circle-based obstacle representations [2,3,5,7] can be ineffective with obstacles that have long, thin shapes, such as walls. Indeed, our work is motivated by difficulties in applications requiring navigation near or along walls in dynamic environments, such as boundary inspection [1] or navigation in hallways. For this paper, we distinguish between boundary-proximate and boundarydistant navigation: Boundary-proximate behaviors require navigation along obstacle boundaries, whereas boundary-distant behaviors require only collision avoidance, which tends to deter proximity to obstacle boundaries. Boundary-distant reactive behaviors can often be straightforwardly achieved by, e.g., potential-based navigation that employs forceful repellers and ignores details such as concavities in obstacle shapes. Boundary-proximate reactive behaviors, however, are more challenging. This paper is thus focused on boundary-proximate behaviors (although as noted in Section 5, our method supports both kinds of behavior), presenting efficient, geometrically sensitive, dynamic obstacle representations that support boundary-proximate navigation. In particular, this paper introduces dynamic tangent (DT, for short) obstacle representations as intermediaries between obstacle entities and repeller representations. Dynamic tangent-based DT navigation treats each obstacle entity as an obstacle-valued function, which returns an obstacle representation: For each agent A, at each timestep in computing navigation, each perceived obstacle entity (e.g., a wall, a polygon, another navigating agent) is represented by a dynamic tangent obstacle representation; each obstacle representation is mathematically modeled as an angular range of repulsion—or, alternatively, as the part of the obstacle entity within that angular range from which A is repelled. Hence, unlike other approaches in which only two levels of information are reflected in obstacle modeling, DT navigation employs a three-tiered structure: 1. the obstacle entity—the full geometric shape of the entity in the environment; 2. the obstacle representation—the DT representation abstracted from the obstacle entity, i.e., the locally usable geometry upon which a repeller form is based; 3. and the repeller representation—the mathematical function encoding the angular repulsion ascribed to the obstacle entity.
The additional level of abstraction in this three-level structure and the geometric details of our DT representations are the primary contributions of this paper. The mathematical functions for the repeller representations in our DT navigation are similar to those of a standard mathematical form described in [2,5,8], and based on only three arguments: the minimum distance from agent A to a nearest point pm on an obstacle entity; the difference φ−ψ between heading angle φ of A and angle ψ from A to pm ; and an angular range of repulsion Δψ, which controls the spectrum of heading angles from which the obstacle will repel the agent. The resulting DT representations are effective and efficient, and they satisfy desirable properties of obstacle representations—such as proximity, symmetry, and necessary and sufficient repulsion—that underlie successful navigation in other methods (see Section 4). Indeed, as detailed in Section 3, DT representations enable exceptionally effective, geometrically sensitive navigation without pre-computation or deliberation. Even in simulations of non-holonomic agents (e.g., robots) moving at constant speed, DT representations result in smooth, successful navigation in scenarios with complicated paths, moving walls, or curved shapes such as circular or figure-eight shaped tracks.
2 Repellers The repellers underlying our DT representations are based on the same mathematics as in [5,7], although our repellers are capable of expressing a wider range of configurations. In this section, we describe these repellers, summarizing previous presentations of the underlying mathematics and emphasizing particulars of our design. In broad terms, DT navigation is an application of dynamical systems-based behavior modeling [8]. For this paper, and consistent with related papers [5,7], agents are circular and agent velocity is constant throughout navigation—obstacle avoidance and target seeking arise only from the dynamical systems governing heading angle φ. As noted in [2] and elsewhere, velocity could be autonomously controlled, but our present method shows the effectiveness of DT representations even without velocity control. The repellers themselves in DT navigation are angular repeller functions, dynamically associated with obstacles based on local perception, without pre-computation. Most of the mathematical ideas underlying these functions remain unchanged from [5], thus retaining strengths such as competitive behavioral dynamics and some resistance to problems such as local minima. (See [5,7,8] for details about strengths of this particular repeller design.) Conventionally, with these repellers, every obstacle entity is represented using only circles, and circles’ radii determine the associated repeller representations; in the underlying mathematics [5], however, repellers do not actually depend on circles’ radii, but only on angular ranges of repulsion. Previous presentations do not emphasize that these angular ranges need not be derived from radii, nor that requiring a circle (thus a radius r ≥ 0) for the repellers can restrict the expressiveness of obstacle representations: A point obstacle at the intersection of the dotted lines in Figure 1, for instance, would have the same angular range of repulsion as obstacle obsi ; because no entity could be smaller than a point, no smaller range could occur from an obstacle
at that location. Smaller ranges, however, could be productively employed by variants of these previous techniques, for more flexible and sensitive navigation dynamics. In this section, we describe our repellers, which are based on angular ranges, and in Section 3, we describe the particular angular ranges calculated for DT navigation. To briefly summarize work from [7] and other papers, the evolution of agent heading angle φ during navigation is determined by angular repellers and angular attractors in a dynamical system of the form

φ̇ = |w_tar| f_tar + |w_obs| f_obs + noise,    (1)
where φ̇ is the time derivative of φ (by convention, dotted variables are time derivatives), f_tar and f_obs are functions representing targets and obstacles, respectively (the contributions of attractors and repellers to agent steering), and w_tar and w_obs are weight functions for each term. (The noise term prevents undesired fixed points in the dynamics.) Thus, at each timestep, φ̇ is calculated for steering at that moment, naturally and reactively accommodating new percepts or changes in the environment. A target is represented by a simple sine function:

f_tar = −a sin(φ − ψ_tar).    (2)
This function induces a clockwise change in heading direction when φ is counterclockwise with respect to the target (and symmetrically, counter-clockwise attraction when φ is clockwise), thus attracting an agent. Obstacle functions are more complicated, encoding windowed repulsion scaled by distance, so that more distant repellers have weaker effects than closer ones, and repellers do not affect agents already on collision-free paths. (Full details are in [7], which we only concisely summarize here.) For an obstacle obs_i, the angular repeller in DT navigation is the product of three functions:

R_i = ((φ − ψ_i) / Δψ_i) · e^(1 − |(φ − ψ_i) / Δψ_i|)    (3)

W_i = (tanh(h_1 (cos(φ − ψ_i) − cos(Δψ_i + σ))) + 1) / 2    (4)

D_i = e^(−d_m / d_0)    (5)

Function R_i is an angular repeller with angular width Δψ_i, centered around heading-angle value ψ_i (see Figure 1); windowing function W_i limits repulsion to have significant effects only within Δψ_i (plus a safety margin σ) of ψ_i; and scaling function D_i limits the strength of the overall repulsion based on d_m, the minimum distance between the agent and the obstacle. (Designer-chosen constant d_0 is a scaling parameter for D_i.) Each repeller, then, is a product f_obs_i = R_i · W_i · D_i, and for navigation, contributions of individual repellers are summed to f_obs = Σ_i f_obs_i and then combined with f_tar in the weighted sum of Equation 1 to control steering. The weights themselves are determined by a system of competitive dynamics for reactive behavior selection; full details are available in [5].
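For concreteness, the following Python sketch assembles Equations (1)-(5) into a heading-angle update. It is only an illustration of the published formulas: the constants a, h1, sigma and d0 are placeholder values, the weights w_tar and w_obs are held fixed here (the paper obtains them from competitive dynamics), and angle_diff is a hypothetical helper for wrapping angle differences.

```python
import math

def angle_diff(a, b):
    """Wrap a - b into (-pi, pi] (hypothetical helper, not from the paper)."""
    d = (a - b + math.pi) % (2 * math.pi) - math.pi
    return d if d != -math.pi else math.pi

def f_target(phi, psi_tar, a=1.0):
    # Equation (2): attractor pulling heading phi toward the target angle psi_tar.
    return -a * math.sin(phi - psi_tar)

def f_obstacle(phi, psi_i, delta_psi_i, d_m, h1=4.0, sigma=0.4, d0=2.0):
    # Equations (3)-(5): repeller R_i, windowing W_i, and distance scaling D_i.
    u = angle_diff(phi, psi_i) / delta_psi_i
    R = u * math.exp(1.0 - abs(u))
    W = (math.tanh(h1 * (math.cos(phi - psi_i) - math.cos(delta_psi_i + sigma))) + 1.0) / 2.0
    D = math.exp(-d_m / d0)
    return R * W * D

def phi_dot(phi, psi_tar, obstacles, w_tar=1.0, w_obs=1.0, noise=0.0):
    # Equation (1) with fixed weights; each obstacle is (psi_i, delta_psi_i, d_m).
    f_obs = sum(f_obstacle(phi, psi, dpsi, dm) for (psi, dpsi, dm) in obstacles)
    return abs(w_tar) * f_target(phi, psi_tar) + abs(w_obs) * f_obs + noise
```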
Fig. 2. Default, non-boundary case DT representations for various obstacle entity shapes. In all cases, 2Δψ is chosen to be the angular range subtended by a segment of length 2D constructed around point pm , perpendicular to vector vm .
3 Dynamic Tangent Obstacle Representations

Because of the additional level of abstraction for DT representations, the dynamic sensitivity of the repellers in Section 2 is enhanced by increased flexibility and local sensitivity to geometry, without deliberation or pre-computation. DT representations are constructed from locally relevant portions of obstacle entities' shapes, which for this paper are presumed to be always either straight lines or arcs of circles. Based on this, we consider three possible cases for any relevant component shape of an obstacle: straight line, convex non-line, or concave (i.e., a boundary of a linear, convex, or concave portion of the obstacle). Processes for DT construction are similar in each case, so we here present details of geometry applicable to all cases, with case-specific details in the following subsections.

In any of the cases, given agent A and obstacle entity E, the DT representation of E can be seen as the portion of E within a reactively calculated angular range of repulsion for E; in Figures (e.g., Figure 2), we conventionally indicate such a portion by thicker, lighter colored boundary lines on E. For this paper, the angular range (from A) defining that portion of E is the subtended range of a line segment, as shown in Figure 2; in general, as Figure 2 also suggests, this DT segment is locally tangent to E at a projection point pm of minimal distance between A and E. More specifically, the segment is oriented perpendicular to the vector vm that joins the center of A to pm, and it is centered at pm, extending a distance D in each direction, where D is an agent- or application-specific parameter. Parameter D thus determines the angular range Δψ: the DT segment represents an angular repeller of range 2Δψ, with range Δψ in each direction around vm (see Figure 2). In examples in this paper, D is constant over a navigation (rather than, e.g., Δψ being constant over a navigation), so Δψ relates to |vm| in desirable ways: For example, as A gets closer to E, the subtended range widens, resulting in greater repulsion. In the default (i.e., non-boundary) conditions in each of the three cases, the angular range 2Δψ subtended by the segment of length 2D around pm can be found from the elements labeled in Figure 2:

Δψ = β1 + β2 = sin⁻¹(rA / s) + sin⁻¹(D / s)    (6)
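A direct translation of Equation (6), assuming plain (x, y) coordinates. The identification s = sqrt(|vm|² + D²), i.e., the distance from the agent centre cA to an endpoint of the DT segment, is our reading of Figure 2 rather than something stated in the text, and the min(1.0, ·) clamps are defensive additions.

```python
import math

def dt_half_angle(c_a, p_m, r_a, d):
    """Equation (6): half the angular range of a default-case DT repeller.

    c_a, p_m are (x, y) tuples; r_a is the agent radius; d is the half-length
    of the DT segment centered at p_m and perpendicular to v_m.
    """
    v_m = (p_m[0] - c_a[0], p_m[1] - c_a[1])
    dist = math.hypot(*v_m)        # |v_m|
    s = math.hypot(dist, d)        # distance from c_a to a segment endpoint (assumed)
    return math.asin(min(1.0, r_a / s)) + math.asin(min(1.0, d / s))
```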
Fig. 3. Boundary case DT construction for a straight line obstacle entity. In the direction agent A is heading with respect to vector vm , the entity’s subtended angular range is smaller than a standard DT representation’s subtended angular range, so the range of the DT representation is modified to prevent repulsion from non-colliding paths.
As non-default, boundary cases, we consider instances where the angular range of the resulting repeller is wider than the angular range subtended by the original obstacle entity E, as shown in Figures 3–5. In these cases, to prevent needless repulsion from collision-free headings, DT representations have angular range exactly equal to that of E in the direction the agent is headed with respect to vm; the computations underlying these representations depend on the shape of E, as described below.

3.1 Straight Lines

When the obstacle entity E is a wall or some other straight line shape, finding pm for a DT representation is straightforward: If there is a perpendicular from the center of agent A to E, then pm is the intersection of E and the perpendicular; otherwise, pm is the closest endpoint of E to A. In default cases, construction of the DT segment (a portion of E) and the resulting repeller is also straightforward. In boundary cases, it is necessary to find the subtended range of E with respect to A in the direction that A is heading around E. For this, the process is the same whether vm ⊥ E (Figure 3a) or not (Figure 3b). First, find endpoint eA of E in the direction that A is headed. The appropriate angular range for repulsion is then given by:

Δψ = sin⁻¹(rA / |vA|) + cos⁻¹((|vm|² + |vA|² − |eA − pm|²) / (2 |vm| |vA|))    (7)

3.2 Convex Shapes

For a convex chain of straight lines, finding pm is again straightforward for agent A, and the desired range Δψ can be computed similarly to the process in Subsection 3.1, using vertices for boundary cases (see Figure 4a). For a convex circular arc (Figure 4b), however, the subtended angular range is not necessarily defined by its endpoints. Consider such an arc to be defined by the radius ro and center-point co of its defining circle Co, as perceived by A, and by the visible angular range of Co included in the arc, defined by angles θi and θf with respect to co and the positive x axis. Then, it is again straightforward to find the closest point pm to A, and in default cases, the DT segment and associated angular range follow immediately.
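Before turning to boundary cases for arcs, here is a sketch of the Section 3.1 computations for a straight wall, assuming the entity is given by its two endpoints as (x, y) tuples: the first function finds pm as described above (perpendicular foot if it exists, closest endpoint otherwise), and the second evaluates Equation (7) for the boundary case. The clamping of the inverse-trigonometric arguments is a defensive addition, and math.dist requires Python 3.8+.

```python
import math

def closest_point_on_segment(c_a, e1, e2):
    # p_m for a straight-line entity: project c_a onto the segment e1-e2,
    # clamping to the nearer endpoint when the perpendicular foot lies outside E.
    ex, ey = e2[0] - e1[0], e2[1] - e1[1]
    length2 = ex * ex + ey * ey
    if length2 == 0.0:
        return e1
    t = ((c_a[0] - e1[0]) * ex + (c_a[1] - e1[1]) * ey) / length2
    t = max(0.0, min(1.0, t))
    return (e1[0] + t * ex, e1[1] + t * ey)

def boundary_half_angle(c_a, p_m, e_a, r_a):
    # Equation (7): half-range limited by the remaining wall endpoint e_A.
    v_m = math.dist(c_a, p_m)      # |v_m|
    v_a = math.dist(c_a, e_a)      # |v_A|
    e_p = math.dist(e_a, p_m)      # |e_A - p_m|
    cos_arg = (v_m ** 2 + v_a ** 2 - e_p ** 2) / (2.0 * v_m * v_a)
    return math.asin(min(1.0, r_a / v_a)) + math.acos(max(-1.0, min(1.0, cos_arg)))
```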
Fig. 4. Boundary case DT construction for convex shapes. For convex circular arcs (b), it must be determined if the subtended angle Δα between agent A and the entire circle Co (of which the arc is a portion) is the angle Δψ from which the DT representation is derived.
In boundary cases, it remains to find the angle subtended by the arc, to compare to the angle subtended by the DT segment. To do this, we find endpoint eA similarly to the straight line case, and we observe that the desired subtended angular range is limited by either eA or by the point called ρ in Figure 4b, which defines (one side of) the subtended angular range 2Δα between A and the entire circle Co. The remainder of the DT construction then follows as before, with the angular range of repulsion determined by either parameter D or the appropriate boundary condition described here.

3.3 Concave Shapes

Unlike the convex case, concave shapes bring safety concerns: Given non-holonomic agents with constant forward velocity, some environments cannot be navigated safely, such as a corner or box requiring sharper turns than motion constraints allow agents to make. In DT navigation, we prevent such difficulties by automatically approximating each concave corner by an arc with a radius large enough to be safely navigable: first, using properties of agent velocity and geometry, we calculate the agent's minimum radius for safe turns, rmin; then, when computing navigation, an unsafe corner is effectively modeled by a navigable arc, as in Figure 5a. (We also presume that all arcs in the environment have radius at least rmin, although tighter arcs could similarly be modeled by larger, navigable ones.) To approximate only the minimum amount necessary, DT representations treat the corner as if an arc of radius rmin were placed tangent to the lines forming the corner, as in Figure 5a; after finding half of the angle formed by the corner, θc, the distance dc from the corner at which the arc should begin is dc = rmin / tan(θc). It is straightforward to find point pm and manage boundary cases with endpoints of the chain, thus completing the definition of the representation.

For concave arcs, procedures are similar but complementary to those for convex arcs. If A is not located between co (the center of the circle from which the arc is derived) and the arc itself, the closest point pm to agent A is an endpoint of the arc; otherwise, pm is the point on the arc in the same direction from co as cA. Endpoint eA is found by following the arc clockwise from pm if the agent heading direction is clockwise from pm, and counter-clockwise otherwise (see Figure 5b). For boundary cases, the subtended angular range of a concave arc is always determined by endpoint eA, and can thus be found similarly to previous cases.
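The corner-approximation step above reduces to two small formulas, sketched below. This is only illustrative: the relation rmin = v / (maximum heading-angle rate) is our assumption about how the minimum safe turning radius would be obtained from properties of agent velocity, which the paper does not spell out; the dc computation follows directly from the tangency construction in Figure 5a.

```python
import math

def min_turn_radius(speed, max_turn_rate):
    # r_min for a constant-speed, heading-controlled agent.
    # The specific relation r_min = v / phi_dot_max is an assumption, not from the paper.
    return speed / max_turn_rate

def corner_arc_offset(theta_c, r_min):
    # d_c = r_min / tan(theta_c): distance from the corner point at which the
    # approximating arc of radius r_min, tangent to both corner edges, begins.
    return r_min / math.tan(theta_c)
```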
Fig. 5. Concave corner and arc shapes, with labeled features pertinent to DT computation: (a) approximating a corner by an arc; (b) finding a subtended angle
4 Properties of Obstacle Representations As part of designing DT representations, we identified several desirable properties that reactive obstacle representations could possess, properties that clearly motivated previous work such as [5,7]. As part of evaluating DT representations, we present our list of these properties and very briefly describe how DT representations satisfy them. Proximity and Symmetry. The focal point for computing repulsion is a point pm on obstacle entity E of minimal distance from agent A. Thus, the nearest point on E to A—with which A might in principle collide soonest—is also the nearest point of the obstacle representation of E to A, enabling appropriate distance-based effects of repulsion. Furthermore, repulsion is centered around pm and associated vector vm (see Figures 1 and 2), so the reactive, local obstacle representation aptly does not determine in which general direction A heads around E—the representation steers A around E in the direction A was already heading, symmetrically around vm , regardless of the heading of A. Necessary and sufficient repulsion. The repulsive range of the obstacle representation corresponds to exactly the heading angles along which the agent would collide with the obstacle entity. Reactivity. Obstacle avoidance dynamically applies to both stationary and moving obstacles, without pre-computation or non-local knowledge. Mathematical parsimony. Each obstacle entity is represented by a single repeller, neither overloading agent computations nor requiring needless mathematical machinations. This enables straightforward utility in a range of scenarios. Previous obstacle representations that satisfy these properties were effective only in substantially restricted environments, i.e., consisting of only circular obstacles (see, e.g., Figure 1). Our DT representations are far more flexible, also applying to noncircular obstacles, and they directly satisfy properties of reactivity, mathematical parsimony, proximity, and symmetry, as well as a relaxed sense of necessary and sufficient repulsion: Because DT-based repellers are bounded by the maximum angular range of the obstacle entity remaining in the direction A is heading (toward endpoint eA ), no collision-free paths in that direction are repelled. Furthermore, due to proximity and
symmetry properties, each time repulsion is calculated, any possible collision point would be at or very near pm . Thus, for large enough D with respect to agent size and navigation calculations, any colliding path would be within the range of repulsion of a DT-based repeller, and no collision-free path would ever be in that range.
5 Demonstrations To test DT navigation, we created a simple, OpenGL-based simulator and simulated navigation in several scenarios. For each agent A, obstacle boundaries were locally perceived, and perception was straightforwardly implemented so that portions of obstacles occluded by other obstacles were not perceived, but other entities were perceived with unlimited range in all directions. Default parameter values for the repellers of Section 2 and [5,7] were d0 = 2.0, σ = 0.4, and D = 4rA unless otherwise noted, and navigation was in a 12 by 12 unit world, with velocity held constant at 0.3, isolating heading angle φ as the only behavioral variable governing navigation, as in [5,7]. Tests were of two general kinds: basic testing to calibrate the value of D and establish general DT effectiveness in common scenarios; and testing in complex environments. 5.1 Basic Testing and Calibration The default value D = 4rA in our demonstrations was chosen after experimentally determining the effect of D on navigation. In general, greater values of D lead to repellers with greater angular ranges of repulsion, but for an obstacle entity that subtends a large angular range, it is not always desirable for a repeller to subtend that entire range. For example, such large repellers can preclude the boundary-proximate navigation (Figure 6) on which this paper is focused, as discussed in Section 1. Boundary-distant navigation, in contrast, can be supported by such large repellers, but boundary-distant navigation can also be readily supported by appropriate DT representations (Figure 7). The local sensitivity enabled by DT representations, however, is not fully exploited in boundary-distant applications. For finer-tuned, boundary-proximate navigation, we first calibrated D for appropriate sensitivity in DT representations. To do this, we ran experiments with a single agent A navigating along a wall, which indicated that distance dm of agent A from the wall systematically varied with the value of D. We also considered a thought experiment— i.e., among many differences between an elephant and an ant, they maintain different safety margins when walking along a wall—and thus selected an agent size-dependent value of D = 4rA , where rA is the radius of agent A; this results in a dm of between one and two radii for agents, which seems safe but not excessive. We then tested DT navigation in the basic scenarios shown in Figure 6, each with a convex or concave obstacle. In each scenario, agents started at 100 randomly selected positions spanning the left sides of their worlds, and DT navigation achieved perfect performance: Every agent reached its target without collision. 5.2 Complicated Environments We also tested agents in more complicated environments, as shown in Figure 9. In the Hallways scenario, approximating an indoor layout with 3 × 2-sized office-obstacles
Fig. 6. Basic scenarios for demonstrations of purely reactive boundary-proximate navigation: (a) Octagon; (b) Convex Arc; (c) Concave Corner; (d) Concave Arc. Each image contains an obstacle entity, a target (green circle), and a sample trajectory.

Fig. 7. Demonstrations of DT-based boundary-distant navigation in the (a) Convex Arc and (b) Concave Corner scenarios. Agents started out facing the target but turned quickly, taking a smooth, efficient path to the target.

Fig. 8. Two different-sized agents, sizes rA = 0.1 and 0.3, reaching parallel paths along a wall, each from a setting of D = 4rA. Figures show the target locations (green circles), trajectories, and DT representations of the wall for each.
(the inner rectangles) and hallway width roughly 10-to-20 times rA , agents navigated from 100 starting positions in the left of their world to a sequence of five target locations (Figure 9a), requiring extensive navigation and turning. Purely reactive DT navigation performance was perfect in all tested variants of this hallway scenario, including versions with additional circular obstacles, stationary or moving, in hallways. The Polygons scenario (Figure 9b) incorporates navigation around a moving wall, which rotates in the center of the space, and a variety of convex polygons. In these experiments, agents navigated to five target locations (similar to those in the Hallways scenario), requiring a full traversal of the horizontal space; because of the additional difficulty posed by this scenario, the values of d0 and σ were raised to 2.25 and 0.6, for repulsion at greater distances. Tests of DT navigation showed very good performance: Of 100 agents tested, starting from positions spanning the left side and top of this environment, 99 reached all targets without colliding. (Avoiding the moving wall proved difficult, perhaps due to the restriction to constant velocity.)
Fig. 9. Different scenarios in which DT navigation was tested, showing target locations and an example agent trajectory in each: (a) Hallways; (b) Polygons; (c) Circle Track; (d) Figure-Eight; (e) Winding Path
Fig. 10. A race-like run in the Circle Track scenario, including target locations and a trajectory of a fast agent that steered around slower agents
The remaining tests in complex environments were performed in scenarios with curved shapes (Figure 9c–e): a Circle Track; a Figure-Eight; and a Winding Path. The Winding Path scenario illustrates how even purely reactive DT navigation can succeed along a very complicated path, and the Figure-Eight scenario shows successful navigation in a closely bounded, curved environment. In the Figure-Eight and Circle Track scenarios, targets were alternatingly placed on the top and bottom of the tracks (indicated in Figure 9), to keep agents looping around the tracks. Race-like demonstrations were also run on the Circle Track (Figure 10), with up to four agents at different speeds, all successfully avoiding each other and the boundaries of the track while running.
6 Conclusion This paper presents a new, dynamic tangent-based navigation method, which treats obstacle entities as obstacle-valued functions: Each agent represents each obstacle as an angular repeller, dynamically adjusted during navigation to support successful performance. The obstacle representation level of abstraction enables enhanced geometric sensitivity while retaining desired properties of obstacle representations. Simulations demonstrate that DT navigation is successful even in applications where agents must navigate closely around obstacle shapes and scenarios with a moving wall or complicated environments requiring circular or winding paths. DT representations might also
be effective in a wider range of environments if based on context-dependent variations in the value of D or with learning-based adaptations; the fact that DT representations require so few parameters may facilitate developmental or learning-based approaches. Acknowledgments. The authors thank Clare Bates Congdon and anonymous referees for comments on previous versions of this paper.
References
1. Easton, K., Burdick, J.: A coverage algorithm for multi-robot boundary inspection. In: Int. Conf. Robotics and Automation, pp. 727–734 (2005)
2. Goldenstein, S., Karavelas, M., Metaxas, D., Guibas, L., Aaron, E., Goswami, A.: Scalable nonlinear dynamical systems for agent steering and crowd simulation. Computers and Graphics 25(6), 983–998 (2001)
3. Huang, W., Fajen, B., Fink, J., Warren, W.: Visual navigation and obstacle avoidance using a steering potential function. Robotics and Autonomous Systems 54(4), 288–299 (2006)
4. Khatib, O.: Real-time obstacle avoidance for manipulators and mobile robots. Int. Journal of Robotics Research 5(1), 90–98 (1986)
5. Large, E., Christensen, H., Bajcsy, R.: Scaling the dynamic approach to path planning and control: Competition among behavioral constraints. International Journal of Robotics Research 18(1), 37–58 (1999)
6. Paris, S., Pettré, J., Donikian, S.: Pedestrian reactive navigation for crowd simulation: A predictive approach. Computer Graphics Forum 26(3) (2007)
7. Schöner, G., Dose, M.: A dynamical systems approach to task-level system integration used to plan and control autonomous vehicle motion. Robotics and Autonomous Systems 10(4), 253–267 (1992)
8. Schöner, G., Dose, M., Engels, C.: Dynamics of behavior: Theory and applications for autonomous robot architectures. Robotics and Autonomous Systems 16(2-4), 213–245 (1995)
9. Shao, W., Terzopoulos, D.: Autonomous pedestrians. Graphical Models 69(5-6), 246–274 (2007)
10. Treuille, A., Cooper, S., Popović, Z.: Continuum crowds. ACM Trans. on Graphics 25(3), 1160–1168 (2006)
Grounding Formulas with Complex Terms Amir Aavani, Xiongnan (Newman) Wu, Eugenia Ternovska, and David Mitchell Simon Fraser University {aaa78,xwa33,ter,mitchell}@sfu.ca
Abstract. Given a finite domain, grounding is the process of creating a variable-free first-order formula equivalent to a first-order sentence. Since first-order sentences can be used to describe combinatorial search problems, efficient grounding algorithms help solve such problems effectively and make advanced solver technology (such as SAT) accessible to a wider variety of users. One promising method for grounding is based on the relational algebra from the field of database research. In this paper, we describe the extension of this method to ground formulas of first-order logic extended with arithmetic, expansion functions and aggregate operators. Our method also allows particular CNF representations of complex constraints to be chosen easily.
1 Introduction

An important direction of work in constraint-based methods is the development of declarative languages for specifying or modelling combinatorial search problems. These languages provide users with a notation in which to give a high-level specification of a problem (see, e.g., ESSENCE [1]). By reducing the need for specialized constraint programming knowledge, these languages make the technology accessible to a wider variety of users. In our group, a logic-based framework for specification/modelling languages was proposed [2]. We undertake a research program of both theoretical development and demonstrating practical feasibility through system development. Our tools are based on grounding, which is the task of taking a problem specification, together with an instance, and producing a variable-free first-order formula representing the solutions to the instance (by instance we always understand an instance of a search problem; e.g., a graph is an instance of 3-colourability). Here, we consider grounding to propositional logic, with the aim of using propositional satisfiability (SAT) solvers as the problem-solving engine. Note that SAT is just one possibility: a similar process can be used for grounding from a high-level language to, e.g., CPLEX, various Satisfiability Modulo Theories (SMT) solvers, and ground constraint solvers such as MINION [3]. An important advantage of solving through grounding is that the speed of ground solvers improves all the time, and we can always use the best and latest solver available.

Grounding a first-order formula over a given finite domain A may be done simply by replacing ∀x φ(x) with ∧_{a∈A} φ(x)[x/ã], and ∃x φ(x) with ∨_{a∈A} φ(x)[x/ã], where ã is a new constant symbol denoting domain element a and φ(x)[x/ã] denotes substituting ã for every occurrence of x in φ. In practice, though, effective grounding is not easy: naive methods are too slow, and produce groundings that are too large and contain many redundant clauses.

Patterson et al. defined a basic grounding method for function-free first-order logic (FO) in [4,5], and a prototype implementation is described in [5].
Expressing most interesting real-world problems, e.g., the Traveling Salesman problem or the Knapsack problem, with a function-free FO formula and without access to arithmetical operators is not an easy task, so enriching the syntax with functions and arithmetical operators is a necessity. We describe how we have extended the existing grounding algorithm so that it can handle these constructs. It is important to notice that the model expansion problem [5] is very different from query evaluation. In the model expansion setting, there are formulas and subformulas which cannot be evaluated, while in query processing every formula can be evaluated as either true or false. Over finite domains, first-order model expansion allows one to describe NP-complete problems, while the query processing problem for FO over finite domains is solvable in polynomial time. In this paper, we are interested in solving the model expansion problem.

An important element in the practice of SAT solving is the choice, when designing reductions, of "good" encodings of complex constraints into propositional logic. We describe our method for grounding formulas containing aggregate operations in terms of "gadgets" which determine the actual encoding. The choice of the particular gadget can be under user control, or even made automatically at run-time based on formula and instance properties. Even within one specification, different occurrences of the same aggregate may be grounded differently, and this may vary from instance to instance. With well-designed heuristics for such choices (possibly obtained by machine learning methods), we may be able to produce groundings that are more effective in practice than those a human could design by hand, except through an exceedingly labour-intensive process.

Our main contributions are:
1. We present an algorithm which can be used to ground specifications containing different kinds of terms, e.g., aggregates, expansion/instance functions, and arithmetic.
2. We enrich our language with aggregates, functions and arithmetical expressions, and we design and develop an engine which can convert these constructs to pure SAT instances as well as to instances for solvers able to handle more complex constraints, such as cardinality or pseudo-Boolean constraints.
3. We define the notion of an answer to terms and modify the previous grounding algorithm to work with this new concept.
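To make the naive grounding operation described at the start of this introduction concrete, here is a minimal Python sketch. The tuple-based formula representation, the function names, and the restriction to ∀, ∃, ∧, ∨ and ¬ with distinct variable names are illustrative assumptions of ours; the actual grounder is relational-algebra based, as described in the following sections.

```python
# Formulas as nested tuples, e.g. ('forall', 'x', ('exists', 'y', ('E', 'x', 'y'))).

def ground(formula, domain):
    """Replace quantifiers by finite conjunctions/disjunctions over `domain`."""
    op = formula[0]
    if op == 'forall':
        _, var, body = formula
        return ('and',) + tuple(ground(substitute(body, var, a), domain) for a in domain)
    if op == 'exists':
        _, var, body = formula
        return ('or',) + tuple(ground(substitute(body, var, a), domain) for a in domain)
    if op in ('and', 'or', 'not'):
        return (op,) + tuple(ground(sub, domain) for sub in formula[1:])
    return formula  # atom: predicate symbol applied to terms

def substitute(formula, var, value):
    # Replace every occurrence of variable `var` by the domain constant `value`.
    if isinstance(formula, tuple):
        return tuple(substitute(sub, var, value) for sub in formula)
    return value if formula == var else formula
```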
2 Background

We formalize combinatorial search problems in terms of the logical problem of model expansion (MX), defined here for an arbitrary logic L.

Definition 1 (MX). Given an L-sentence φ over the union of disjoint vocabularies σ and ε, and a finite structure A for vocabulary σ, find a structure B that is an expansion of A to σ ∪ ε such that B |= φ.

In this paper, φ is a problem specification formula, A always denotes a finite σ-structure called the instance structure, σ is the instance vocabulary, ε is the expansion vocabulary, and L is FO logic extended with arithmetic and aggregate operators.

Example 1. Consider the following variation of the knapsack problem. We are given a set of items (loads), L = {l1, ..., ln}, and the weight of each item is specified by an instance function W which maps items to integers (wi = W(li)). We want to check whether there is a way to put these n items into m knapsacks, K = {k1, ..., km}, while satisfying the following constraints:
– Certain items should be placed into certain knapsacks. These pairs are specified using the instance predicate P.
– h of these m knapsacks have high capacity: each of them can carry a total load of HCap, while the capacity of the rest of the knapsacks is LCap.
– We also do not want to put two items whose weights are very different in the same bag, i.e., the difference between the weights of the items in the same bag should be less than Wl.

Each of HCap, LCap and Wl is an instance function with arity zero, i.e., a given constant. The following formula φ in first-order logic is a specification for this problem:

{A1: ∀l ∃k : Q(l, k)} ∧
{A2: ∀l ∀k1 ∀k2 : (Q(l, k1) ∧ Q(l, k2)) ⊃ k1 = k2} ∧
{A3: ∀l, k : P(l, k) ⊃ Q(l, k)} ∧
{A4: ∀k : Σ_{l:Q(l,k)} W(l) ≤ HCap} ∧
{A5: COUNT_k { Σ_{l:Q(l,k)} W(l) ≥ LCap } ≤ h} ∧
{A6: ∀k, l1, l2 : (Q(l1, k) ∧ Q(l2, k)) ⊃ (W(l1) − W(l2) ≤ Wl)}
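The following sketch shows what an instance of Example 1 and a brute-force model expansion check might look like in Python. All data values, and the idea of enumerating interpretations of Q directly, are our own illustration of the semantics of A1-A6 and are not part of the grounding method itself.

```python
from itertools import product

# A toy instance (invented values, for illustration only).
L = ['l1', 'l2', 'l3']                  # items
K = ['k1', 'k2']                        # knapsacks
W = {'l1': 4, 'l2': 5, 'l3': 9}         # item weights
P = {('l1', 'k1')}                      # forced placements
HCap, LCap, Wl, h = 15, 10, 6, 1        # capacities, weight spread, # high-capacity bags

def satisfies(Q):
    """Q is a set of (item, knapsack) pairs interpreting the expansion predicate."""
    a1 = all(any((l, k) in Q for k in K) for l in L)                     # A1
    a2 = all(sum((l, k) in Q for k in K) <= 1 for l in L)                # A2
    a3 = P <= Q                                                          # A3
    load = {k: sum(W[l] for l in L if (l, k) in Q) for k in K}
    a4 = all(load[k] <= HCap for k in K)                                 # A4
    a5 = sum(load[k] >= LCap for k in K) <= h                            # A5 (COUNT)
    a6 = all(W[l1] - W[l2] <= Wl
             for k in K for l1 in L for l2 in L
             if (l1, k) in Q and (l2, k) in Q)                           # A6
    return a1 and a2 and a3 and a4 and a5 and a6

# Model expansion by brute force: search over all interpretations of Q.
solutions = [Q for bits in product([0, 1], repeat=len(L) * len(K))
             for Q in [{p for p, b in zip(product(L, K), bits) if b}]
             if satisfies(Q)]
```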
An instance is a structure for vocabulary σ = {P, W, Wl, HCap, LCap}, i.e., a list of pairs, a function which maps items to integers, and three constant integers. The task is to find an expansion B of A that satisfies φ:

(L ∪ K; P^A, W^A, Wl^A, HCap^A, LCap^A, Q^B) |= φ.
The interpretation of the expansion vocabulary ε = {Q}, for a structure B that satisfies φ, is a mapping from items to knapsacks that satisfies the problem properties.
The grounding task is to produce a ground formula ψ = Gnd(φ, A) such that the models of ψ correspond to the solutions for instance A. Formally, to ground we bring domain elements into the syntax by expanding the vocabulary with a new constant symbol for each element of the domain. For domain A, the domain of structure A, we denote the set of such constants by Ã. In practice, the ground formula should contain no occurrences of the instance vocabulary, in which case we call it reduced.
Definition 2 (Reduced Grounding for MX). Formula ψ is a reduced grounding of formula φ over σ-structure A = (A; σ^A) if
1. ψ is a ground formula over ε ∪ Ã, and
2. for every expansion structure B = (A; σ^A, ε^B) over σ ∪ ε, B |= φ iff (B, Ã^B) |= ψ,
where Ã^B is the standard interpretation of the new constants Ã.
Proposition 1. Let ψ be a reduced grounding of φ over σ-structure A. Then A can be expanded to a model of φ iff ψ is satisfiable.
A reduced grounding with respect to a given structure A can be obtained by an algorithm that, for each fixed FO formula, runs in time polynomial in the size of A. Such a grounding algorithm implements a polytime reduction to SAT for each NP search problem. Simple grounding algorithms, however, do not reliably produce groundings for large instances of interesting problems fast enough in practice.
Grounding for MX is a generalization of query answering. Given a structure (database) A, a Boolean query is a formula φ over the vocabulary of A, and query answering is equivalent to evaluating whether φ is true, i.e., A |= φ. For model expansion, φ has some
additional vocabulary beyond that of A, and producing a reduced grounding involves evaluating out the instance vocabulary and producing a ground formula representing the possible expansions of A for which φ is true.
The grounding algorithms in this paper construct a grounding by a bottom-up process that parallels database query evaluation, based on an extension of the relational algebra. For each sub-formula φ(x̄) with free variables x̄, we call the set of reduced groundings for φ under all possible ground instantiations of x̄ an answer to φ(x̄). We represent answers with tables on which an extended algebra operates. An X-relation is a k-ary relation associated with a k-tuple of variables X, representing a set of instantiations of the variables of X. It is a central notion in databases. In extended X-relations, introduced in [4], each tuple γ is associated with a formula ψ. For convenience, we use ⊤ and ⊥ as propositional formulas which are always true and always false, respectively.
Definition 3 (extended X-relation; function δR). Let A be a domain, and X a tuple of variables with |X| = k. An extended X-relation R over A is a set of pairs (γ, ψ) s.t.
1. γ : X → A, and
2. ψ is a formula, and
3. if (γ, ψ) ∈ R and (γ, ψ′) ∈ R then ψ = ψ′.
The function δR represented by R is a mapping from k-tuples γ of elements of the domain A to formulas, defined by:
δR(γ) = ψ if (γ, ψ) ∈ R, and δR(γ) = ⊥ if there is no pair (γ, ψ) ∈ R.
For brevity, we sometimes write γ ∈ R to mean that there exists ψ such that (γ, ψ) ∈ R. We also sometimes call extended X-relations simply tables. To refer to X-relations for some concrete set X of variables, rather than in general, we write X-relation.
Definition 4 (answer to φ wrt A). Let φ be a formula in σ ∪ ε with free variables X, A a σ-structure with domain A, and R an extended X-relation over A. We say R is an answer to φ wrt A if for any γ : X → A, δR(γ) is a reduced grounding of φ[γ] over A. Here, φ[γ] denotes the result of instantiating the free variables in φ according to γ.
Since a sentence has no free variables, the answer to a sentence φ is a zero-ary extended X-relation containing a single pair (⟨⟩, ψ), associating the empty tuple with a formula ψ, which is a reduced grounding of φ.
Example 2. Let σ = {P} and ε = {E}, and let A be a σ-structure with P^A = {(1, 2, 3), (3, 4, 5)}. Answers to φ1 ≡ P(x, y, z) ∧ E(x, y) ∧ E(y, z), φ2 ≡ ∃z φ1 and φ3 ≡ ∃x∃y φ2 are shown in Table 1. Observe that δR(1, 2, 3) = E(1, 2) ∧ E(2, 3) is a reduced grounding of φ1[(1, 2, 3)] = P(1, 2, 3) ∧ E(1, 2) ∧ E(2, 3), and δR(1, 1, 1) = ⊥ is a reduced grounding of φ1[(1, 1, 1)]. E(1, 2) ∧ E(2, 3) is a reduced grounding of φ2[(1, 2)]. Notice that, as φ3 does not have any free variables, its corresponding answer has just a single row.

Table 1. Answers to φ1, φ2 and φ3

Answer to φ1:
x y z | ψ
1 2 3 | E(1, 2) ∧ E(2, 3)
3 4 5 | E(3, 4) ∧ E(4, 5)

Answer to φ2:
x y | ψ
1 2 | E(1, 2) ∧ E(2, 3)
3 4 | E(3, 4) ∧ E(4, 5)

Answer to φ3:
ψ
[E(1, 2) ∧ E(2, 3)] ∨ [E(3, 4) ∧ E(4, 5)]
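To make the table representation concrete, the following is a minimal sketch (ours, not the authors' implementation) of an extended X-relation as a mapping from tuples to formulas, with formulas kept as plain strings and the data taken from Example 2.

```python
# A minimal sketch of an extended X-relation: a mapping from tuples
# (instantiations of the free variables X) to formulas; absent tuples
# denote the formula "false".

class ExtendedRelation:
    def __init__(self, variables, rows=None):
        self.variables = tuple(variables)      # the tuple of variables X
        self.rows = dict(rows or {})           # gamma (a tuple) -> formula

    def delta(self, gamma):
        """delta_R(gamma): the stored formula, or 'FALSE' if gamma is absent."""
        return self.rows.get(tuple(gamma), "FALSE")

# Answer to phi1 = P(x,y,z) & E(x,y) & E(y,z) from Example 2.
R1 = ExtendedRelation(("x", "y", "z"), {
    (1, 2, 3): "E(1,2) & E(2,3)",
    (3, 4, 5): "E(3,4) & E(4,5)",
})

print(R1.delta((1, 2, 3)))   # E(1,2) & E(2,3)
print(R1.delta((1, 1, 1)))   # FALSE
```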
The relational algebra has operations corresponding to each connective and quantifier in FO, as follows: complement (negation); join (conjunction); union (disjunction); projection (existential quantification); division or quotient (universal quantification). Following [4,5], we generalize each to extended X-relations as follows.
Definition 5 (Extended Relational Algebra). Let R be an extended X-relation and S an extended Y-relation, both over domain A.
1. ¬R is the extended X-relation ¬R = {(γ, ψ) | γ : X → A, δR(γ) ≠ ⊤, and ψ = ¬δR(γ)};
2. R ⋈ S is the extended X ∪ Y-relation R ⋈ S = {(γ, ψ) | γ : X ∪ Y → A, γ|X ∈ R, γ|Y ∈ S, and ψ = δR(γ|X) ∧ δS(γ|Y)};
3. R ∪ S is the extended X ∪ Y-relation R ∪ S = {(γ, ψ) | γ|X ∈ R or γ|Y ∈ S, and ψ = δR(γ|X) ∨ δS(γ|Y)};
4. for Z ⊆ X, the Z-projection of R, denoted by πZ(R), is the extended Z-relation {(γ′, ψ) | γ′ = γ|Z for some γ ∈ R and ψ = ⋁_{γ∈R, γ′=γ|Z} δR(γ)};
5. for Z ⊆ X, the Z-quotient of R, denoted by dZ(R), is the extended Z-relation {(γ′, ψ) | ∀γ (γ : X → A ∧ γ|Z = γ′ ⇒ γ ∈ R), and ψ = ⋀_{γ∈R, γ′=γ|Z} δR(γ)}.
To ground using this algebra, we apply the algebra inductively on the structure of the formula, just as the standard relational algebra may be applied for query evaluation. We define the answer to an atomic formula P(x̄) as follows. If P is an instance predicate, the answer to P is the set of tuples (ā, ⊤), for ā ∈ P^A. If P is an expansion predicate, the answer is the set of all pairs (ā, P(ā)), where ā is a tuple of elements from the domain A. Correctness of the method then follows, by induction on the structure of the formula, from the following proposition.
Proposition 2. Suppose that R is an answer to φ1 and S is an answer to φ2, both with respect to (wrt) structure A. Then
1. ¬R is an answer to ¬φ1 wrt A;
2. R ⋈ S is an answer to φ1 ∧ φ2 wrt A;
3. R ∪ S is an answer to φ1 ∨ φ2 wrt A;
4. if Y is the set of free variables of ∃z̄ φ1, then πY(R) is an answer to ∃z̄ φ1 wrt A;
5. if Y is the set of free variables of ∀z̄ φ1, then dY(R) is an answer to ∀z̄ φ1 wrt A.
The proof for cases 1, 2 and 4 is given in [4]; the other cases follow easily.
The answer to an atomic formula P(x̄), where P is from the expansion vocabulary, is formally a universal table; in practice we may represent this table implicitly and avoid explicitly enumerating the tuples. As operations are applied, some subsets of columns remain universal, while others do not. Again, those columns which are universal may be represented implicitly. This could be treated as an implementation detail, but the use of such implicit representations dramatically affects the cost of operations, and so it is useful to further generalize our extended X-relations. We call the variables which are implicitly universal "hidden" variables, as they are not represented explicitly in the tuples, and the other variables "explicit" variables. We do not define this concept here; interested readers are referred to [5]. This basic grounding approach can ground only the axioms A1, A2 and A3 in Example 1.
2.1 FO MX with Arithmetic
In this paper, we are concerned with specifications written in FO extended with functions, arithmetic and aggregate operators. Informally, we assume that the domain of any instance structure is a subset of N (the set of natural numbers), and that arithmetic operators have their standard meanings. Details of the aggregate operators need to be specified,
but these also behave according to our normal intuitions. Quantified variables and the ranges of instance functions must be restricted to finite subsets of the integers, and the possible interpretations of expansion predicates and expansion functions must be restricted to a finite subset of N as well. This can be done by employing a multi-sorted logic in which all sorts are required to be finite subsets of N, or by requiring specification formulas to be written in a certain "guarded" form. In the rest of this paper, we assume that all variables range over the finite domain² T ⊂ N and that φ(t1(x̄), · · · , tk(x̄)) is a short-hand for ∃y1, · · · , yk : y1 = t1(x̄) ∧ · · · ∧ yk = tk(x̄) ∧ φ(y1, · · · , yk). Under these assumptions, we do not need to worry about the interpretation of predicates and functions outside T.
Syntax and Semantics of Aggregate Operators. We may use evaluation for formulas with expansion predicates: by evaluating a formula which has expansion predicates as true, we mean that there is a solution for the whole specification which satisfies the given formula too. Also, for the sake of presentation, we may use φ[ā, z̄2] as a short-hand for φ(z̄1, z̄2)[z̄1/ā], which denotes substituting ā for every occurrence of z̄1 in φ. Although our system supports grounding specifications with Max, Min, Sum and Count aggregates, for the sake of space we focus only on the Sum and Count aggregates in this paper:
– t(ȳ) = Max_x̄{t(x̄, ȳ) : φ(x̄, ȳ); dM(ȳ)}, for any instantiation b̄ for ȳ, denotes the maximum value obtained by t[ā, b̄] over all instantiations ā for x̄ for which φ[ā, b̄] is true, or dM if there is none. dM is the default value of the Max aggregate, which is returned whenever all conditions are evaluated as false.
– t(ȳ) = Min_x̄{t(x̄, ȳ) : φ(x̄, ȳ); dm(ȳ)} is defined dually to Max.
– t(ȳ) = Sum_x̄{t(x̄, ȳ) : φ(x̄, ȳ)}, for any instantiation b̄ of ȳ, denotes 0 plus the sum of all values t[ā, b̄] for all instantiations ā for x̄ for which φ[ā, b̄] is true.
– t(ȳ) = Count_x̄{φ(x̄, ȳ)}, for any instantiation b̄ for ȳ, denotes the number of tuples ā for which φ[ā, b̄] is true.
Since Count_x̄{φ(x̄, ȳ)} = Sum_x̄{1 : φ(x̄, ȳ)}, in the rest of this paper we assume that all terms with a Count aggregate are replaced by the appropriate terms in which Count is replaced with Sum, and we do not discuss the Count aggregate any further.
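To make the intended semantics of these aggregate terms concrete, the following sketch (our own illustration, not part of the paper) evaluates Sum and Count by brute-force enumeration over a small finite domain; the instance data below are invented.

```python
# Brute-force semantics of Sum_x{ t(x, y) : phi(x, y) } and Count_x{ phi(x, y) }
# over a finite domain, for a fixed instantiation of y.

def agg_sum(domain, t, phi, y):
    return sum(t(x, y) for x in domain if phi(x, y))   # 0 if no x satisfies phi

def agg_count(domain, phi, y):
    # Count_x{phi} = Sum_x{1 : phi}
    return agg_sum(domain, lambda x, y: 1, phi, y)

# Invented instance: items 0..2 with weights W; Q(l, k) says item l is in knapsack k.
W = {0: 7, 1: 3, 2: 5}
Q = {(0, 0), (2, 0), (1, 1)}
load_in_k0 = agg_sum(range(3), lambda l, k: W[l], lambda l, k: (l, k) in Q, 0)
print(load_in_k0)                                         # 12
print(agg_count(range(3), lambda l, k: (l, k) in Q, 0))   # 2
```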
3 Evaluating Out Arithmetic and Instance Functions
The relational algebra-based grounding algorithm described in Section 2 is designed for the relational (function-free) case. Below, we extend it to the case where the arguments of atomic formulas may be complex terms. In this section, we present a simple method for the special case where terms do not contain expansion predicates/functions, and so can be evaluated purely on the instance structure.
Recall that an answer to a sub-formula φ(X) of a specification is an extended X-relation R. If |X| = k, then the tuples of R have arity k. Now, consider an atomic formula whose arguments are terms containing instance functions and arithmetic operations, e.g., φ = P(x + y). As discussed previously, φ ⇔ ∃z(z = x + y ∧ P(z)). Although we have not discussed the handling of the sub-formula z = x + y, it is apparent that the answer to φ, with free variables {x, y}, is an extended {x, y}-relation R, which can be defined as the set of all tuples (a, b, ψ) such that a + b is in the interpretation of P. To adapt the grounding algorithm of the previous section, we revise the base cases of the definition as follows:
² A more general version, where each variable may have its own domain, is implemented, but is more complex to explain.
Definition 6 (Base Cases for Atoms with Evaluable Terms). For an atomic formula φ = P(t1, · · · , tn) with terms t1, . . . , tn and free variables X, use the following extended X-relation (which is an answer to φ wrt A):
1. P is an instance predicate: {(γ, ⊤) | A |= P(t1, . . . , tn)[γ]};
2. P is t1(x̄) ⊙ t2(x̄), where ⊙ ∈ {=, <}: {(γ, ⊤) | A |= (t1 ⊙ t2)[γ]};
3. P is an expansion predicate: {(γ, P(a1, . . . , an)) | A |= (t1 = a1 ∧ . . . ∧ tn = an)[γ]}.
Terms involving aggregate operators, provided the formula argument to the operator contains only instance predicates and functions with a given interpretation, can also be evaluated out in this way. In Example 1, this extension enables us to ground A6.
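As an illustration of Definition 6 (not taken from the paper's implementation), the sketch below computes the answer table for an atom P(t1, ..., tn) whose terms can be evaluated on the instance structure, by enumerating instantiations of the free variables; the instance data are invented.

```python
from itertools import product

def answer_instance_atom(domain, variables, terms, P_interpretation):
    """Answer to P(t1,...,tn) when P is an instance predicate and every term
    t_i is a Python function of the variable assignment (Definition 6, case 1).
    Returns {gamma: 'TRUE'} for the satisfying instantiations."""
    answer = {}
    for values in product(domain, repeat=len(variables)):
        gamma = dict(zip(variables, values))
        if tuple(t(gamma) for t in terms) in P_interpretation:
            answer[values] = "TRUE"
    return answer

# phi = P(x + y) over domain {0,...,4}, with P^A = {2, 3} (invented instance).
domain = range(5)
P_A = {(2,), (3,)}
ans = answer_instance_atom(domain, ("x", "y"), [lambda g: g["x"] + g["y"]], P_A)
print(sorted(ans))   # all (x, y) with x + y in {2, 3}
```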
4 Answers to Terms
Terms involving expansion functions or predicates, including aggregate terms involving expansion predicates, can only be evaluated with respect to a particular interpretation of those expansion predicates/functions. Thus, they cannot be evaluated out during grounding as in Section 3, and they must somehow be represented in the ground formula. We call a term which cannot be evaluated on the instance structure alone a complex term. In this section, we further extend the base cases of our relational-algebra-based grounding method to handle atomic formulas with complex terms. The key idea is to introduce the notion of an answer to a term. The new base cases then construct an answer to an atom from the answers to the terms which are its arguments. The terms we allow here include arithmetic expressions, instance functions, expansion functions, and aggregate operators involving these. The axioms A4 and A5 in Example 1 contain such terms.
Let t be a term with free variables X, and A a σ-structure. Let R be a pair (αR, βR) such that αR is a finite subset of N, and βR is a function mapping each element a ∈ αR to an extended X-relation βR(a). Intuitively, αR is the set of all possible values that the term t(X̄) may take, and βR(a) is a table representing all instantiations of X under which t might evaluate to a. We sometimes use Ra as a shorthand for βR(a). We define βR(a) = ∅ for a ∉ αR. Recall that we defined δR(γ) to be ψ iff (γ, ψ) ∈ R. We may also use δR(γ, n) and δβR(n)(γ) interchangeably.
Definition 7 (Answer to term t wrt A). We say that R = (αR, βR) is an answer to term t wrt A if, for every a ∈ αR, the extended X-relation βR(a) is an answer to the formula (t = a) wrt A, and for every a ∉ αR, the formula (t = a) is not satisfiable wrt A.
Note that with this definition, αR can be either the range of t or a superset of it.
Example 3. (Continuation of Example 1) Let ψ(l1, l2) = W(l1) − W(l2) ≤ Wl, where the domains of both l1 and l2 are L = {0, 1, 2}. Let A be a σ-structure with W^A = {(0 → 7), (1 → 3), (2 → 5)} and Wl^A = 2. Let t = Wl, ti = li, t′i = W(li) (i ∈ {1, 2}), and t′ = t′1 − t′2 be the terms in ψ, and let R, Ri, R′i (i ∈ {1, 2}), R′ be answers to these terms, respectively. Then αR = {2}, αRi = {0, 1, 2}, αR′i = {3, 5, 7} and αR′ = {0..4}.
We now give properties that are sufficient for particular extended X-relations to constitute answers to particular terms. For a tuple X of variables of arity k, define DX to be the set of all k-tuples of domain elements, i.e., DX = A^k.
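Below is a small sketch (ours, using the instance of Example 3) of the (αR, βR) representation of an answer to a term: α is the set of values the term can take, and β maps each value to the extended relation answering t = a.

```python
from itertools import product

# Answer to a term t with free variables X, represented as (alpha, beta):
#   alpha: set of values the term may take
#   beta:  value -> extended X-relation (here: dict from tuples to formulas)

def answer_to_instance_term(domain, variables, term):
    """Answer to a simple term that can be evaluated on the instance
    (Proposition 3, case 1): beta(n) holds (gamma, TRUE) whenever t[gamma] = n."""
    alpha, beta = set(), {}
    for values in product(domain, repeat=len(variables)):
        n = term(dict(zip(variables, values)))
        alpha.add(n)
        beta.setdefault(n, {})[values] = "TRUE"
    return alpha, beta

# W from Example 3: W(0)=7, W(1)=3, W(2)=5; the term t'_1 = W(l1).
W = {0: 7, 1: 3, 2: 5}
alpha, beta = answer_to_instance_term(range(3), ("l1",), lambda g: W[g["l1"]])
print(sorted(alpha))      # [3, 5, 7]
print(beta[5])            # {(2,): 'TRUE'}
```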
Proposition 3 (Answers to Terms). Let R be the pair (αR, βR), and t a term. Assume that t1, . . . , tm are terms, and that R1, . . . , Rm (respectively) are answers to those terms wrt A. Also, let S be an answer to φ. Then R is an answer to t wrt A if:
(1) t is a simple term (i.e., involves only variables, instance functions, and arithmetic operators): αR = {n ∈ N | ∃ā ∈ DX : t[ā] = n}, and for all n ∈ αR, βR(n) is an answer to t = n computed as described in Definition 6.
(2) t is a term of the form t1 + t2:
αR = {x + y | x ∈ αR1 and y ∈ αR2},
βR(n) = ⋃_{j∈αR1, k∈αR2, n=j+k} βR1(j) ⋈ βR2(k).
(3) t is a term of the form t1 − t2 or t1 × t2: similar to case (2).
(4) t is a term of the form f(t1, · · · , tm), where f is an instance function:
αR = {y | for some x1 ∈ αR1, . . . , xm ∈ αRm, f(x1, . . . , xm) = y},
βR(n) = ⋃_{a1∈αR1, ..., am∈αRm s.t. f(a1,...,am)=n} βR1(a1) ⋈ · · · ⋈ βRm(am).
Intuitively, βR(n) is the combination of all possible ways in which f can evaluate to n.
(5) t is a term of the form f(t1, · · · , tm), where f is an expansion function. We introduce an expansion predicate Ef(x̄, y) for each expansion function f(x̄), where the type of y is the same as the range of f. Then αR is equal to the range of f, and
βR(n) = ⋃_{a1∈αR1, ..., am∈αRm} βR1(a1) ⋈ · · · ⋈ βRm(am) ⋈ T_{a1,··· ,am,n},
where T_{a1,··· ,am,n} is an answer to ∃x̄ (⋀_i xi = ai ∧ y = n ∧ Ef(x1, · · · , xm, y)). βR(n) expresses that f(t1, · · · , tm) is equal to n under assignment γ iff ti[γ] = ai and f(a1, · · · , am) = n.
(6) t is Sum_x̄{t1(x̄, ȳ) : φ(x̄, ȳ)}: αR = {Σ_{ā∈Dx̄} f(ā) : f : Dx̄ → {0} ∪ αR1}. Let
δ′R1(γ, n) = δR1(γ, n) if n ≠ 0, and δ′R1(γ, 0) = δR1(γ, 0) ∨ ¬δS(γ).
Then for each assignment b̄ : ȳ → Dȳ:
δR(b̄, n) = ⋁_{f : Dx̄ → {0}∪αR1 s.t. Σ_{ā∈Dx̄} f(ā) = n} ⋀_{ā∈Dx̄} δ′R1(ā, b̄, f(ā)).
For a fixed instantiation b̄ of ȳ, each instantiation ā of x̄ might or might not contribute to the output value of the aggregate when ȳ is equal to b̄: ā contributes to the output of the aggregate iff B |= φ(ā, b̄). δ′R1(ā, b̄, f(ā)) describes the condition under which t1(ā, b̄) contributes f(ā) to the output. So, for a given mapping f from Dx̄ to {0} ∪ αR1, we take the conjunction of the conditions obtained from δ′R1 to find the necessary and sufficient condition for one of the cases where the output sum is exactly n, and the outer disjunction combines these cases into the complete condition.
Although what is described in case (6) can be used directly to find an answer for Sum aggregates, in practice many of the entries in R will be eliminated during grounding as they are joined with a false formula or unioned with a true formula. So, to reduce the grounding time, we use a place-holder, the SUM place-holder, of the form SUM(R1, S, n, γ), as the formula corresponding to δR(γ, n). The sum gadget is stored and propagated during grounding as a formula.
Table 2. Tables for Example 4: (a) answer to Wl = 2, i.e., βR(2); (b) answer to l1 = 0, i.e., βR1(0); (c) answer to l1 = 1, i.e., βR1(1); (d) answer to W(l1) = 5, i.e., βR′1(5); (e) answer to W(l2) = 7, i.e., βR′2(7); (f) answer to W(l1) − W(l2) = 2, i.e., βR′(2)
(a) ψ = True
(b) l1 = 0: ψ = True
(c) l1 = 1: ψ = True
(d) l1 = 2: ψ = True
(e) l2 = 0: ψ = True
(f) (l1, l2) = (0, 1): ψ = False; (l1, l2) = (2, 1): ψ = True
After the grounding phase ends, the engine enters the CNF generation phase, in which a SAT instance is created from the reduced grounding obtained. In the CNF generation phase, the ground formula is traversed and, using the standard Tseitin transformation [6], a corresponding CNF is generated³. While the engine traverses the formula tree produced in the grounding phase, it might encounter a SUM place-holder. If this happens, the engine passes the SUM place-holder to the SUM gadget, which in turn converts the SUM place-holder to CNF. This design has another benefit: if we decide to use a SAT solver which is capable of handling Pseudo-Boolean constraints natively, the SUM gadget can easily be changed to generate Pseudo-Boolean constraints from the SUM place-holder. An implementation of the Sum gadget can be found in Appendix A.
Example 4. (Continuation of Example 3) βRi corresponds to the answer to the variable li (i ∈ {1, 2}), so R1 (R2) has one free variable, namely l1 (l2). Given an answer to ti, an answer to t′i, (αR′i, βR′i), can be computed. By Proposition 3, we have βR′(2) = βR′1(7) ⋈ βR′2(5) ∪ βR′1(5) ⋈ βR′2(3). In other words, the answer to t′ is 2 if either t′1 = 7 ∧ t′2 = 5 or t′1 = 5 ∧ t′2 = 3.
4.1 Base Case for Complex Terms
To extend our grounding algorithm to handle terms which cannot be evaluated out, we add the following base cases to the algorithm.
Definition 8 (Base Case for Atoms with Complex Terms). Let t1, · · · , tm be terms, and assume that R1, . . . , Rm (respectively) are answers to those terms wrt structure A. Then R is an answer to P(t1, . . . , tm) wrt A if
1. P(...) is t1 = t2: R = ⋃_{i∈αR1∩αR2} βR1(i) ⋈ βR2(i);
2. P(...) is t1 ≤ t2: R = ⋃_{i∈αR1, j∈αR2, i≤j} βR1(i) ⋈ βR2(j);
3. P is an instance predicate: R = ⋃_{(a1,··· ,am)∈P^A, ai∈αRi} βR1(a1) ⋈ · · · ⋈ βRm(am);
4. P is an expansion predicate and R is an answer to ∃x1 . . . xm (x1 = t1 ∧ · · · ∧ xm = tm ∧ P(x1, . . . , xm)).
Example 5. (Continuation of Example 3) Although ψ does not have any complex term, to demonstrate how the base cases are handled, we describe the process of computing an answer for ψ. We have already computed answers to t′ and t. To compute an answer to ψ(l1, l2) = t′(l1, l2) ≤ Wl, one needs to find the union of βR′(n) ⋈ βR(m) for m ∈ αR = {2} and n ≤ m, i.e., n ∈ {0, 1, 2}. In this example, {(0, 2, ⊤), (2, 1, ⊤)} is an answer to ψ.
³ It is not the purpose of this paper to discuss the techniques we have used in this phase.
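As an aside, the Tseitin transformation used in the CNF generation phase can be sketched in a few lines (our own illustration, not the Enfragmo implementation): each AND/OR node of the ground formula receives an auxiliary variable, and clauses enforce the equivalence between that variable and the node.

```python
# Minimal Tseitin-style clause generation for ground formulas built from
# atoms, AND and OR. A formula is either an atom name (a string) or a
# nested tuple ("and", f1, f2) / ("or", f1, f2).

def tseitin(formula, clauses, var_of, fresh):
    if isinstance(formula, str):                 # atom: reuse its variable
        if formula not in var_of:
            var_of[formula] = fresh()
        return var_of[formula]
    op, left, right = formula
    a = tseitin(left, clauses, var_of, fresh)
    b = tseitin(right, clauses, var_of, fresh)
    v = fresh()                                  # auxiliary (Tseitin) variable
    if op == "and":                              # v <-> (a AND b)
        clauses += [[-v, a], [-v, b], [-a, -b, v]]
    else:                                        # v <-> (a OR b)
        clauses += [[-a, v], [-b, v], [-v, a, b]]
    return v

counter = [0]
def fresh():
    counter[0] += 1
    return counter[0]

clauses, var_of = [], {}
root = tseitin(("or", ("and", "E(1,2)", "E(2,3)"), ("and", "E(3,4)", "E(4,5)")),
               clauses, var_of, fresh)
clauses.append([root])                           # assert the ground formula itself
print(len(var_of), "atom variables,", len(clauses), "clauses")
```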
5 Experimental Evaluation
In this section we report empirical observations on the performance of an implementation of the methods we have described. Thus far, we have presented our approach to grounding aggregates and arithmetic. As a motivating example, we show how the haplotype inference problem [7] can be axiomatized in our grounder. To argue that the CNF generated through our grounder is efficient, we use a well-known, optimized encoding for the haplotype inference problem and show that the same CNF can be obtained without much hardship.
In the haplotype inference problem, we are given an integer r and a set G consisting of n strings in {0, 1, 2}^m, for a fixed m. We are asked whether there exists a set H of r strings in {0, 1}^m such that for every g ∈ G there are two strings in H which explain g. We say two strings h1 and h2 explain a string g iff for every position 1 ≤ i ≤ m either g[i] = h1[i] = h2[i], or g[i] = 2 and h1[i] ≠ h2[i]. The following axiomatization is intentionally written so as to generate the same CNF encoding as presented in [7], under the assumption that the gadget used for Count is a simplified adder circuit [7].
1. ∀i ∀j (g(i, j) = 0 ⊃ ∃k ((¬h(k, j) ∨ ¬S^a(k, i)) ∧ (¬h(k, j) ∨ ¬S^b(k, i))))
2. ∀i ∀j (g(i, j) = 1 ⊃ ∃k ((h(k, j) ∨ ¬S^a(k, i)) ∧ (h(k, j) ∨ ¬S^b(k, i))))
3. ∀i ∀j (ga(i, j) = gb(i, j))
4. ∀i ∀j (g(i, j) = 2 ⊃ ∃k ((h(k, j) ∨ ¬ga(i, j) ∨ ¬S^a(k, i)) ∧ (¬h(k, j) ∨ ga(i, j) ∨ ¬S^a(k, i)) ∧ (h(k, j) ∨ ¬gb(i, j) ∨ ¬S^b(k, i)) ∧ (¬h(k, j) ∨ gb(i, j) ∨ ¬S^b(k, i))))
5, 6. ∀i (Count_k(S^a(k, i)) = 1) ∧ ∀i (Count_k(S^b(k, i)) = 1)
In the above axiomatization, g(i, j) is an instance function which gives the character at position j of the i-th string in G. The expansion predicate h(k, i) is true iff the i-th position of the k-th string in H is one. The expansion predicate S^a(k, i) is true iff the k-th string in H is one of the explanations for the i-th string in G; S^b has a similar meaning. ga(i, j) and gb(i, j) are peripheral variables which are used in axiom (4).
Table 3 shows detailed information about the running time on haplotype inference instances produced by the ms program [7]. The axiomatization given above corresponds to the row labelled "Optimized Encoding". The other row, labelled "Basic Encoding", also comes from the same paper [7] but, as noted there and shown here, produces CNFs that take more time to solve.
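The "explains" condition used in the problem statement above can be made concrete in a few lines of code (our illustration; the example strings are invented):

```python
# Two haplotypes h1, h2 in {0,1}^m explain a genotype g in {0,1,2}^m iff at
# every position i either g[i] = h1[i] = h2[i], or g[i] = 2 and h1[i] != h2[i].

def explains(h1, h2, g):
    return all((gi == a == b) or (gi == 2 and a != b)
               for a, b, gi in zip(h1, h2, g))

print(explains([0, 1, 1], [0, 0, 1], [0, 2, 1]))   # True
print(explains([0, 1, 1], [0, 1, 1], [0, 2, 1]))   # False: position 1 needs h1 != h2
```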
Table 3. Haplotyping Problem Statistics

                      Grounding   SAT Solving   CNF Size
Basic Encoding        2.2 s       12.3 s        18.9 MB
Optimized Encoding    1.9 s       0.95 s        13.3 MB
Thus, using our system, Enfragmo, as a grounder, we have been able to describe the problem in a high-level language and yet reproduce the same CNF files that were obtained through direct reductions. Enfragmo therefore enables us to try different reductions faster. Of course, once a good reduction is found, one can always use a direct reduction to achieve higher grounding speed, although, as Table 3 shows, Enfragmo also has a moderate grounding time compared to the solving time.
Another noteworthy point is that different gadgets perform differently under different combinations of problems and instances. Using different gadgets therefore also enables a knowledgeable user to choose the gadget that serves them best. The process of choosing a gadget can also be automated through heuristics in the grounder.
6 Related Work
The ultimate goal of all systems like ours is to offer a high-level syntax which eases the task of problem description for both naive and expert users. To achieve this goal, these systems should be extended to handle complex terms. As different systems use different grounding approaches, each of them has its own specific way of handling complex terms.
Essence [8] is a declarative modelling language for specifying search problems. In Essence, there are no expansion predicates; users describe their problems using expansion functions (which are variables, arrays of variables, matrices of variables and so on), instance predicates and mathematical operators. The problem description is then transformed into a Constraint Satisfaction Problem (CSP) instance by an engine called Tailor. As there is no standard input format for CSP solvers, Tailor has to be developed separately for each CSP solver. Unlike SAT solvers, which are only capable of handling Boolean variables, CSP solvers can work with instances in which the variables' domains are arbitrary. In [8], a method called flattening is described which resembles the Tseitin transformation: the flattening process describes a complex term by introducing auxiliary variables and decomposing the complex term into simpler terms. The flattening method is also used in the Zinc system [9].
IDP [10] is a system for model expansion whose input is an extension of first-order logic with inductive definitions. Essentially, the syntax of IDP is very similar to that of our system, but its approach to grounding a given specification is different. A ground formula is created using a top-down procedure. The formula is written in Term Normal Form (TNF), in which all arguments to predicates are variables and complex terms can only appear in atomic formulas of the form x ⊙ t(ȳ) with ⊙ ∈ {≤, <, =, >, ≥}. The atomic formulas which have complex terms are then rewritten as disjunctions or conjunctions of atomic formulas of the form x < t(ȳ) and x > t(ȳ) [11]. The ground solver used by the IDP system is an extension of a regular SAT solver which is capable of handling aggregates internally. This enables IDP to translate specifications and instances into its ground solver's input.
7 Conclusion
In model-based problem solving, users need to answer questions of the form "What is the problem?" or "How can the problem be described?". In this approach, systems with a high-level language help users considerably and reduce the amount of expertise a user needs to have, and thus open a way of solving computationally hard AI problems to a wider variety of users. In this paper, we described how we extended our engine to handle complex terms. Having access to aggregates and arithmetical operators eases the task of describing problems for our system and enables more users to work with our system to solve their theoretical and real-world problems. We also extended our grounder so that it is able to convert the new constructs to CNF, and further showed that our grounder can reproduce from the high-level language the same CNF files as the ones obtained through direct reductions.
Acknowledgements The authors are grateful to the Natural Sciences and Engineering Research Council of Canada (NSERC), MITACS and D-Wave Systems for their financial support. In addition, the anonymous reviewers’ comments helped us in improving this paper and clarifying the presentation.
References
1. Frisch, A.M., Grum, M., Jefferson, C., Hernandez, B.M., Miguel, I.: The design of ESSENCE: a constraint language for specifying combinatorial problems. In: Proc. IJCAI 2007 (2007)
2. Mitchell, D., Ternovska, E.: A framework for representing and solving NP search problems. In: Proc. AAAI 2005 (2005)
3. Gent, I., Jefferson, C., Miguel, I.: Minion: A fast, scalable, constraint solver. In: ECAI 2006: 17th European Conference on Artificial Intelligence, Proceedings Including Prestigious Applications of Intelligent Systems (PAIS 2006), Riva del Garda, Italy, August 29-September 1, vol. 98, p. 98. IOS Press, Amsterdam (2006)
4. Patterson, M., Liu, Y., Ternovska, E., Gupta, A.: Grounding for model expansion in k-guarded formulas with inductive definitions. In: Proc. IJCAI 2007, pp. 161–166 (2007)
5. Mohebali, R.: A method for solving NP search based on model expansion and grounding. Master's thesis, Simon Fraser University (2006)
6. Tseitin, G.: On the complexity of derivation in propositional calculus. Studies in Constructive Mathematics and Mathematical Logic 2(115-125), 10–13 (1968)
7. Lynce, I., Marques-Silva, J.: Efficient haplotype inference with Boolean satisfiability. In: AAAI. AAAI Press, Menlo Park (2006)
8. Rendl, A.: Effective compilation of constraint models (2010)
9. Nethercote, N., Stuckey, P., Becket, R., Brand, S., Duck, G., Tack, G.: MiniZinc: Towards a standard CP modelling language. In: Bessière, C. (ed.) CP 2007. LNCS, vol. 4741, pp. 529–543. Springer, Heidelberg (2007)
10. Wittocx, J., Mariën, M., De Pooter, S.: The IDP system (2008), obtainable via www.cs.kuleuven.be/dtai/krr/software.html
11. Wittocx, J.: Finite domain and symbolic inference methods for extensions of first-order logic. AI Communications (2010) (accepted)
12. Asín, R., Nieuwenhuis, R., Oliveras, A., Rodríguez-Carbonell, E.: Cardinality Networks and Their Applications. In: Kullmann, O. (ed.) SAT 2009. LNCS, vol. 5584, pp. 167–180. Springer, Heidelberg (2009)
13. Eén, N.: SAT Based Model Checking. PhD thesis (2005)
A SUM Gadget
We could have an implementation which constructs answers to complex terms by taking literally the conditions described in Proposition 3. However, we would expect this implementation to result in a system with poor performance. In the grounding algorithm, the function which generates ψ for a tuple (γ, ψ) may produce any formula logically equivalent to φ. We may think of such functions as "gadgets", in the sense this term is used in reductions to prove NP-completeness. The choice of these gadgets is important for some constraints; for example, choosing CNF representations of aggregates is an active area of study in the SAT community (e.g., see [12]). Our method allows these choices to be made at run time, either by the user or automatically. As described in the previous sections, to compute an answer to an aggregate one needs to find a set αR ⊂ N and a function βR which maps every integer to a ground formula.
In Section 4, we showed what the set αR is for each term and described the properties of the output of the βR function. Here, we present one method to construct a SUM gadget which can be used during the CNF generation phase. A gadget for the Sum aggregate, denoted by S(R1, S, n), takes an answer to a term, R1, an answer to a formula, S, and an integer, and returns a CNF formula f. Assume the original Sum aggregate is t(ȳ) = Sum_x̄{t1(x̄, ȳ) : φ(x̄, ȳ)}, where R1 is an answer to t1 and S is an answer to φ. Let Ti = βR1(i) ⋈ S for all i ∈ αR1. Then ψ = δTi(γ) is the necessary and sufficient condition for t1(γ) to be equal to i. Remember that the SUM gadget is called during CNF generation and returns a Tseitin variable which is true iff t(...) = n. The gadget generates/retrieves a Tseitin variable vψ for ψ(γ) if ψ = δTi(γ) ≠ ⊥, and stores the pair (i, vψ). After fetching all these pairs, (n1, v1), · · · , (nk, vk), the SUM gadget starts generating a CNF for t(γ) = n. In fact, our gadget can be any valid encoding of Pseudo-Boolean constraints, such as a Binary Decision Diagram (BDD) based encoding, a sorting network based encoding, etc. [13]. Here, we describe the BDD-based encoding. Let T = {(n1, f1), · · · , (nk, fk)}. Define the output of the gadget to be F_k^n, where the F_r^s are constructed inductively according to the following definitions:

F_{r+1}^s = (F_r^s ∧ ¬f_{r+1}) ∨ (F_r^{s−n_{r+1}} ∧ f_{r+1}),
F_r^s = ⊤ if r = 0 and s = 0,
F_r^s = ⊥ if r = 0 and s ≠ 0.
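A small sketch (ours, not the paper's implementation) of the recurrence above: it computes F_k^n over the pairs (n_1, v_1), ..., (n_k, v_k) by memoized recursion, with formulas kept as strings for readability; a real gadget would emit CNF and simplify the TRUE/FALSE leaves.

```python
from functools import lru_cache

# F_r^s is a formula that is true iff the weights of the true variables among
# v_1..v_r sum to exactly s.

def sum_gadget(pairs, target):
    """pairs = ((n_1, v_1), ..., (n_k, v_k)); returns F_k^target as a string."""
    @lru_cache(maxsize=None)
    def F(r, s):
        if r == 0:
            return "TRUE" if s == 0 else "FALSE"
        n, v = pairs[r - 1]
        keep = F(r - 1, s)        # v_r is false: the first r-1 variables reach s
        take = F(r - 1, s - n)    # v_r is true:  the first r-1 variables reach s-n
        return f"(({keep}) & ~{v}) | (({take}) & {v})"
    return F(len(pairs), target)

print(sum_gadget(((7, "v1"), (5, "v2"), (3, "v3")), 8))
```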
Moving Object Modelling Approach for Lowering Uncertainty in Location Tracking Systems
Wegdan Abdelsalam¹, David Chiu¹, Siu-Cheung Chau², Yasser Ebrahim², and Maher Ahmed²
¹ School of Computer Science, University of Guelph
² Physics & Computer Science Department, Wilfrid Laurier University
Abstract. This paper introduces the concept of Moving Object (MO) modelling as a means of managing the uncertainty in the location tracking of human moving objects travelling on a network. For previous movements of the MOs, the uncertainty stems from the discrete nature of location tracking systems, where gaps are created among the location reports. Future locations of MOs are, by definition, uncertain. The objective is to maximize the estimation accuracy while minimizing the operating costs. Keywords: Moving object modelling, Managing uncertainty, Location tracking Systems.
1 Introduction
Location Tracking Systems (LTSs) are built to answer queries about the whereabouts of moving objects in the past, present, or future. To answer such queries, each moving object monitored by the system must report its location periodically using a sensing device such as the Global Positioning System (GPS). The location reports are then saved to a database, where they are indexed to facilitate answering user queries.
In spite of the continuous nature of an MO's movement, location data can only be acquired at discrete times. This leaves the location of the MO unknown for the periods of time between the location reports. It is economically infeasible to capture and store a continuous stream of location data for each MO. Recording location reports discretely introduces uncertainty about the location of MOs between reports.
Lowering uncertainty has been addressed by a number of researchers over the past few years [1–4]. These approaches try to find a link between the amount of uncertainty and the frequency of location reporting: by increasing the reporting frequency, the uncertainty can be kept within acceptable bounds. We believe that there is a need for a new approach to lowering the uncertainty without increasing the reporting frequency. This new approach must be integrated with the system database in a way that facilitates the efficient execution of user queries.
We propose an MO modelling approach for lowering the uncertainty about MO locations in an LTS. MO modelling includes collecting information about the environment (i.e., the context) under which MOs operate. The MO model is used to reach a more accurate estimate of the MO's location. This is done by estimating the location-calculation parameters (e.g., speed and route) from the MO's historical data, collected for these parameters under similar circumstances.
2 Moving Object Modelling
The goal is to equip location-tracking systems with an MO model for each of the MOs being tracked. The model encompasses the MO's characteristics, preferences, and habits. The location-tracking application utilizes the information about the MO to more accurately estimate his/her position. Because the location of an MO is determined primarily by the chosen route and the travelling speed, the MO is modelled with respect to these two variables.
2.1 MO Speed Model
Our proposed approach adopts a Bayesian Network (BN) to build the MO speed model. Figure 1 depicts a BN for the suggested MO speed model. As shown in the figure, the BN structure is a singly connected DAG, most often referred to as a polytree [8]. The child node Speed is influenced by three parent nodes: the driving condition (DC), the level of service (LoS), and the road speed limit (SL). In turn, the driving condition is affected by two parent nodes: weather condition (WC) and road type (RT). The LoS is affected by three parent nodes: day-of-week (DW), time-of-day (TD), and area of city (A).
Fig. 1. Example BN for the MO speed model

To build the model, the first step is to determine the possible states for each variable (i.e., node) in the Bayesian Network. It is possible to either intuitively
choose the values or elicit them from the domain expert. For example, the speed variable can take the finite discrete values 0, 1, 2, 3, ..., 199, 200 km/h, representing all the possible MO speeds (assuming no decimal values). For the road type, a Geographic Information System (GIS) is consulted for the possible road types in the city.
The next step is to initialize the CPT with the probability of each state of the node given the possible states of its parents. In a polytree, the size of the CPT of each variable is determined by the possible states of its parent(s). Each entry (i.e., probability value) in the CPT corresponds to a combination of the parent nodes' states together with one state of the node itself. The state of the Speed variable is inferred according to the evidence on the root variables. The evidence is propagated down the network (using Pearl's belief propagation algorithm). The resulting probability table is then queried for the most probable speed (i.e., the one with the highest probability).
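The inference step just described can be illustrated with a toy example (ours; all probability values below are invented). For brevity, the sketch conditions directly on Speed's parents rather than propagating evidence from the root nodes with belief propagation: given evidence for the parents, the most probable speed is the state maximising the conditional probability.

```python
# Toy illustration of querying the speed CPT given evidence on its parents.
# CPT: (driving_condition, level_of_service, speed_limit) -> {speed: probability}
# All numbers are invented for illustration only.

CPT_SPEED = {
    ("good", "free_flow", 50): {30: 0.05, 40: 0.25, 50: 0.60, 60: 0.10},
    ("poor", "congested", 50): {10: 0.40, 20: 0.35, 30: 0.20, 40: 0.05},
}

def most_probable_speed(driving_condition, level_of_service, speed_limit):
    dist = CPT_SPEED[(driving_condition, level_of_service, speed_limit)]
    return max(dist, key=dist.get)        # state with the highest probability

print(most_probable_speed("good", "free_flow", 50))   # 50
print(most_probable_speed("poor", "congested", 50))   # 10
```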
2.2 MO Trip Route Model
In principle, a trip route is determined by the trip source and destination. Different MOs can take different routes, based on their preferences; for example, some MOs prefer to take the shortest route, while others may prefer the fastest route. For each trip (i.e., source/destination pair) of each MO, a directed graph is built to represent the route, such that a node represents a road segment and an edge represents a connection between two road segments. Each edge is given a weight representing the probability that the MO makes the transition from the parent node to the child node. The edge weight is based on the frequency with which the transition is made, relative to the total number of transitions from the parent node. Each edge is associated with a counter that is incremented each time the transition is made.
The graph is built based on the received location reports. If the reported road segment is on the graph, the transition frequency counter is incremented. If not, a new node is added to the graph and its frequency counter is initialized to 1. The graph nodes represent all the road segments ever visited on any instance of this trip. The most probable route is the shortest maximal-weight path from the source node to the destination node. From the most probable route and the most probable speed on each road segment along this route, the system creates the estimated MO trajectory for the trip. Figure 2 shows the model for the trip from source 1 to destination 13.
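A minimal sketch (ours, not the authors' code) of this trip-route model follows: transition counters between road segments, edge weights as relative frequencies, and a greedy walk along the highest-weight edges as a simplification of the "shortest maximal-weight path" (it assumes the observed routes contain no cycles).

```python
from collections import defaultdict

# Trip-route model for one (source, destination) pair: nodes are road
# segments, edge counters record how often each transition was observed.

class TripRouteModel:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))   # parent -> child -> count

    def observe(self, segments):
        """Update counters from one sequence of reported road segments."""
        for parent, child in zip(segments, segments[1:]):
            self.counts[parent][child] += 1

    def weight(self, parent, child):
        total = sum(self.counts[parent].values())
        return self.counts[parent][child] / total if total else 0.0

    def most_probable_route(self, source, destination, route=None):
        """Greedy walk along the highest-weight outgoing edge (a simplification
        of the shortest maximal-weight path)."""
        route = (route or []) + [source]
        if source == destination or source not in self.counts:
            return route
        best = max(self.counts[source], key=lambda c: self.weight(source, c))
        return self.most_probable_route(best, destination, route)

m = TripRouteModel()
m.observe([1, 2, 5, 13])
m.observe([1, 2, 5, 13])
m.observe([1, 3, 5, 13])
print(m.most_probable_route(1, 13))   # [1, 2, 5, 13]
```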
2.3 Estimated Trajectory Updating
Sometimes the estimated trajectory of the MO needs to be updated based on the actual location reports received, namely when a certain degree of deviation is detected between the estimated trajectory (based on the MO model) and the actual location reports. This deviation can occur because the MO is either following the estimated route but at a different speed than expected (called a schedule
deviation), or because the MO takes a different route from the estimated one (called a route deviation). With either type of deviation, continuing to use the estimated trajectory to answer future queries might produce incorrect results.
When an MO is detected to be off-schedule, the remainder of the estimated trajectory can be adjusted in one of two ways. If the MO is behind schedule, the remaining trajectory is shifted forward one reporting interval to reflect that the trip may take longer than expected. If the MO is ahead of schedule, the remaining trajectory is shifted backward one or more reporting interval(s) to reflect that the trip may finish sooner than expected.
When a route deviation is detected, the trip route model is checked to see whether there are alternate routes that have been taken by the MO in the past. By comparing the road segments travelled so far (as suggested by the actual location reports received) to the road segment sequences in the trip model, it may be possible to find a match that suggests the route the MO is actually taking.
Fig. 2. Trip model
3 Experimental Results
To experimentally validate the efficiency of using the MO modelling approach in location estimation, the query results of three different techniques for estimating the MO's speed are compared. The three speed-estimation methods examined are the last-reported-speed, the average-reported-speed, and the MO-model-based most-probable-speed. The estimated speed is applied in the following formula to estimate the location of the MO (assuming the MO is moving in a straight line): location = last reported location + (estimated speed × time elapsed since the last report).
Three route estimation methods are selected. The straight-line method assumes that the MO continues to move along the same line made by the last two location reports. The trip-route-model method estimates the trip route at the beginning of the trip, and the trajectory is created according to the estimated
speeds along the route. The route-model-with-shifting method employs off-schedule trajectory updating, as described in Section 2.3. Each MO is randomly assigned a preference of either taking the shortest or the fastest route. Reporting intervals between 0.25 and 3 minutes, in 0.25-minute steps, are tested. Each experiment is performed five times and the average deviation per location report (in metres) is obtained.
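The three speed-estimation strategies being compared can be summarised in a few lines of code (a sketch, not the authors' implementation; the model-based estimator is abstracted as a lookup function passed in):

```python
# The three speed estimators compared in the experiments, plus the
# straight-line location update: location = last_location + speed * elapsed.

def last_reported_speed(reports):
    return reports[-1][1]                                   # reports: (time, speed) pairs

def average_reported_speed(reports, k=5):
    recent = [s for _, s in reports[-k:]]
    return sum(recent) / len(recent)

def model_based_speed(most_probable_speed, context):
    return most_probable_speed(context)                     # query the MO model

def estimate_location(last_location, speed, elapsed):
    return last_location + speed * elapsed                  # straight-line movement

reports = [(0, 50.0), (1, 46.0), (2, 30.0)]
print(estimate_location(1000.0, last_reported_speed(reports), 0.5))
print(estimate_location(1000.0, average_reported_speed(reports, k=3), 0.5))
```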
Fig. 3. (a) Average error at different reporting intervals using straight-line-with-last-reported-speed, model speed, and model speed with shifting, with a 0% probability of a route deviation. (b) Average error at different reporting intervals using straight-line-with-last-reported-speed, model speed, and model speed with shifting, with a 10% probability of a 10% route deviation. (Both panels plot the average deviation in metres against the reporting interval in minutes for the Straight Line, Model, and Model with Shift methods.)
From Figure 3, a number of observations can be made. The two model-based location estimation methods are considerably better than the straight-line-with-last-reported-speed method for reporting intervals of more than 30 seconds. The deviation of the straight-line-with-last-reported-speed method grows linearly as the reporting interval grows. On the other hand, the deviation of both model-based methods actually improves as the reporting interval grows. This is due to the fact that longer reporting intervals allow deviations between the reported and estimated speeds (i.e., the estimated speed being above/below the reported speed) to cancel each other out. The model-with-shifting method tends to perform better than the model alone for shorter reporting intervals; the two converge at a reporting interval of about 1.5 to 1.75 minutes. This reveals that the proposed shifting approach does improve accuracy, as the estimated trajectories are adjusted to reflect the received location reports. The effect diminishes as the reporting interval grows, which means that fewer such shifts are performed.
4 Conclusion
The use of user movement modelling in location-tracking applications is presented, and the idea of MO modelling is introduced for reducing the uncertainty
about the MO’s locations. The building of the MO speed models, by employing the Bayesian Networks, is explained with a discussion of some variables affecting the design of a typical MO model-based systems. A trip route modelling approach is developed to capture the most commonly taken route between two locations. The estimated trajectory is adjusted to reflect the actual locations that are reported. Experimental evidence is produced to confirm that both speed and trip route modelling do help in increasing the accuracy of the location estimation compared with the traditional approach (last reported speed and straight line direction estimation). The same is also shown to be true when the speed is estimated by averaging of the last k reported speeds (rather than considering the last reported speed).
References
1. Wolfson, O., Sistla, A.P., Xu, B., Zhou, J., Chamberlain, S., Yesha, Y., Rishe, N.: Tracking moving objects using database technology in DOMINO. In: Tsur, S. (ed.) NGITS 1999. LNCS, vol. 1649, pp. 112–119. Springer, Heidelberg (1999)
2. Ding, H., Trajcevski, G., Scheuermann, P.: Efficient maintenance of continuous queries for trajectories. Geoinformatica 12(3), 255–288 (2008)
3. Moreira, J., Ribeiro, C., Abdessalem, T.: Query operations for moving objects database systems. In: Proceedings of the 8th International Symposium on Advances in Geographic Information Systems (ACMGIS 2000), Washington, D.C., USA, November 6-11, pp. 108–114. ACM Press, New York (2000)
4. Sistla, A.P., Wolfson, O., Chamberlain, S., Dao, S.: Querying the uncertain position of moving objects. In: Etzion, O., Jajodia, S., Sripada, S.M. (eds.) Proceedings of Dagstuhl Seminar on Temporal Databases: Research and Practice, Dagstuhl Castle, Germany, June 23-27, pp. 310–337. Springer, Heidelberg (1997)
5. Patterson, D.J., Fox, D., Kautz, H., Fishkin, K., Perkowitz, M., Philipose, M.: Contextual computer support for human activity. In: Spring Symposium on Interaction Between Humans and Autonomous Systems over Extended Operation (AAAI 2004), Stanford, CA, USA (2004), http://www.aaai.org/Library/Symposia/Spring/2004/ss04-03-013.php
6. Darwiche, A.: Modeling and Reasoning with Bayesian Networks, 1st edn. Cambridge University Press, New York (2009)
7. Cooper, G.F.: Probabilistic inference using belief networks is NP-hard. Technical report SMI-87-0195, Knowledge Systems Laboratory, Stanford University, Stanford, CA, USA (1987)
8. Pearl, J.: Fusion, propagation, and structuring in belief networks. Artificial Intelligence 29(3), 241–288 (1986)
9. Heckerman, D.: A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research, Microsoft Corp., Seattle, Washington (1995)
10. Neapolitan, R.E.: Learning Bayesian Networks. Prentice Hall, Englewood Cliffs (2003)
Unsupervised Relation Extraction Using Dependency Trees for Automatic Generation of Multiple-Choice Questions
Naveed Afzal¹, Ruslan Mitkov¹, and Atefeh Farzindar²
¹ Research Institute for Information and Language Processing (RIILP), University of Wolverhampton, Wolverhampton, UK
{N.Afzal,R.Mitkov}@wlv.ac.uk
² NLP Technologies Inc., 1255 University Street, Suite 1212, Montreal (QC), Canada, H3B 3W9
[email protected]
Abstract. In this paper, we investigate an unsupervised approach to Relation Extraction to be applied in the context of automatic generation of multiplechoice questions (MCQs). MCQs are a popular large-scale assessment tool making it much easier for test-takers to take tests and for examiners to interpret their results. Our approach to the problem aims to identify the most important semantic relations in a document without assigning explicit labels to them in order to ensure broad coverage, unrestricted to predefined types of relations. In this paper, we present an approach to learn semantic relations between named entities by employing a dependency tree model. Our findings indicate that the presented approach is capable of achieving high precision rates, which are much more important than recall in automatic generation of MCQs, and its enhancement with linguistic knowledge helps to produce significantly better patterns. The intended application for the method is an e-Learning system for automatic assessment of students’ comprehension of training texts; however it can also be applied to other NLP scenarios, where it is necessary to recognise the most important semantic relations without any prior knowledge as to their types. Keywords: E-Learning, Information Extraction, Relation Extraction, Biomedical domain, Dependency Tree, MCQ generation.
1 Introduction
Multiple-choice questions (MCQs), also known as multiple-choice tests, are a form of objective assessment in which a user selects one answer from a set of alternative choices for a given question. MCQs are straightforward to conduct and instantaneously provide an effective measure of test-takers' performance, feeding test results back to the learner. In many disciplines instructors use MCQs as a preferred assessment tool, and it is estimated that 45%–67% of student assessments utilise MCQs [2]. The fast development of e-Learning technologies has in turn stimulated methods for the automatic generation of MCQs, and today this has become an actively developing topic in
application-oriented NLP research. The work done in the area of automatic generation of MCQs does not have a long history [e.g., 18, 19, 28, 3 and 10], and most of the aforementioned approaches rely on the syntactic structure of a sentence. We present a new approach to MCQ generation in which, in order to automatically generate MCQs, we first identify important concepts and the relationships between them in the input texts. To achieve this, we study unsupervised Information Extraction methods with the purpose of discovering the most significant concepts and relations in the domain texts, without any prior knowledge of their types or their exemplar instances (seeds).
Information Extraction (IE) is an important problem in many information access applications. The goal is to identify instances of specific semantic relations between named entities of interest in the text. Named Entities (NE's) are generally noun phrases in unstructured text, e.g., names of persons, posts, locations and organisations, while relationships between two or more NE's are described in a pre-defined way, e.g., "interact with" is a relationship between two biological objects (proteins). Dependency trees are regarded as a suitable basis for semantic pattern acquisition as they abstract away from the surface structure to represent relations between elements (entities) of a sentence. Semantic patterns represent semantic relations between elements of sentences. A pattern is defined as a path in the dependency tree passing through zero or more intermediate nodes [27]. An insight into the usefulness of dependency patterns was provided by [26], who revealed that dependency parsers have the advantage of generating analyses which abstract away from the surface realisation of text to a greater extent than phrase structure grammars tend to, resulting in semantic information being more accessible in the representation of the text, which can be useful for IE.
The main advantage of our approach is that it can cover a potentially unrestricted range of semantic relations, while most supervised and semi-supervised approaches can learn to extract only those relations that have been exemplified in annotated text or seed patterns. Our assumption for Relation Extraction (RE) is that relations hold between NE's stated in the same sentence and that the presence or absence of a relation is independent of the text preceding or succeeding the sentence. Moreover, our approach is suitable in situations where a lot of unannotated text is available, as it does not require manually annotated text or seeds. These properties of the method can be useful, specifically, in such applications as MCQ generation [18, 19] or a pre-emptive approach in which viable IE patterns are created in advance without human intervention [23, 24].
2 Related Work
There is a large body of research dedicated to the problem of extracting relations from texts of various domains. Most previous work focused on supervised methods and tried to both extract relations and assign labels describing their semantic types. As a rule, these approaches required a manually annotated corpus, which is very laborious and time-consuming to produce. Semi-supervised and unsupervised approaches relied on seed patterns and/or examples of specific types of relations [1, 25]. An unsupervised approach based on
clustering of candidate patterns for the discovery of the most important relation types among NE's from a newspaper domain was presented by [9]. In the biomedical domain, most approaches were supervised and relied on regular expressions to learn patterns [5], while semi-supervised approaches exploited pre-defined seed patterns and cue words [11, 17].
Several approaches in IE have relied on dependency trees in order to extract patterns for the automatic acquisition of IE systems [27, 25 and 7]. Apart from IE, [15] used dependency trees in order to infer rules for question answering, while [29] made use of dependency trees for paraphrase identification. Moreover, dependency parsers have recently been used in systems which identify protein interactions in biomedical texts [13, 6]. In dependency parsing, the main objective is to describe the syntactic analysis of a sentence using dependency links, which show the head-modifier relations between words.
All the IE approaches that have relied on dependency trees use different pattern models based on particular parts of the dependency analysis. The motive behind all of these models is to extract the necessary information from text without being overly complex. All of the pattern models make use of semantic patterns based on dependency trees for the identification of items of interest in text. These models vary in terms of their complexity, expressivity and performance in an extraction scenario.
3 Our Approach
Our approach is based on the Linked Chain Pattern Model presented by [7], which combines pairs of chains in a dependency tree that share a common verb root but no direct descendants. In our approach, we treat every NE as a chain in a dependency tree if it is less than 5 dependencies away from the verb root and the words linking the NE to the verb root belong to the category of content words (verbs, nouns, adverbs and adjectives) or are prepositions. We consider only those chains in the dependency tree of a sentence which contain NE's, which is much more efficient than the subtree model of [27], where all subtrees containing verbs are taken into account. This allows us to extract more meaningful patterns from the dependency tree of a sentence. We extract all NE chains which follow the aforementioned rule from a sentence and combine them together; a simplified sketch of this step is given below. Figure 1 shows the whole system architecture.
Following the system architecture, Section 4 elaborates the NER process; Section 5 explains the process of candidate pattern extraction, for which we use the GENIA corpus; Section 6 describes various information-theoretic measures and statistical tests for ranking patterns according to their association with the domain corpus; the subsequent sections discuss the evaluation procedures (rank-thresholding and score-thresholding), with the GENIA EVENT Annotation corpus used for evaluation, and the experimental results obtained via the various pattern ranking methods.
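A simplified sketch (ours, not the authors' implementation) of the chain-extraction rule described above: starting from each named entity, follow dependency links towards the verb root, keep the chain if it is shorter than five dependencies and passes only through content words or prepositions, and pair the chains that share the root. The part-of-speech labels used here are assumptions for illustration.

```python
from itertools import combinations

# Each token: (index, word, pos, head_index); the verb root has head_index = None.
# NE tokens carry their semantic class (e.g. "PROTEIN") as the word.

CONTENT_POS = {"V", "N", "ADV", "A", "PREP"}   # content words + prepositions

def chain_to_root(tokens, i, max_len=5):
    """Path from token i up to the verb root, or None if it is too long
    or passes through a non-content word."""
    chain = []
    while tokens[i][3] is not None and len(chain) < max_len:
        if tokens[i][2] not in CONTENT_POS:
            return None
        chain.append(i)
        i = tokens[i][3]
    return chain + [i] if tokens[i][3] is None else None

def linked_chain_patterns(tokens, ne_indices):
    chains = {i: chain_to_root(tokens, i) for i in ne_indices}
    chains = {i: c for i, c in chains.items() if c}
    # combine pairs of NE chains sharing the verb root
    return [sorted(set(chains[a] + chains[b])) for a, b in combinations(chains, 2)]

# "PROTEIN activates PROTEIN PROTEIN in CELL_TYPE" (indices follow word order)
tokens = [(0, "PROTEIN", "N", 1), (1, "activate", "V", None), (2, "PROTEIN", "N", 1),
          (3, "PROTEIN", "N", 1), (4, "in", "PREP", 1), (5, "CELL_TYPE", "N", 4)]
print(linked_chain_patterns(tokens, [0, 2, 3, 5]))   # combined chains as token-index sets
```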
Fig. 1. System architecture: unannotated corpus → Named Entity Recognition → Extraction of Candidate Patterns → Patterns Ranking → Evaluation → Semantic Relations
4 Named Entity Recognition (NER)
NER is an integral part of any IE system, as it identifies the NEs present in a text. Many NER tools have been developed for various domains, as a great deal of research is being done in the area of NER across languages, domains and textual genres. In our work, we used biomedical data, as biomedical NER is generally considered to be more difficult than NER in other domains such as newswire text. There is a huge number of NEs in the biomedical domain and new ones are added constantly [32], which means that neither dictionary-based nor training-data-based approaches will be sufficiently comprehensive for the NER task. The volume of published biomedical research has been expanding at a rapid rate. Due to the syntactic and semantic complexity of the biomedical domain, many IE systems have utilised tools (e.g., part-of-speech taggers, NER, parsers) specifically designed and developed for the biomedical domain [21]. Moreover, [8] presented a report investigating the suitability of current NLP resources for syntactic and semantic analysis in the biomedical domain. The GENIA NER1 [31, 32] is a tool designed specifically for biomedical texts; the NE tagger is designed to recognise mainly the following NEs: protein, DNA, RNA, cell_type and cell_line. Table 1 shows the performance of the GENIA NER.

Table 1. GENIA NER performance

  Entity Type   Precision   Recall   F-score
  Protein         65.82      81.41    72.79
  DNA             65.64      66.76    66.20
  RNA             60.45      68.64    64.29
  Cell Type       56.12      59.60    57.81
  Cell Line       78.51      70.54    74.31
  Overall         67.45      75.78    71.37
1 http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
5 Extraction of Candidate Patterns
Our general approach to learning dependency tree-based patterns consists of two main stages: (i) the construction of potential patterns from an unannotated domain corpus and (ii) their relevance ranking. After NER, the next step is the construction of candidate patterns. We explain the whole process of candidate pattern extraction from the dependency trees with the help of the example shown below:
Fibrinogen activates NF-kappaB transcription factors in mononuclear phagocytes.
After NER, the aforementioned sentence is transformed into the following:
<protein> Fibrinogen </protein> activates <protein> NF-kappaB </protein> <protein> transcription factors </protein> in <cell_type> mononuclear phagocytes </cell_type>.
Once the NEs are recognised in the domain corpus by the GENIA tagger, we replace each NE with its semantic class, so the aforementioned sentence is transformed into the following sentence:
PROTEIN activates PROTEIN PROTEIN in CELL.
The transformed sentences are then parsed using the Machinese Syntax2 parser [30]. The Machinese Syntax parser uses a functional dependency grammar. The analyses produced by the parser are encoded to make the most of the information they contain and to ensure consistent structures from which patterns can be extracted. Figure 2 shows the dependency tree for the aforementioned adapted sentence:
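A minimal sketch of the replacement step just described, assuming the tagger output marks entities with XML-style tags such as <protein> ... </protein> (the exact GENIA tagger output format may differ):

```python
import re

def replace_entities(tagged_sentence):
    """Collapse each tagged NE into its semantic class label (upper-cased tag name)."""
    return re.sub(r"<(\w+)>\s*(.*?)\s*</\1>",
                  lambda m: m.group(1).upper(),
                  tagged_sentence)

# e.g. "<protein>Fibrinogen</protein> activates ..." -> "PROTEIN activates ..."
```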
Fig. 2. Example of a dependency tree
After the encoding process, the patterns are extracted from the dependency trees using the methodology described in Section 3. From Figure 2, the following patterns are extracted:
"PROTEIN" <W ID="1" func="+FMAINV" Dep="none">"activate"
"PROTEIN" "PROTEIN" <W ID="0" func="+FMAINV" Dep="none">"activate"
"PROTEIN" "PROTEIN" 2
2 http://www.connexor.com/software/syntax/
<W ID="0" func="+FMAINV" Dep="none">"activate"
"PROTEIN" <W ID="2" func="PREP" Dep="0">"in"
"CELL_TYPE" Here
Here, the named-entity tag represents the semantic class, while the <W> tag represents a lexical word; ID is the word id, func the syntactic function of the word, and Dep the id of the word on which this word depends in the dependency tree. The extracted patterns, along with their frequencies, are then stored in a database. Using a stop-word list, we filtered out dependency-based patterns containing only stop words. Table 2 shows examples of dependency-based patterns along with their frequencies.

Table 2. Examples of dependency-based patterns along with their frequencies

  Pattern                                                               Frequency
  "DNA" <W ID="1" func="+FMAINV" Dep="none">"contain" "DNA"                34
  "PROTEIN" <W ID="1" func="+FMAINV" Dep="none">"activate" "PROTEIN"       32
  "PROTEIN" <W ID="1" func="+FMAINV" Dep="none">"contain" "PROTEIN"        19
  "PROTEIN" "PROTEIN" <W ID="2" func="+FMAINV" Dep="none">"induce"         19
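A minimal sketch of this bookkeeping step: counting extracted pattern strings and discarding patterns whose lexical words are all stop words. Treating the quoted word after each <W ...> element as the pattern's lexical content is an assumption made for illustration.

```python
import re
from collections import Counter

def count_and_filter_patterns(patterns, stop_words):
    """patterns: iterable of extracted pattern strings; returns {pattern: frequency},
    discarding patterns whose lexical words are all stop words."""
    def lexical_words(pattern):
        # assumption: each <W ...>"word" element carries one lexical word
        return [w.lower() for w in re.findall(r'>"([^"]+)"', pattern)]
    counts = Counter(patterns)
    return {p: f for p, f in counts.items()
            if any(w not in stop_words for w in lexical_words(p))}
```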
6 Pattern Ranking
After candidate patterns have been constructed, the next step is to rank the patterns based on their significance in the domain corpus. The ranking methods we use require a general corpus that serves as a source of examples of pattern use in domain-independent texts. To extract candidates from the general corpus, we treated every noun as a potential NE holder, and the candidate construction procedure described above was applied to find potential patterns in the general corpus. In order to score candidate patterns for domain relevance, we measure the strength of association of a pattern with the domain corpus as opposed to the general corpus. The patterns are scored using the following methods for measuring the association between a pattern and the domain corpus: Information Gain (IG), Information Gain Ratio (IGR), Mutual Information (MI), Normalised Mutual Information (NMI)3, Log-likelihood (LL) and
3 Mutual Information has a well-known problem of being biased towards infrequent events. To tackle this problem, we normalised the MI score by a discounting factor, following the formula proposed in Lin and Pantel (2001).
Chi-Square (CHI). These association measures were included in the study as they have different theoretical principles behind them: IG, IGR, MI and NMI are information-theoretic concepts while LL and CHI are statistical tests of association. Information Gain measures the amount of information obtained about domain specialisation of corpus c, given that pattern p is found in it.
IG(p, c) = \sum_{d \in \{c, c'\}} \sum_{g \in \{p, p'\}} P(g, d) \log \frac{P(g, d)}{P(g)\,P(d)}
where p is a candidate pattern, c the domain corpus, p' a pattern other than p, c' the general corpus, P(c) the probability of c in the "overall" corpus {c, c'}, and P(p) the probability of p in the overall corpus. Information Gain Ratio aims to overcome one disadvantage of IG, namely that IG grows not only with the increase of the dependence between p and c, but also with the increase of the entropy of p. IGR removes this factor by normalising IG by the entropy of the patterns in the corpora:
IGR(p, c) = \frac{IG(p, c)}{-\sum_{g \in \{p, p'\}} P(g) \log P(g)}
Pointwise Mutual Information between corpus c and pattern p measures how much information the presence of p contains about c, and vice versa:
MI(p, c) = \log \frac{P(p, c)}{P(p)\,P(c)}
Chi-Square and Log-likelihood are statistical tests which work with frequencies and rank-order scales, both calculated from a contingency table with the observed and expected frequencies of occurrence of a pattern in the domain and general corpora. Chi-Square is calculated as follows:
\chi^2(p, c) = \sum_{d \in \{c, c'\}} \frac{(O_d - E_d)^2}{E_d}
where O_d is the observed frequency of p in the domain and general corpus, respectively, and E_d is its expected frequency in the two corpora. Log-likelihood is calculated according to the following formula:
LL(p, c) = 2\left( O_1 \log \frac{O_1}{E_1} + O_2 \log \frac{O_2}{E_2} \right)
where O1 and O2 are observed frequencies of p in the domain and general corpus respectively, while E1 and E2 are its expected frequency values in the two corpora. In addition to these six measures, we introduce a meta-ranking method that combines the scores produced by several individual association measures, in order to
leverage agreement between different association measures and downplay idiosyncrasies of individual ones. Because the association functions range over different values (for example, IGR ranges between 0 and 1, and MI between −∞ and +∞), we first normalise the scores assigned by each method4:
s_{norm}(p) = \frac{s(p)}{\max_{q \in P} s(q)}
where s(p) is the non-normalised score for pattern p from the candidate pattern set P. The normalised scores are then averaged across the different methods and used to produce a meta-ranking of the candidate patterns. Given the ranking of candidate patterns produced by a scoring method, a certain number of highest-ranking patterns can be selected for evaluation. We studied two different ways of selecting these patterns: (i) one based on setting a threshold on the association score below which the candidate patterns are discarded (henceforth, the score-thresholding method) and (ii) one that selects a fixed number of top-ranking patterns (henceforth, the rank-thresholding method). During the evaluation, we experimented with different rank- and score-thresholding values.
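The following sketch computes the scores above for a single pattern from its frequencies in the two corpora. The counting scheme (using the total pattern counts of each corpus to estimate the probabilities) and the epsilon smoothing of zero counts are assumptions made to keep the example self-contained.

```python
from math import log

def association_scores(f_dom, f_gen, n_dom, n_gen):
    """Scores for one pattern: f_dom / f_gen are its frequencies in the domain and
    general corpora, n_dom / n_gen the total pattern counts of the two corpora."""
    eps = 1e-12
    n = n_dom + n_gen
    p_pc = max(f_dom, eps) / n           # P(p, c): pattern in the domain corpus
    p_p  = max(f_dom + f_gen, eps) / n   # P(p)
    p_c  = n_dom / n                     # P(c)

    # IG(p, c) = sum over g in {p, p'}, d in {c, c'} of P(g, d) log(P(g, d) / (P(g) P(d)))
    cells = [(p_p,     p_c,     max(f_dom, eps) / n),
             (p_p,     1 - p_c, max(f_gen, eps) / n),
             (1 - p_p, p_c,     max(n_dom - f_dom, eps) / n),
             (1 - p_p, 1 - p_c, max(n_gen - f_gen, eps) / n)]
    ig = sum(pgd * log(pgd / max(pg * pd, eps)) for pg, pd, pgd in cells)

    # IGR normalises IG by the entropy of the pattern distribution
    h_p = -(p_p * log(p_p) + max(1 - p_p, eps) * log(max(1 - p_p, eps)))
    igr = ig / max(h_p, eps)

    mi = log(p_pc / max(p_p * p_c, eps))  # pointwise mutual information

    # LL and chi-square from observed (O) and expected (E) frequencies in the two corpora
    o1, o2 = f_dom, f_gen
    e1 = n_dom * (f_dom + f_gen) / n
    e2 = n_gen * (f_dom + f_gen) / n
    ll = 2 * (o1 * log(max(o1, eps) / max(e1, eps)) + o2 * log(max(o2, eps) / max(e2, eps)))
    chi = (o1 - e1) ** 2 / max(e1, eps) + (o2 - e2) ** 2 / max(e2, eps)
    return {"IG": ig, "IGR": igr, "MI": mi, "LL": ll, "CHI": chi}

def normalise(method_scores):
    """Divide one method's scores (pattern -> score) by their maximum, as in the
    meta-ranking step; averaging normalised scores across methods gives the meta-rank."""
    top = max(method_scores.values())
    return {p: s / top for p, s in method_scores.items()} if top else method_scores
```

For rank-thresholding one would then keep the top-k patterns of a ranked list; for score-thresholding, those whose (normalised) score exceeds a fixed threshold.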
7 Evaluation
Biomedical NEs are expressed in various linguistic forms such as abbreviations, plurals, compounds, coordination, cascades, acronyms and apposition. Sentences in such texts are syntactically complex, and the subsequent relation extraction phase depends upon the correct identification of the named entities and the correct analysis of the linguistic constructions expressing relations between them [34]. We used the GENIA corpus as the domain corpus, while the British National Corpus (BNC) was used as the general corpus. The GENIA corpus consists of 2,000 abstracts extracted from MEDLINE, containing 18,477 sentences. In the evaluation phase, the GENIA Event Annotation corpus5 [14] is used; it consists of 9,372 sentences. The numbers of dependency patterns extracted from the corpora are 5,066 (GENIA), 419,274 (BNC) and 3,031 (GENIA Event), respectively. In order to evaluate the quality of the extracted patterns, we examined their ability to capture pairs of related NEs in the manually annotated evaluation corpus, without recognising the type of semantic relation. Selecting a certain number of best-ranking patterns, we measure precision, recall and F-score. To test the statistical significance of differences in the results of different methods and configurations, we used a paired t-test, having randomly divided the evaluation corpus into 20 subsets of equal size; each subset contains 461 sentences on average.
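A small sketch of the evaluation bookkeeping, assuming per-subset counts of correctly captured, spurious and missed NE pairs are available; scipy's paired t-test is used for the significance test mentioned above.

```python
from scipy.stats import ttest_rel

def prf(tp, fp, fn):
    """Precision, recall and F-score from counts of captured, spurious and missed NE pairs."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def significantly_different(per_subset_a, per_subset_b, alpha=0.01):
    """Paired t-test over the 20 equal-sized evaluation subsets described above."""
    t_stat, p_value = ttest_rel(per_subset_a, per_subset_b)
    return p_value < alpha, t_stat, p_value
```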
8 Results
Table 3 shows the precision scores for the rank-thresholding method.
4 Patterns with negative MI scores are discarded.
5 http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/home/wiki.cgi?page=Event+Annotation
Table 3. Precision scores of the rank-thresholding method (dependency tree patterns)

  Ranking Method   Top 100   Top 200   Top 300
  IG                0.770     0.800     0.780
  IGR               0.770     0.800     0.787
  MI                0.560     0.560     0.540
  NMI               0.940     0.815     0.707
  LL                0.770     0.800     0.790
  CHI               0.960     0.815     0.710
  Meta              0.900     0.830     0.740
Table 4 shows the results of the score-thresholding method. The left side of the table reports precision (P), recall (R) and F-score for threshold values at which high F-scores are achieved, while the right side reports the threshold values at which high precision scores are achieved.

Table 4. Results of the score-thresholding method (dependency tree patterns)

High F-score region:

  Threshold       Method   P      R      F-score
  score > 0.01    IG       0.748  0.107  0.187
                  IGR      0.748  0.107  0.187
                  MI       0.567  0.816  0.669
                  NMI      0.566  0.767  0.651
                  LL       0.748  0.107  0.187
                  CHI      0.577  0.529  0.552
                  Meta     0.571  0.643  0.605
  score > 0.02    IG       0.796  0.051  0.097
                  IGR      0.796  0.051  0.097
                  MI       0.566  0.744  0.643
                  NMI      0.570  0.706  0.631
                  LL       0.796  0.051  0.097
                  CHI      0.591  0.243  0.344
                  Meta     0.569  0.547  0.558
  score > 0.03    IG       0.785  0.035  0.067
                  IGR      0.785  0.035  0.067
                  MI       0.566  0.711  0.631
                  NMI      0.568  0.663  0.612
                  LL       0.785  0.035  0.067
                  CHI      0.613  0.146  0.236
                  Meta     0.577  0.355  0.439

High precision region:

  Threshold       Method   P      R      F-score
  score > 0.09    IG       0.733  0.007  0.014
                  IGR      0.733  0.007  0.014
                  MI       0.563  0.593  0.578
                  NMI      0.572  0.507  0.538
                  LL       0.733  0.007  0.014
                  CHI      0.900  0.036  0.069
                  Meta     0.860  0.048  0.092
  score > 0.1     IG       0.704  0.006  0.012
                  IGR      0.704  0.006  0.012
                  MI       0.564  0.588  0.576
                  NMI      0.569  0.483  0.523
                  LL       0.704  0.006  0.012
                  CHI      0.898  0.035  0.067
                  Meta     0.856  0.047  0.089
  score > 0.2     IG       0.571  0.003  0.005
                  IGR      0.571  0.003  0.005
                  MI       0.566  0.473  0.515
                  NMI      0.600  0.133  0.218
                  LL       0.571  0.003  0.005
                  CHI      1.000  0.015  0.029
                  Meta     1.000  0.013  0.025
In both tables (3 and 4), the results of the best-performing ranking method in terms of precision are shown in bold font. Although our main focus is on achieving high precision scores, it is clear from Table 4 that our method achieves low recall. One reason for the low recall is the small size of the GENIA corpus; this could be countered by using a larger corpus, which would produce a much greater number of patterns and increase recall. CHI and NMI are the best-performing ranking methods in terms of precision in both the rank-thresholding and score-thresholding methods, while IG, IGR and LL achieve quite similar results. Moreover, in Table 4 we are able to achieve 100% precision. Figure 3 shows the precision scores for the best-performing ranking methods (CHI and NMI) in the score-thresholding method.
Fig. 3. Precision scores of the best-performing ranking methods (CHI and NMI) for the score-thresholding method at score thresholds from >0.08 to >0.5
The literature on the topic suggests that IGR performs better than IG [22, 16]; we found that in general there is no statistically significant difference between IG and IGR, or between IGR and LL. In both sets of experiments, obviously due to the aforementioned problem, MI performs quite poorly; the normalised version of MI helps to alleviate this problem. Moreover, there exists a statistically significant difference (p < 0.01) between NMI and the other ranking methods. Contrary to expectations, the meta-ranking method did not improve on the best individual ranking method. We also found that the score-thresholding method produces better results than rank-thresholding, as we are able to achieve up to 100% precision with the former technique. High precision is quite important in applications such as MCQ generation. In score-thresholding, it is possible to optimise for high precision (up to 100%), though recall and F-score are generally quite low. MCQ applications rely on the production of good questions rather than the production of all possible questions, so high precision plays a vital role in such applications.
9 Future Work
In the future, we plan to employ the RE method for automatic MCQ generation, where it will be used to find relations and NEs in educational texts that are important for testing students' familiarity with key facts contained in the texts. In order to achieve this, we need an IE method that has high precision and at the same time works with unrestricted semantic types of relations (i.e., without reliance on seeds); recall is of secondary importance to precision. The distractors will be produced using distributional similarity measures.
10 Conclusion
In this paper, we have presented an unsupervised approach for RE from dependency trees, intended to be deployed in an e-Learning system for the automatic generation of MCQs by employing semantic patterns. We explored different ranking methods and found that the CHI and NMI ranking methods obtained higher precision than the other ranking methods. We employed two techniques, rank-thresholding and score-thresholding, and found that score-thresholding performs better.
References 1. Agichtein, E., Gravano, L.: Snowball: Extracting Relations from Large Plaintext Collections. In: Proc. of the 5th ACM International Conference on Digital Libraries (2000) 2. Becker, W.E., Watts, M.: Teaching methods in U.S. and undergraduate economics courses. Journal of Economics Education 32(3), 269–279 (2001) 3. Brown, J., Frishkoff, G., Eskenazi, M.: Automatic question generation for vocabulary assessment. In: Proc. of HLT/EMNLP, Vancouver, B.C. (2005) 4. Cohen, A.M., Hersh, W.R.: A Survey of Current Work in Biomedical Text Mining. Briefings in Bioinformatics, 57–71 (2005) 5. Corney, D.P., Jones, D., Buxton, B., Langdon, W.: BioRAT: Extracting Biological Information from Full-length Papers. Bioinformatics, 3206–3213 (2004) 6. Erkan, G., Ozgur, A., Radev, D.R.: Semi-supervised classification for extracting protein interaction sentences using dependency parsing. In: Proc. of CoNLL-EMNLP (2007) 7. Greenwood, M., Stevenson, M., Guo, Y., Harkema, H., Roberts, A.: Automatically Acquiring a Linguistically Motivated Genic Interaction Extraction System. In: Proc. of the 4th Learning Language in Logic Workshop, Bonn, Germany (2005) 8. Grover, C., Lascarides, A., Lapata, M.: A Comparison of Parsing Technologies for the Biomedical Domain. Natural Language Engineering 11(1), 27–65 (2005) 9. Hasegawa, T., Sekine, S., Grishman, R.: Discovering relations among named entities from large corpora. In: Proc. of ACL 2004 (2004) 10. Hoshino, A., Nakagawa, H.: A Real-time Multiple-choice Question Generation for Language Testing – A Preliminary Study. In: Proc. of the 43rd ACL 2005 2nd Workshop on Building Educational Applications Using Natural Language Processing, Ann Arbor, U.S., pp. 17–20 (2005) 11. Huang, M., Zhu, X., Payan, G.D., Qu, K., Li, M.: Discovering patterns to extract proteinprotein interactions from full biomedical texts. Bioinformatics, 3604–3612 (2004) 12. Jurafsky, D., Martin, J.H.: Speech and Language Processing, 2nd edn. Prentice Hall, Englewood Cliffs (2008) 13. Katrenko, S., Adriaans, P.: Learning relations from biomedical corpora using dependency trees. In: Tuyls, K., Westra, R.L., Saeys, Y., Nowé, A. (eds.) KDECB 2006. LNCS (LNBI), vol. 4366, pp. 61–80. Springer, Heidelberg (2007) 14. Kim, J.-D., Ohta, T., Tsujii, J.: Corpus Annotation for Mining Biomedical Events from Literature, BMC Bioinformatics (2008) 15. Lin, D., Pantel, P.: Concept Discovery from Text. In: Proc. of Conference on CL 2002, Taipei, Taiwan, pp. 577–583 (2002) 16. Manning, C., Schütze, H.: Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge (1999)
17. Martin, E.P., Bremer, E., Guerin, G., DeSesa, M.-C., Jouve, O.: Analysis of Protein/Protein Interactions through Biomedical Literature: Text Mining of Abstracts vs. Text Mining of Full Text Articles, pp. 96–108. Springer, Berlin (2004) 18. Mitkov, R., An, L.A.: Computer-aided generation of multiple-choice tests. In: Proc. of the HLT/NAACL 2003 Workshop on Building educational applications using Natural Language Processing, Edmonton, Canada, pp. 17–22 (2003) 19. Mitkov, R., Ha, L.A., Karamanis, N.: A computer-aided environment for generating multiple-choice test items. Natural Language Engineering 12(2), 177–194 (2006) 20. Ono, T., Hishigaki, H., Tanigami, A., Takagi, T.: Automated Extraction of Information on Protein–Protein Interactions from the Biological Literature. Bioinformatics, 155–161 (2001) 21. Pustejovsky, J., Casta, J., Cochran, B., Kotecki, M.: Robust relational parsing over biomedical literature: Extracting inhibit relations. In: Proc. of the 7th Annual Pacific Symposium on Bio-computing (2002) 22. Quinlan, J.R.: Induction of decision trees. Machine Learning 1(1), 81–106 (1986) 23. Sekine, S.: On-Demand Information Extraction. In: Proc. of the COLING/ACL (2006) 24. Shinyama, Y., Sekine, S.: Preemptive Information Extraction using Unrestricted Relation Discovery. In: Proc. of the HLT Conference of the North American Chapter of the ACL, New York, pp. 304–311 (2006) 25. Stevenson, M., Greenwood, M.: A Semantic Approach to IE Pattern Induction. In: Proc. of ACL 2005, pp. 379–386 (2005) 26. Stevenson, M., Greenwood, M.: Dependency Pattern Models for Information Extraction. Research on Language and Computation (2009) 27. Sudo, K., Sekine, S., Grishman, R.: An Improved Extraction Pattern Representation Model for Automatic IE Pattern Acquisition. In: Proc. of the 41st Annual Meeting of ACL 2003, Sapporo, Japan, pp. 224–231 (2003) 28. Sumita, E., Sugaya, F., Yamamoto, S.: Measuring non-native speakers’ proficiency of English using a test with automatically-generated fill-in-the-blank questions. In: Proc. of the 2nd Workshop on Building Educational Applications using NLP, pp. 61–68 (2005) 29. Szpektor, I., Tanev, H., Dagan, I., Coppola, B.: Scaling Web-based acquisition of Entailment Relations. In: Proc. of EMNLP 2004, Barcelona, Spain (2004) 30. Tapanainen, P., Järvinen, T.: A Non-Projective Dependency Parser. In: Proc. of the 5th Conference on Applied Natural Language Processing, Washington, pp. 64–74 (1997) 31. Tsuruoka, Y., Tateishi, Y., Kim, J.-D., Ohta, T., McNaught, J., Ananiadou, S., Tsujii, J.: Developing a Robust Part-of-Speech Tagger for Biomedical Text. In: Bozanis, P., Houstis, E.N. (eds.) PCI 2005. LNCS, vol. 3746, pp. 382–392. Springer, Heidelberg (2005) 32. Tsuruoka, Y., Tsujii, J.: Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data. In: Proc. of HLT/EMNLP, pp. 467–474 (2005) 33. Wilbur, J., Smith, L., Tanabe, T.: BioCreative 2. Gene Mention Task. In: Proc. of the 2nd Bio-Creative Challenge Workshop, pp. 7–16 (2007) 34. Zhou, G., Su, J., Shen, D., Tan, C.: Recognizing Name in Biomedical Texts: A Machine Learning Approach. Bioinformatics, 1178–1190 (2004)
An Improved Satisfiable SAT Generator Based on Random Subgraph Isomorphism
Călin Anton
Grant MacEwan University, Edmonton, Alberta, Canada
Abstract. We introduce the Satisfiable Random High Degree Subgraph Isomorphism Generator (SRHD-SGI), a variation of the Satisfiable Random Subgraph Isomorphism Generator (SR-SGI). We use the direct encoding to translate SRHD-SGI instances into satisfiable SAT instances. We present empirical evidence that the new model preserves the main characteristics of SAT encoded SR-SGI: an easy-hard-easy pattern of evolution and exponential growth of empirical hardness. Our experiments indicate that SAT encoded SRHD-SGI instances are empirically harder than their SR-SGI counterparts. Therefore we conclude that SRHD-SGI is an improved generator of satisfiable SAT instances.
1 Introduction
Satisfiability (SAT), checking if a Boolean formula is satisfiable, is one of the most important problems in Artificial Intelligence. It has important practical applications such as the formal verification of software and hardware. Incomplete SAT solvers cannot prove unsatisfiability, but they may find a solution if one exists. New, challenging testbeds are needed to improve the performance of these solvers and to differentiate between them. Several models have been proposed for generating hard SAT instances, including some which are based on generating graphs [1,2], but there are only a few such models for generating satisfiable SAT instances [3,4]. Given a pair of graphs, the Subgraph Isomorphism Problem (SGI) asks if one graph is isomorphic to a subgraph of the other graph. It is an NP-complete problem with many applications in areas like pattern recognition, computer-aided design and bioinformatics. SAT encoded random Subgraph Isomorphism instances are known to be empirically hard, in terms of running time, for state-of-the-art SAT solvers [5,6]. SR-SGI [5] is a model for generating satisfiable SAT instances by converting randomly generated satisfiable instances of SGI. The model has the following features: a) it generates relatively hard satisfiable SAT instances; b) the empirical hardness of the instances exhibits an easy-hard-easy pattern when plotted against one of the model's parameters; c) the hardness of the instances at the hardness peak increases exponentially with the instance size. In this paper we introduce SRHD-SGI, a variation of SR-SGI which aims to produce harder instances by: reducing the number of possible solutions and
eliminating the tell-tales, which are indicators of an easy-to-find solution. These goals are achieved by generating a "flatter" subgraph. The main difference between SRHD-SGI and SR-SGI resides in the way the subgraph is generated: SR-SGI randomly selects the subgraph, while SRHD-SGI selects the subgraph induced by the highest degree vertices.
2 A Random Model for Generating Satisfiable SGI Instances
For integers n, m, and q such that 0 ≤ n ≤ m and 0 ≤ q ≤ $\binom{m}{2}$, and p ∈ [0, 1], a (m, q, n, p) satisfiable random high degree SGI (SRHD-SGI) instance consists of two graphs G and H and asks if G is isomorphic to a subgraph of H. H is a random graph with m vertices and q edges. G is obtained by the following steps:
1. Select the n highest degree vertices of H, breaking ties at random.
2. Make G′ the induced subgraph of H on the vertices selected in step 1.
3. Remove edges from G′ in decreasing order of the sum of the degrees of their adjacent vertices, breaking ties at random.
4. Make G a random isomorphic copy of G′ by randomly shuffling G′.
The SGI instance is simplified using PP, a preprocessing procedure introduced in [7]. SRHD-SGI is a variation of the SR-SGI model, which removes edges from the induced graph at random. The main difference between SRHD-SGI and SR-SGI resides in the way G is generated. Reducing the number of possible solutions1 is the reason for making G the subgraph induced by the highest degree vertices of H. If G is a dense graph, then it is conceivable that there will not be many subgraphs of H which are isomorphic to G, and thus there will not be many solutions. The high degree vertices of G may become tell-tales, "signposts that may help, at least statistically, to determine the hidden solution" [8], which negatively affect the instance hardness. Hiding these tell-tales is the main reason for our choice of edge removal method. It also reduces the variance in the degree sequence of G and, as such, reduces the local variation. Reducing variation in local structure has been applied to SAT [9], resulting in hard random instances. If p = 0, no edge is removed from G′ and thus G is the subgraph of H induced by the highest degree vertices. In this case it is very likely that the SGI instance has a single solution. Furthermore, the presence of the tell-tales should quickly guide the search toward the unique solution. As such, the instances generated for p = 0 are not expected to be very difficult. If p = 1, G has no edges and therefore it is isomorphic to any subgraph of H with n vertices and no edges. Based on these assumptions it is expected that the hardest instances of SRHD-SGI are obtained for values of p strictly between 0 and 1. A minimal sketch of this construction appears below.
1 It has been implied [4] that the number of solutions is negatively correlated with the difficulty of randomly generated satisfiable instances.
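A minimal, dependency-free sketch of the construction above. Two details are assumptions on my part: p is interpreted as the fraction of G′'s edges to remove, and the degree sums used to order edge removal are computed once in G′ rather than being updated after each removal.

```python
import itertools
import random

def srhd_sgi(m, q, n, p, seed=None):
    """Generate a (m, q, n, p) SRHD-SGI instance; returns (G, H) as edge lists."""
    rng = random.Random(seed)
    # H: random graph with m vertices and q edges
    H = rng.sample(list(itertools.combinations(range(m), 2)), q)
    deg = [0] * m
    for u, v in H:
        deg[u] += 1
        deg[v] += 1
    # step 1: the n highest degree vertices of H, ties broken at random
    kept = set(sorted(range(m), key=lambda v: (-deg[v], rng.random()))[:n])
    # step 2: G' is the subgraph of H induced by the selected vertices
    g_edges = [e for e in H if e[0] in kept and e[1] in kept]
    # step 3: remove edges in decreasing order of the sum of their endpoints' degrees
    gdeg = {v: 0 for v in kept}
    for u, v in g_edges:
        gdeg[u] += 1
        gdeg[v] += 1
    g_edges.sort(key=lambda e: (-(gdeg[e[0]] + gdeg[e[1]]), rng.random()))
    g_edges = g_edges[int(round(p * len(g_edges))):]
    # step 4: G is a random isomorphic copy of G' (shuffle the vertex labels)
    shuffled = list(kept)
    rng.shuffle(shuffled)
    relabel = dict(zip(sorted(kept), shuffled))
    G = [tuple(sorted((relabel[u], relabel[v]))) for u, v in g_edges]
    return G, H
```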
Fig. 1. Evolution of empirical hardness of SAT encoded SRHD-SGI with p (m=19, q=145, n=17): median time (ms) vs. p for SATzilla2009_C, SATzilla2009_I and SApperloTbase (left) and for clasp, novelty, picosat and precosat (right). Notice the different time scales.
3 Empirical Investigation of SAT Encoded SRHD-SGI
In this section we provide experimental evidence indicating that SRHD-SGI preserves the main characteristics of SR-SGI and produces empirically harder SAT encoded instances than SR-SGI. We used the following experimental framework: m was set to 16, 17, 18, 19, and 20; n varied from m−5 to m; q varied between $0.60\binom{m}{2}$ and $0.90\binom{m}{2}$ in increments of $0.05\binom{m}{2}$; p varied between 0% and 50% in increments of 5%. We used a cutoff time limit of 900 seconds. For each combination of values for m, q, n and p, 100 test samples were generated. The running time (in milliseconds) was used to estimate the empirical hardness of the instances. For performing the experiments we chose solvers2 which won the gold and silver medals at the last SAT competition [10] in the SAT and SAT+UNSAT categories of the Application and Crafted tracks: clasp, precosat, SAperloT, SATzilla2009_I and SATzilla2009_C. To simplify the comparison with SR-SGI we added to the solver pool picosat and gnovelty+, which were used in the SR-SGI experiments [5].
3.1 SRHD-SGI Preserves the Main Characteristics of SR-SGI
In this subsection we present empirical evidence that SRHD-SGI preserves the most important characteristics of SR-SGI: the easy-hard-easy pattern of hardness evolution and the exponential growth of the empirical hardness. Easy-hard-easy pattern. The empirical hardness of the SAT encoded SR-SGI exhibits an easy-hard-easy pattern when plotted against p. This is an important feature of SR-SGI, as it offers a large selection of "predictable-hard" instances. Given the similarities between SR-SGI and SRHD-SGI we expected that SRHD-SGI also exhibits an easy-hard-easy pattern, and the experiments confirmed our intuition. In this experiment we fixed m, q and n and let p vary. For all solvers,
2 Brief descriptions of the solvers are available on the SAT competition website [10].
Fig. 2. Exponential growth of the hardest SAT encoded SRHD-SGI instances (n=19): median time (ms, log scale) vs. median number of variables for SATzilla2009_C, SATzilla2009_I, SAperloT, clasp, gnovelty+, picosat and precosat
B - gnovelty+ (m=19,q=136,n=19)
400000
9000 SR-SGI SRHD-SGI
Median Time(ms)
350000
7000
300000
6000
250000
5000 200000
4000
150000
3000
100000
2000
50000
1000
0
0 0
10
20
30
40
50
C - picosat and precosat (m=19,q=136,n=16) 16000
0
10
20
30
40
50
D - SAperloT and SATzilla_I (m=19,q=145,n=17) 250000
picosat on SR-SGI picosat on SRHD-SGI precosat on SR-SGI precosat on SRHD-SGI
14000 Median Time(ms)
SR-SGI SRHD-SGI
8000
12000
SAperloT on SR-SGI SAperloT on SRHD-SGI SATzilla_I on SR-SGI SATzilla_I on SRHD-SGI
200000
10000
150000
8000 100000
6000 4000
50000
2000 0
0 0
10
20
30
40
50
0
10
20
30
40
50
Fig. 3. Comparison of SAT encoded SRHD-SGI and SRSGI (x-axis - p in %.)
we noticed an easy-hard-easy pattern in the variation of the empirical hardness (see Figure 1). The same pattern occurred when the number of visited nodes (for complete solvers) or steps (for incomplete solvers) was plotted against p. For fixed m and n, and for all solvers, we noticed that the value of p at which the
hardness peaks decreases as q increases. A possible reason for this correlation is that as H becomes denser, fewer edges need to be removed from G for H to contain many copies of G, which makes the instances easy. When m and q are fixed, the value of p which produces the hardest instances increases as n increases. The presence of the tell-tales is a possible explanation for this behavior. For fixed H, which is the case when m and q are fixed, the number of tell-tales increases as the number of vertices of G increases, since more high degree vertices of H are used. More edges need to be removed from G to hide a larger set of tell-tales, and this explains why the value of p at which the hardness peaks increases with n. Exponential growth rate. The empirical hardness of the SAT encoded SR-SGI instances generated at the hardness peak increases exponentially with the number of variables. This is a desirable characteristic of any generator, as it implies that the generated instances will be hard even asymptotically. We expected that this may also be the case for SAT encoded SRHD-SGI. To check this hypothesis we fixed m and plotted the hardness of the SRHD-SGI instances from the hardness peak against their number of variables (see Figure 2). The shape of the curves is essentially exponential. Similar curves were obtained when the hardness of the instances was plotted against their size (number of literals).
3.2 SRHD-SGI Generates Harder SAT Instances than SR-SGI
In this subsection we compare SRHD-SGI with SR-SGI. The main purpose of this comparison is to assess the hardness of SRHD-SGI instances. We expect that the vertex selection procedure of SRHD-SGI combined with its edge removal method produces harder SAT instances. To check this hypothesis we compared instances of the two models, generated for the same values of the parameters. For all solvers, for the same values of the parameters, the hardest SRHD-SGI instances are at least as difficult as the hardest SR-SGI ones; in most of the cases the former are two to three times more difficult than the latter, and in some cases the difference is more than an order of magnitude. The difference between the hardness of the peak instances of the two models is larger for smaller values of n. We believe that this is a consequence of the vertex selection method of SRHD-SGI. For small values of n, the vertex selection method of SRHD-SGI and the random vertex selection may produce significantly different sets of vertices, while for large values of n, the sets of vertices produced by the two selection methods are only slightly different. We noticed an interesting pattern: for small values of p, SR-SGI instances are harder than their SRHD-SGI counterparts; for large values of p, SRHD-SGI instances are harder. When running times on the instances generated by the two models are plotted against p, the two plots cross over at values of p smaller than the ones that produce the hardest instances (see Figure 3), and this is consistent across all solvers. We think that this behavior is a consequence of the vertex and edge selection methods of SRHD-SGI, which highlights the importance of hiding the tell-tales. When p is small, only a few edges are removed from G′ and therefore the high-degree vertices are preserved in G, making the
(unique) hidden solution easy to find, easier than for the random selection of vertices. As p increases, more edges connecting high degree vertices are removed, and thus the highest degree vertices are suppressed, which conceals the tell-tales. Furthermore, the edge removal procedure makes G more uniform and therefore increases the likelihood that the variable and value heuristics are misled by the numerous near-solutions. We believe that this is the region where the hardest instances are generated and this is the reason for the superior hardness of SRHD-SGI. When p becomes large, it is expected that H will contain many copies of G. However, the selection methods of SRHD-SGI make G denser than the corresponding counterpart of SR-SGI and therefore H contains fewer copies of it, which makes the instances harder than the SR-SGI ones.
4 Conclusion
We introduced and empirically analyzed a generator of satisfiable SAT instances, based on subgraph isomorphism. We showed that it preserves the main characteristics of SR-SGI: the easy-hard-easy pattern of the evolution of the empirical hardness of the instances and the exponential growth of the empirical hardness. This is consistent for both complete and incomplete solvers. We presented empirical evidence that this model produces harder satisfiable SAT instances, than SR-SGI. All these features indicate that this is a better model for generating satisfiable SAT instances and Pseudo Boolean instances.
References 1. Audemard, G., Jabbour, S., Sais, L.: SAT graph-based representation: A new perspective. J. Algorithms 63(1-3), 17–33 (2008) 2. Ansótegui, C., Béjar, R., Fernández, C., Mateu, C.: Generating hard SAT/CSP instances using expander graphs. In: Proc. AAAI 2008, pp. 1442–1443 (2008) 3. Xu, K., Boussemart, F., Hemery, F., Lecoutre, C.: A simple model to generate hard satisfiable instances. In: Proceedings of IJCAI 2005, pp. 337–342 (2005) 4. Achlioptas, D., Gomes, C., Kautz, H., Selman, B.: Generating satisfiable problem instances. In: Proceedings of AAAI 2000, pp. 256–261 (2000) 5. Anton, C., Olson, L.: Generating satisfiable SAT instances using random subgraph isomorphism. In: Gao, Y., Japkowicz, N. (eds.) AI 2009. LNCS, vol. 5549, pp. 16–26. Springer, Heidelberg (2009) 6. Culberson, J., Gao, Y., Anton, C.: Phase transitions of dominating clique problem and their implications to heuristics in satisfiability search. In: Proc. IJCAI 2005, pp. 78–83 (2005) 7. Anton, C., Neal, C.: Notes on generating satisfiable SAT instances using random subgraph isomorphism. In: Farzindar, A., Kešelj, V. (eds.) Canadian AI 2010. LNCS, vol. 6085, pp. 315–318. Springer, Heidelberg (2010) 8. Culberson, J.: Hidden solutions, tell-tales, heuristics and anti-heuristics. In: IJCAI 2001 Workshop on Empirical Methods in AI, pp. 9–14 (2001) 9. Bayardo, R., Schrag, R.: Using CSP look-back techniques to solve exceptionally hard SAT instances. In: Freuder, E.C. (ed.) CP 1996. LNCS, vol. 1118, pp. 46–60. Springer, Heidelberg (1996) 10. The International SAT Competitions web page, http://www.satcompetition.org
Utility Estimation in Large Preference Graphs Using A* Search
Henry Bediako-Asare1, Scott Buffett2, and Michael W. Fleming1
1 University of New Brunswick, Fredericton, NB, E3B 5A3 {o3igd,mwf}@unb.ca
2 National Research Council Canada, Fredericton, NB, E3B 9W4 [email protected]
Abstract. Existing preference prediction techniques can require that an entire preference structure be constructed for a user. These structures, such as Conditional Outcome Preference Networks (COP-nets), can grow exponentially in the number of attributes describing the outcomes. In this paper, a new approach for constructing COP-nets, using A* search, is introduced. Using this approach, partial COP-nets can be constructed on demand instead of generating the entire structure. Experimental results show that the new method yields enormous savings in time and memory requirements, with only a modest reduction in prediction accuracy.
1 Introduction
In recent years, the idea of autonomous agents representing users in some form of automated negotiation has gained a significant amount of interest [6–8]. This has inspired research in finding effective techniques for modeling user preferences and also eliciting preferences from a user [2, 3, 10]. Preference networks have been developed to graphically represent models of user preferences over a set of outcomes. Two such networks are Boutilier et al.'s Conditional Preference Network (CP-net) [1] and Chen et al.'s Conditional Outcome Preference Network (COP-net) [5]. Using preference elicitation techniques, such as standard gamble questions or binary comparisons [9], an agent can obtain information on a user's preferences in order to create a network. As the number of attributes grows, the size of such a network becomes unmanageable and it is typically infeasible to learn all preferences over a large number of outcomes. Therefore, given only a small number of preferences, the agent will have to predict as many others as possible over the set of outcomes. Often, preferences over a small number of outcomes are all that is needed, such as when determining whether one particular outcome is preferred over another. Here, it is valuable to be able to build a partial COP-net, containing only outcomes that are relevant in determining the relationship between the two outcomes of interest, without compromising preference prediction accuracy. In this paper, such an approach for constructing COP-nets is introduced, using A* search. Using this new methodology, smaller COP-nets can be constructed on demand, eliminating the need to generate a network for an entire set of outcomes.
2 Preference Networks
A Conditional Outcome Preference Network (COP-net) [4] is a directed graph that represents preferences over the set of outcomes. Every outcome is represented by a vertex, and for vertices v and v′ representing outcomes o and o′, respectively, if v is a proper ancestor of v′ then o is preferred over o′. The graph is transitively reduced by the removal of redundant edges. In addition to modeling the user's preferences during the elicitation stage, the COP-net can also be used to estimate a utility function over the set of outcomes. Given an initial partial utility assignment, including at least the most preferred outcome (utility 1) and the least preferred (utility 0), a utility function $\hat{u}$ over the entire set of outcomes is produced. This is done in such a way as to preserve the preference ordering specified by the COP-net. Specifically, if v and v′ represent outcomes o and o′ and v is a proper ancestor of v′, then $\hat{u}(o) > \hat{u}(o')$. Estimating a utility for every outcome allows one to compare two outcomes that might otherwise have no direct relationship in the graph.
3 Generating Partial COP-Nets
3.1 Motivation for Partial COP-Net Construction
The current method for constructing COP-nets provides a reasonably accurate model for representing a user's preferences. An agent will, with high frequency, be able to correctly predict a user's preference given any two outcomes, provided a sufficient amount of preference information has been elicited from the user [5]. However, with the current solution, in order to estimate the utility of a small number of outcomes, or even a single outcome, the entire structure must be constructed. Since the number of outcomes grows exponentially in the number of attributes, using such graphs to represent a preference profile becomes infeasible for problems with large numbers of attributes/values. It would therefore be valuable to be able to construct only a partial COP-net when predicting preferences. For example, consider the partial COP-net in Figure 1 (left). This graph contains some valuable information regarding the likely preference over oi and oj. In particular, two chains (p1 and p2) through the space of arcs in the graph are highlighted that indicate that oi is likely "higher" in the actual COP-net than oj. This indicates that there is evidence to support the prediction that oi has higher utility than oj. The goal of this paper is to generate partial COP-nets by attempting to construct such connections between the outcomes in question in the COP-net. Once the partial COP-net is constructed, the preference information represented in the connections in the partial COP-net is then exploited to determine the likely preference.
3.2 Partial COP-Net Composition
The partial COP-net is composed by finding chains of arcs through the implicit COP-net that connect the outcomes in question. These chains would represent
Fig. 1. (Left) A partial COP-net for deciding the likely preference over oi and oj and (right) a chain of nodes found using preferences o1 ≻ o2, o1 ≻ o3 and o3 ≻ o4
paths through the COP-net if direction were removed from the arcs, but do not necessarily (and are in fact very unlikely to) represent directed paths. For example, p1 and p2 in Figure 1 (left) represent two such valid chains we seek to find. We choose to generate the partial COP-net by constructing four chains as follows. Given any pair of outcomes, oi and oj, initially oi is designated as the start node and oj as the goal node. Two chains are then generated from oi to oj, one that passes through a parent node of oi and one that passes through a child node of oi. A second pair of chains is then generated similarly, but with oj as the start node and oi as the goal node, with one chain passing through a parent node of oj and one passing through a child node of oj. The idea here is to find a diverse sample of chains that reach both above and below each outcome in question. The two pairs of chains are then merged to obtain a partial COP-net from which utilities are estimated and preferences predicted.
3.3 Search Space
The set of known preferences can be seen as offering an implicit representation of the true COP-net. For example, if the preference o1 ≻ o2 is specified, then this implies that the node representing o1 is an ancestor of the node representing o2 in the true COP-net. A chain through the implicit COP-net can thus be constructed by jumping from one outcome to the next by finding preferences that dictate that one outcome is more preferred than another (and is thus an ancestor) or less preferred (and is thus a descendant). For example, if preferences are known that dictate that o1 ≻ o2, o1 ≻ o3 and o3 ≻ o4, then a chain can be constructed as depicted in Figure 1 (right). The goal is then to find reasonably small chains through the COP-net space, which will in turn result in reasonably small partial COP-nets. Since we employ the ceteris paribus assumption (i.e., "all else equal"), preferences can be quite general and therefore a single preference may dictate a large number of relationships. For example, if there are two attributes A and B with values {a1, a2} and {b1, b2} respectively, then the preference a1 ≻ a2 under ceteris paribus implies both that a1b1 is an ancestor of a2b1 and that a1b2 is an
ancestor of a2b2. We also allow conditional preferences of the form c : x ≻ y, meaning that given the condition that the attribute values specified by c hold, the values specified by x are preferred over the values specified by y, all else equal. To illustrate how the search is performed, assume the current node in the search is a1b1c1 and the elicited preferences from the user include a1 ≻ a2 and b1 : c2 ≻ c1. Applying a1 ≻ a2 will allow the search to go "down" from a1b1c1 to a2b1c1 (i.e., since a2b1c1 must be a descendant of a1b1c1 in the COP-net), and applying b1 : c2 ≻ c1 will allow us to go "up" from a1b1c1 to a1b1c2.
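A minimal sketch of this step, using an illustrative representation in which an outcome is a dict of attribute values and each elicited preference is a tuple (condition, attribute, better value, worse value):

```python
def successors(outcome, preferences):
    """Outcomes reachable in one step by applying a ceteris paribus preference.
    Each preference is (condition, attr, better, worse); `condition` is a dict of
    attribute values that must hold for the preference to apply."""
    result = []
    for condition, attr, better, worse in preferences:
        if all(outcome.get(a) == v for a, v in condition.items()):
            if outcome.get(attr) == better:      # move "down" to a less preferred outcome
                result.append({**outcome, attr: worse})
            elif outcome.get(attr) == worse:     # move "up" to a more preferred outcome
                result.append({**outcome, attr: better})
    return result

# e.g. applying (b1 : c2 ≻ c1) to a1b1c1 yields a1b1c2, an ancestor in the implicit COP-net
```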
3.4 The Heuristic Function
A consequence of the generality realized by employing ceteris paribus in the preference representation is that there may be several possible preferences that can be applied to an outcome, and thus several possible steps from which one needs to choose during the search. To increase the likelihood of identifying short chains between outcomes, we employ a heuristic to help choose which of the elicited preferences to apply and thus which outcome to select as a successor to a current outcome during the search. The heuristic used to guide the search is defined as follows. Let oi be the outcome representing the initial state in the search, let oj represent the goal state and let on be a possible choice for the next state in the search for oj. The heuristic h(on) for on is then set to be equal to the number of attributes whose values differ in on and oj. Such a heuristic should then guide the search toward the goal node. If g(on) represents the number of steps in the chain found from oi to on, then choices are made that minimize f(on) = g(on) + h(on). The heuristic is both admissible and consistent under the assumption that only single-attribute preferences are specified (which can be conditional on values for any number of attributes), which is typically the case in practice. Admissibility holds because of the fact that, since only one attribute value can be changed at each step, h(on) cannot be an overestimate of the number of steps from on to the goal. Consistency holds due to the fact that, if h(on) = k is the number of attribute values on which on and the goal outcome oj differ, then for any outcome om that can be reached from on in one step, h(om) is at least k − 1. Therefore, h(on) is no larger than h(om) plus the actual cost of moving from on to om (which is 1), which satisfies the definition of consistency.
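The heuristic itself is a simple attribute-level Hamming distance; a small sketch, continuing the dict-based outcome representation used above:

```python
def h(outcome, goal):
    """Number of attributes whose values differ between `outcome` and the goal outcome."""
    return sum(1 for attr, value in goal.items() if outcome.get(attr) != value)

def f(steps_so_far, outcome, goal):
    """A* evaluation function f(o_n) = g(o_n) + h(o_n)."""
    return steps_so_far + h(outcome, goal)
```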
3.5 Analyzing the Partial COP-Net to Predict Preferences
Once the partial COP-net is constructed by merging the four chains into a single graph, the next step is to exploit the inherent structure to estimate a utility for each node in the graph. Utilities of outcomes in a partial COP-net are estimated using a version of a method referred to as the Longest Path technique [5]. Once utilities are estimated, it is a simple matter of comparing estimates for the nodes in question and selecting the highest as the most preferred.
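The sketch below illustrates one plausible longest-path-style estimate: each node is placed according to its position on the longest chain through it, which guarantees that ancestors receive strictly higher utilities than their descendants. The exact interpolation used by the Longest Path technique of [5] may differ.

```python
from functools import lru_cache

def longest_path_utilities(children):
    """children: dict node -> list of children in the partial COP-net (a DAG);
    ancestors are more preferred than descendants. Returns utilities in [0, 1]."""
    nodes = set(children) | {c for cs in children.values() for c in cs}
    parents = {v: [] for v in nodes}
    for u, cs in children.items():
        for c in cs:
            parents[c].append(u)

    @lru_cache(maxsize=None)
    def down(v):   # length of the longest chain from v down to a sink
        return 1 + max(map(down, children[v])) if children.get(v) else 0

    @lru_cache(maxsize=None)
    def up(v):     # length of the longest chain from a source down to v
        return 1 + max(map(up, parents[v])) if parents[v] else 0

    # place each node by its relative position on the longest chain through it
    return {v: down(v) / (up(v) + down(v)) if up(v) + down(v) else 0.5
            for v in nodes}
```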
4 Results
Figure 2 (left) shows the accuracy of the partial COP-net method for preference prediction compared to using full COP-nets. In order to give a full comparison, a third baseline approach was also evaluated. While 50% might be considered the worst prediction accuracy one could achieve (i.e., by guessing), one could easily achieve better success than this by using the preference information available and making more educated guesses. The baseline is thus an estimate of the best one could do using simplistic methods. We aim to ensure that, while we do not expect our partial COP-net method to achieve the same accuracy rate as the full COP-net approach, it should do reasonably well compared to the baseline. With the baseline approach, predictions were made simply by choosing the outcome with the higher number of individual attributes with preferred values. For example, consider two outcomes a1b1c1d1 and a2b2c2d2. If the elicited preferences were a1 ≻ a2, b2 ≻ b1 and c1 ≻ c2, then the baseline method would choose a1b1c1d1 as the more preferred, since it contains the preferred value for two of the attributes, while a2b2c2d2 only contains one, with one being unknown. Figure 2 demonstrates that our partial COP-net method performs reasonably well when compared with the full COP-net method and the baseline approach. A paired t-test shows that the difference in means between results from the partial COP-net method and the baseline approach is statistically significant at the p < 0.05 level for all problem sizes.
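A minimal sketch of this baseline, assuming each single-attribute preference is stored as a map from attribute to its preferred value (attributes with no elicited preference are simply skipped):

```python
def baseline_predict(o1, o2, preferred):
    """Return the outcome with more individually preferred attribute values.
    `preferred` maps attribute -> preferred value; the tie-breaking rule is an assumption."""
    score1 = sum(1 for attr, best in preferred.items() if o1.get(attr) == best)
    score2 = sum(1 for attr, best in preferred.items() if o2.get(attr) == best)
    return o1 if score1 >= score2 else o2
```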
Fig. 2. (Left) Accuracy of the full COP-net approach (best), partial COP-net approach (2nd-best) and baseline approach (worst), plotted against the possible number of outcomes (problem size), and (right) reduction (%) in the number of outcomes considered by using the partial COP-net approach, plotted against the number of outcomes in the problem definition (problem size)
The main objective of this paper is to show that we can still obtain a reasonably high prediction accuracy, while exploring only a tiny fraction of the space of outcomes. Figure 2 (right) demonstrates this, showing that we can ignore up to 98% of all outcomes for problems of only about 2000 outcomes. The trend indicates that this number will continue to increase with the size of the problems. We also examined what this reduction means in terms of computation speed. We found that only the tiniest fraction of computation time is now required.
For example, problems with 2304 outcomes required an average of over three hours to solve with the full COP-net approach, while the partial COP-net approach took just two seconds. This means that a vast space of situations that previously had too many outcomes to allow for any reasonable preference prediction technique is now manageable using our new technique.
5 Conclusions
The test results clearly demonstrate the benefits of the proposed methodology for constructing partial COP-nets. Although it sacrifices some prediction accuracy, it provides enormous savings in time and memory requirements. For example, in cases where it would have taken over three hours for the current methodology to build a COP-net and estimate utilities of outcomes, the proposed methodology takes just a few seconds, with only a modest reduction in prediction accuracy (80-90% instead of 90-95% for problems with more than 500 outcomes). Perhaps most importantly, the reduction in time and space requirements allows for fast predictions in cases where it would have been completely infeasible before.
References 1. Boutilier, C., Brafman, R.I., Domshlak, C., Hoos, H.H., Poole, D.: CP-nets: A tool for representing and reasoning with conditional ceteris paribus preference statements. Journal of Artificial Intelligence Research 21, 135–191 (2004) 2. Boutilier, C., Patrascu, R., Poupart, P., Schuurmans, D.: Regret-based utility elicitation in constraint-based decision problems. In: Proceedings of IJCAI 2005, Edinburgh, Scotland, pp. 929–934 (2005) 3. Chajewska, U., Koller, D., Parr, R.: Making rational decisions using adaptive utility elicitation. In: AAAI 2000, Austin, Texas, USA, pp. 363–369 (2000) 4. Chen, S.: Reasoning with conditional preferences across attributes. Master’s thesis, University of New Brunswick (2006) 5. Chen, S., Buffett, S., Fleming, M.W.: Reasoning with conditional preferences across attributes. In: Proc. of AI 2007, Montreal, Canada, pp. 369–380 (2007) 6. Faratin, P., Sierra, C., Jennings, N.R.: Using similarity criteria to make issue tradeoffs in automated negotiations. Artificial Intelligence 142, 205–237 (2002) 7. Fatima, S.S., Wooldridge, M., Jennings, N.R.: Optimal negotiation of multiple issues in incomplete information settings. In: Proc. 3rd Int. Conf. on Autonomous Agents and Multi-Agent Systems, New York, NY, pp. 1080–1087 (2004) 8. Jennings, N.R., Faratin, P., Lomuscio, A., Parsons, S., Sierra, C., Wooldridge, M.: Automated negotiation: prospects, methods and challenges. Int. J. of Group Decision and Negotiation 10(2), 199–215 (2001) 9. Keeney, R.L., Raiffa, H.: Decisions with Multiple Objectives: Preferences and Value Tradeoffs. John Wiley and Sons, Inc., Chichester (1976) 10. Sandholm, T., Boutilier, C.: Preference elicitation in combinatorial auctions. In: Cramton, P., Shoham, Y., Steinberg, R. (eds.) Combinatorial Auctions (2006)
A Learning Method for Developing PROAFTN Classifiers and a Comparative Study with Decision Trees
Nabil Belacel and Feras Al-Obeidat
Institute for Information Technology, National Research Council of Canada
Abstract. PROAFTN belongs to the Multiple-Criteria Decision Aid (MCDA) paradigm and requires a set of parameters for the purpose of classification. This study proposes a new inductive approach for obtaining these parameters from data. To evaluate the performance of the developed learning approach, a comparative study between PROAFTN and decision trees in terms of learning methodology, classification accuracy, and interpretability is presented in this paper. The major distinguishing property of decision trees is their ability to generate classification models that can be easily explained. The PROAFTN method also has this capability, thereby avoiding a black-box situation. Furthermore, with the learning approach proposed in this study, the experimental results show that PROAFTN strongly competes with ID3 and C4.5 in terms of classification accuracy. Keywords: Classification, PROAFTN, Decision Tree, MCDA, Knowledge Discovery.
1 Introduction
Decision tree learning is a widely used method in data mining and machine learning. The strengths of decision trees (DT) can be summarized as follows: (1) they are simple to understand and interpret — people are able to understand decision tree models after a brief explanation; (2) they are not black-box models — the classification model can be easily explained by Boolean logic; (3) the methodology used to construct a classification model is not hard to understand; (4) the classification results are usually reasonable. These advantages of DT make it a common and highly used classification method in research and applications [4]. This paper introduces a new learning technique for the classification method PROAFTN, which requires several parameters (e.g., intervals, discrimination thresholds and weights) that need to be determined to perform the classification. This study investigates a new automatic approach for the elicitation of PROAFTN parameters and prototypes from data during the training process. The major characteristics of PROAFTN can be summarized as follows:
– PROAFTN is not a black box and the results are automatically explained; that is, it provides the possibility of access to more detailed information concerning the classification decision.
– PROAFTN can perform two learning paradigms: deductive and inductive learning. In the deductive approach, the decision maker has the role of establishing the required parameters for the studied problem, whereas in the inductive approach, the parameters and the classification models are obtained automatically from the datasets.
Based on what has been presented above, one can see that DT and PROAFTN can generate classification models which can be easily explained and interpreted. However, when evaluating any classification method there is another important factor to be considered: classification accuracy. Based on the experimental study presented in Section 4, PROAFTN can generate higher classification accuracy than the decision tree learning algorithms ID3 and C4.5 [9]. The paper is organized as follows. Section 2 presents the PROAFTN method. Section 3 proposes automatic learning methods based on machine learning techniques to infer PROAFTN parameters and prototypes. In Section 4 a comparative study based on computational results generated by PROAFTN and DT (ID3 and C4.5) on some well-known datasets is presented and analyzed. Finally, conclusions and future work are presented in Section 5.
2 PROAFTN Method
The PROAFTN procedure belongs to the class of supervised learning methods for solving classification problems. PROAFTN has been applied to the resolution of many real-world practical problems [6, 7, 10]. The following subsections describe the required parameters, the classification methodology, and the procedure used by PROAFTN.
2.1 Initialization
From a set of n objects known as a training set, consider an object a which needs to be classified; assume this object a is described by a set of m attributes $\{g_1, g_2, \ldots, g_m\}$ and z classes $\{C^1, C^2, \ldots, C^z\}$. Given an object a described by the scores of the m attributes, for each class $C^h$ we determine a set of $L_h$ prototypes. For each prototype $b_i^h$ and each attribute $g_j$, an interval $[S_j^1(b_i^h), S_j^2(b_i^h)]$ is defined where $S_j^2(b_i^h) \geq S_j^1(b_i^h)$. To apply PROAFTN, two intervals, the pessimistic $[S_j^1(b_i^h), S_j^2(b_i^h)]$ and the optimistic $[S_j^1(b_i^h) - d_j^1(b_i^h), S_j^2(b_i^h) + d_j^2(b_i^h)]$, should be determined for each attribute prior to classification. As mentioned above, the indirect technique approach will be adapted to infer these intervals. The following subsections explain the stages required to classify the object a to the class $C^h$ using PROAFTN.
2.2 Computing the Fuzzy Indifference Relation
To use the classification method PROAFTN, we first need to calculate the fuzzy indifference relation $I(a, b_i^h)$. The calculation of the fuzzy indifference relation is based on the concordance and non-discordance principle, which is identified by:
m
∑ whjC j (a, bhi)
j=1
(1)
58
N. Belacel and F. Al-Obeidat
Weak Indifference
Cj (a, bhi )
d2j
d1j
1
0
Strong Indifference
No Indifference
No Indifference
gj (a) Sj1 -d1j
Sj1
Sj2
Sj2 +d2j
Fig. 1. Graphical representation of the partial indifference concordance index between the object a and the prototype bhi represented by intervals
where whj is the weight that measures the importance of a relevant attribute g j of a specific class Ch : w j ∈ [0, 1] , and
m
∑ whj = 1
j=1
j = 1, ..., m; h = 1, ..., z C j (a, bhi )
is the degree that measures the closeness of the object a to the prototype bhi according to the attribute g j . To calculate C j (a, bhi ), two positive thresholds d 1j (bhi ) and d 2j (bhi ) need to be obtained. The computation of C j (a, bhi ) is graphically presented in Fig. 1. 2.3 Evaluation of the Membership Degree The membership degree between the object a and the class Ch is calculated based on the indifference degree between a and its nearest neighbor in Bh . The following formula identifies the nearest neighbor: d(a,Ch ) = max{I(a, bh1 ), I(a, bh2 ), ..., I(a, bhLh )}
(2)
2.4 Assignment of an Object to the Class The last step is to assign the object a to the right class Ch ; the calculation required to find the right class is straightforward: a ∈ Ch ⇔ d(a,Ch ) = max{d(a,Ci )/i ∈ {1, ..., z}}
(3)
3 Proposed Techniques to Learn PROAFTN As discussed earlier, PROAFTN requires the elicitation of its parameters for the purpose of classification. Several approaches have been used to learn PROAFTN in [1] [2] [3].
Learning PROAFTN with a Comparative Study with DT
59
Algorithm 1. Building the classification model for PROAFTN 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11:
Determine of a threshold β as reference for interval selection z ← Number of classes, i ← Prototype’s index m ← Number of attributes, k ← Number of intervals for each attribute 2r h I rjh ← Apply discretization to get {S1r jh , S jh } for each attribute g j in each class C r ℜ ← Percentage of values within the interval I jh per class Generate PROAFTN intervals using discretization for h =← 1, z do i←0 for g ← 1, m do for r ← 1, k do if ℜ of I rjh ≥ β then
Choose this interval to be part of the prototype bhi Go to next attribute gm+1 else Discard this interval and find another one (i.e., I r+1 jh ) end if end for end for if (bhi = 0/ ∀g jh ) then i ← i + 1 end if (Prototypes’ composition): The selected branches from attribute g1 to attribute gm represent the induced prototypes for the class Ch 23: end for 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22:
In this study however, a different technique is proposed to get these parameters from data. During the learning process, the necessary preferential information (a.k.a. prototypes) required to construct the classification model are extracted first; then this information are used for assigning the new cases (testing data) to the closest class. The PROAFTN parameters that are required to be elicited automatically from training dataset are: {S1j (bhi ), S2j (bhi ), d 1j (bhi ), d 2j (bhi )}. This study proposes the discretization techniques to infer these parameters. Once these parameters are determined, the next stage is to build the classification model, which consists of a set of prototypes that represents each category. The obtained prototypes can then be used to classify the new instances. Discretization techniques are utilized to obtain the intervals [S1j (bhi ), S2j (bhi )] automatically for each attribute in the training dataset. The obtained intervals will then be adjusted to get the other fuzzy intervals [S1j (bhi ) − d 1j (bhi ), S2j (bhi ) + d 2j (bhi )], which will be used subsequently for building the classification model. Following to the discretization phase is model development stage. The proposed model uses an induction approach given in Algorithm 1. The tree is constructed in a top-down recursive manner, where each branch represents the generated intervals for each attribute. The prototypes can then be extracted for the decision tree to compose decision rules to be used for classifying testing data.
60
N. Belacel and F. Al-Obeidat Table 1. Dataset Description Dataset Instances Attributes Classes Breast Cancer 699 11 2 Heart Disease 303 14 2 Haberman’s Survival 306 3 2 Iris 150 4 3 Mammographic Mass 961 4 2 Pima Diabetes 768 8 2 Vehicle 846 18 4 Vowel Context 990 11 10 Wine 178 13 3 Yeast 1484 8 10
4 Application of the Developed Algorithms The proposed method was implemented in java applied to 10 popular datasets described in Table 1. These datasets are available on the public domain of the University of California at Irvine (UCI) [5]. To compare our proposed approaches with ID3 and C4.5 algorithms, we have used the open source platform Weka [11] for this purpose. The comparisons are made on all datasets using stratified 10-fold cross validation. The generated results applied on the datasets for PROAFTN, ID3 and C4.5 (pruned and unpruned) is shown in Table 2. The Friedman test [8] is used to recognize the performance of PROAFTN against other DT classifiers. Table 2. ID3 and C4.5 versus PROAFTN in terms of classification accuracy
1 2 3 4 5 6 7 8 9 10 Avg Rank
Algorithm / Dataset Breast Cancer Heart Disease Haberman’s Survival Iris Mammographic Mass Pima Diabetes Vehicle Vowel context Wine Yeast
ID3 89.80 74.10 59.80 90.00 75.35 58.33 60.77 72.42 80.5 41.71 70.28 4
C4.5 (unpruned) 94.56 74.81 70.92 96.00 81.27 71.22 72.93 82.63 91.55 54.78 79.07 3
C4.5 PROAFTN (pruned) 94.56 97.18 76.70 79.04 71.90 70.84 96.00 96.57 82.10 84.30 71.48 72.19 72.58 76.36 82.53 81.86 91.55 97.33 56.00 57.00 79.54 81.27 2 1
5 Discussion and Conclusions The common advantages of the PROAFTN method and the DT could be summarized as: (i) Reasoning about the results, therefore avoiding black box situations, and (ii)
Learning PROAFTN with a Comparative Study with DT
61
Simple to understand and to interpret. Furthermore, in this study PROAFTN was able to outperform ID3 and C4.5 in terms of classification accuracy. To apply PROAFTN, some parameters should be determined before performing classification procedures. This study proposed the indirect technique by using discretization to establish these parameters from data. It has been shown in this study that PROAFTN is a promising classification method to be applied in a decision-making paradigm and knowledge discovery process. Hence, we have a classification method that relatively outperforms DT and is also interpretable. More improvements could be made to enhance PROAFTN; this includes (i) involve the weights factor in the learning process. The weights in this paper are assumed to be equal; (ii) extend the comparative study to include various classification methods from different paradigms.
References 1. Al-Obeidat, F., Belacel, N., Carretero, J.A., Mahanti, P.: A Hybrid Metaheuristic Framework for Evolving the PROAFTN Classifier. Special Journal Issues of World Academy of Science, Engineering and Technology 64, 217–225 (2010) 2. Al-Obeidat, F., Belacel, N., Carretero, J.A., Mahanti, P.: Automatic Parameter Settings for the PROAFTN Classifier Using Hybrid Particle Swarm Optimization. In: Li, J. (ed.) AI 2010. LNCS, vol. 6464, pp. 184–195. Springer, Heidelberg (2010) 3. Al-Obeidat, F., Belacel, N., Carretero, J.A., Mahanti, P.: Differential Evolution for learning the classification method PROAFTN. Knowledge-Based Systems 23(5), 418–426 (2010) 4. Apteand, C., Weiss, S.: Data mining with decision trees and decision rules. Future Generation Computer Systems (13) (1997) 5. Asuncion, A., Newman, D.J.: UCI machine learning repository (2007) 6. Belacel, N., Boulassel, M.: Multicriteria fuzzy assignment method: A useful tool to assist medical diagnosis. Artificial Intelligence in Medicine 21(1-3), 201–207 (2001) 7. Belacel, N., Vincke, P., Scheiff, M., Boulassel, M.: Acute leukemia diagnosis aid using multicriteria fuzzy assignment methodology. Computer Methods and Programs in Biomedicine 64(2), 145–151 (2001) 8. Demˇsar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006) 9. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo (1993) 10. Sobrado, F.J., Pikatza, J.M., Larburu, I.U., Garcia, J.J., de Ipi˜na, D.: Towards a clinical practice guideline implementation for asthma treatment. In: Conejo, R., Urretavizcaya, M., P´erez-de-la-Cruz, J.-L. (eds.) CAEPIA/TTIA 2003. LNCS (LNAI), vol. 3040, pp. 587–596. Springer, Heidelberg (2004) 11. Witten, H.: Data Mining: Practical Machine Learning Tools and Techniques. Kaufmann Series in Data Management Systems (2005)
Using a Heterogeneous Dataset for Emotion Analysis in Text Soumaya Chaffar and Diana Inkpen School of Information Technology and Engineering, University of Ottawa Ottawa, ON, Canada {schaffar,diana}@site.uottawa.ca
Abstract. In this paper, we adopt a supervised machine learning approach to recognize six basic emotions (anger, disgust, fear, happiness, sadness and surprise) using a heterogeneous emotion-annotated dataset which combines news headlines, fairy tales and blogs. For this purpose, different features sets, such as bags of words, and N-grams, were used. The Support Vector Machines classifier (SVM) performed significantly better than other classifiers, and it generalized well on unseen examples. Keywords: Affective Computing, Emotion Analysis in Text, Natural Language Processing, Text Mining.
1 Introduction Nowadays the emotional aspects attract the attention of many research areas, not only in computer science, but also in psychology, healthcare, communication, etc. For instance, in healthcare some researchers are interested in how acquired diseases of the brain (e.g., Parkinson) affect the ability to communicate emotions [10]. Otherwise, with the emergence of Affective Computing in the late nineties [11], several researchers in different computer science areas, e.g., Natural Language Processing (NLP), Human Computer Interaction (HCI), etc. are interested more and more in emotions. Their aim is to develop machines that can detect users' emotions and express different kinds of emotion. The most natural way for a computer to automatic emotion recognition of the user is to detect his emotional state from the text that he entered in a blog, an online chat site, or in another form of text. Generally, two approaches (knowledge-based approaches and machine learning approaches) were adopted for automatic analysis of emotions in text, aiming to detect the writer’s emotional state. The first approach consists of using linguistic models or prior knowledge to classify emotional text. The second one uses supervised learning algorithms to build models from annotated corpora. For sentiment analysis, machine learning techniques tend to obtain better results than lexical-based techniques, because they can adapt well to different domains [7]. In this paper, we adopted a machine learning approach for automatic emotion recognition from text. For this purpose, we used a heterogeneous dataset collected from blogs, fairly tales and news headlines. The rest of the paper is organized as follows: Section 2 identifies the several datasets that we used for our emotion detection in text. In Section 3, we describe the C. Butz and P. Lingras (Eds.): Canadian AI 2011, LNAI 6657, pp. 62–67, 2011. © Springer-Verlag Berlin Heidelberg 2011
Using a Heterogeneous Dataset for Emotion Analysis in Text
63
methodology that we adopted for this purpose. Section 4 presents and discusses the results by comparing different machine learning techniques for detecting emotion in texts. Finally, Section 5 concludes the paper and outlines the future direction of our research.
2 Datasets Five datasets have been used in the experiments reported in this paper. We describe each one in details below. 2.1 Text Affect This data consists of news headlines drawn from the most important newspapers, as well as from the Google News search engine [12] and it has two parts. The first one is developed for the training and it is composed of 250 annotated sentences. The second one is designed for testing and it consists of 1,000 annotated sentences. Six emotions (anger, disgust, fear, joy, sadness and surprise) were used to annotate sentences according to the degree of emotional load. For our experiments, we further use the most dominant emotion as the sentence label, instead of a vector of scores representing each emotion. 2.2 Neviarouskaya et al.’s Dataset Two datasets produced by these authors were used in our experiments [8, 9]. In these datasets, ten labels were employed to annotate sentences by three annotators. These labels consist of the nine emotional categories defined by Izard [8] (anger, disgust, fear, guilt, interest, joy, sadness, shame, and surprise) and a neutral category. In our experiments, we considered only sentences on which two annotators or more completely agreed on the emotion category. We briefly describe in the following the two datasets. • Dataset 1 This dataset includes 1000 sentences extracted from various stories in 13 diverse categories such as education, health, and wellness [8]. • Dataset 2 This dataset includes 700 sentences from collection of diary-like blog posts [9]. 2.3 Alm’s Dataset This data include annotated sentences from fairy tales [1]. For our experiments, we used only sentences with high annotation agreement, in other words sentences with four identical emotion labels. Five emotions (happy, fearful, sad, surprised and angry-disgusted) from the Ekman’s list of basic emotions were used for sentences annotations. Because of data sparsity and related semantics between anger and disgust, these two emotions were merged together by the author of the dataset, to represent one class.
64
S. Chaffar and D. Inkpen
2.4 Aman’s Dataset This dataset consists of emotion-rich sentences collected from blogs [3]. These sentences were labelled with emotions by four annotators. We considered only sentences for which the annotators agreed on the emotion category. Ekman’s basic emotions (happiness, sadness, anger, disgust, surprise, and fear), and also a neutral category were used for sentences annotation.
3 Emotion Detection in Text To find the best classification algorithm for emotion analysis in text, we compared the three classification algorithms from the Weka software [14] with the BOW representation: J48 for Decision Trees, Naïve Bayes for the Bayesian classifier and the SMO implementation of SVM. To ensure proper emotional classification of text, it is essential to choose the relevant feature sets to be considered. We describe in the following the ones that we employed in our experiments: • Bag-Of-Words (BOW) Each sentence in the dataset was represented by a feature vector composed of Boolean attributes for each word that occurs in the sentence. If a word occurs in a given sentence, its corresponding attribute is set to 1; otherwise it is set to 0. BOW considers words as independent entities and it does not take into consideration any semantic information from the text. However, it performs generally very well in text classification. • N-grams They are defined as sequences of words of length n. N-grams can be used for catching syntactic patterns in text and may include important text features such as negations, e.g., “not happy”. Negation is an important feature for the analysis of emotion in text because it can totally change the expressed emotion of a sentence. For instance, the sentence “I’m not happy” should be classified into the sadness category and not into hapiness. For these reasons, some research studies in sentiment analysis claimed that N-grams features improve performance beyond the BOW approach [4]. • Lexical emotion features This kind of features represents the set of emotional words extracted from affective lexical repositories such as, WordNetAffect [13]. We used in our experiments all the emotional words, from the WordNetAffect (WNA), associated with the six basic emotions.
4 Results and Discussion For an exploratory purpose, we conducted several experiments using the labelled datasets for classifying emotional sentences. 4.1 Cross-Validation First of all, it is important to prepare the data for proper emotional sentence classification. For classifying text into emotion categories, some words such as “I”
Using a Heterogeneous Dataset for Emotion Analysis in Text
65
and “the” are clearly useless and should be removed. Moreover, in order to reduce the number of words in the BOW representation we used the LovinsStemmer stemming technique from the Weka tool [14], which replaces a word by its stem. Another important way for reducing the number of words in the BOW representation is to replace negative short forms by negative long forms, e.g., “don’t” is replaced by “do not”, “shouldn’t” is replaced by “should not”, and so on. Applying this method of standardizing negative forms gave us better results for BOW representation and can consider effectively negative expressions in N-grams. In this later, the features include words, bigrams and trigrams. In the spirit of exploration, we used five datasets to train supervised machine learning algorithms: Text Affect, Alm’s dataset, Aman’s dataset and the Global dataset (see Table 1). We also used the ZeroR classifier from Weka as a baseline; it classifies data into the most frequent class in the training set. Table 1. Results for the training datasets using the accuracy rate (%)
Text Affect Alm’s Dataset Aman’s Dataset Global Dataset
Baseline 31.6 36.86 68.47 50.47
Naive Bayes 39.6 54.92 73.02 59.72
J48 32.8 47.47 71.43 64.70
SMO 39.6 61.88 81.16 71.69
The results presented in Table 1 show that in general the SMO algorithm has the highest accuracy rate for each dataset. The use of the global dataset for the training is much better, because, on one hand it contains heterogeneous data collected from blogs, fairly tales and new headlines, and on the other hand the difference between accuracy rates for the SMO algorithm and the baseline is higher compared to Aman’s dataset. With the global dataset, SMO is statistically better than the next-best classifier (J48) with a confidence level of 95% based on the accuracy rate (according to a paired t-test). Specifically, for Aman’s dataset, we achieved an accuracy rate of 81.16%, which is better than the highest accuracy rate (73.89%) reported in [2]. Compared to their work, we used not only emotional words, but also non-emotional ones, as we believe that some sentences can express emotions through underlying meaning and depending on the context, i.e., “Thank you so much for everyone who came”. From the context, we can understand that this sentence expresses happiness, but it does not include any emotional word. 4.2 Supplied Test Set Given the performance on the training datasets, one important issue that we need to consider in emotion analysis in text is the ability to generalize on unseen examples, since it depends on sentences’ context and the vocabulary used. Thus, we tested our model (trained on the global dataset) on the three testing datasets using three kinds of feature sets (BOW, N-grams, emotion words from WordNetAffect). The results are presented in Table 2 below.
66
S. Chaffar and D. Inkpen Table 2. SMO results using different feature sets
Test sets
Text Affect
Neviarouskaya et al.’s dataset 1
Neviarouskaya et al.’s dataset 2
Feature sets WNA BOW BOW +WNA N-grams WNA BOW BOW +WNA N-grams WNA BOW BOW +WNA N-grams
Accuracy rate (%) baseline SMO 36.55 38.90 36.20 36.55 40.30 44.76 57.81 24.73 56.28 49.47 48.91 53.45 35.89 52.56 50.69
As shown in Table 2, using the N-grams representation for Text Affect gives better results than the BOW representation, but the difference is not statistically significant. However, the use of N-grams representation for Neviarouskaya et al.’ datasets decreased the accuracy rate compared to the BOW representation. As we notice from the table, using features sets from WordNetAffect did not help in improving the accuracy rates of the SMO classifier.
5 Conclusion In this paper, we presented a machine learning approach for automatic emotion recognition from text. For this purpose, we used a heterogeneous dataset collected from blogs, fairly tales and new headlines, and we compared it to using each homogenous dataset separately as training data. Moreover, we showed that the SMO algorithm made a statistically significant improvement over other classification algorithms, and that it generalized well on unseen examples.
Acknowledgments We address our thanks to the Natural Sciences and Engineering Research Council (NSERC) of Canada for supporting this research work.
References 1. Alm, C.O.: Affect in Text and Speech. PhD Dissertation. University of Illinois at UrbanaChampaign (2008) 2. Aman, S., Szpakowicz, S.: Identifying expressions of emotion in text. In: Matoušek, V., Mautner, P. (eds.) TSD 2007. LNCS (LNAI), vol. 4629, pp. 196–205. Springer, Heidelberg (2007)
Using a Heterogeneous Dataset for Emotion Analysis in Text
67
3. Aman, S.: Identifying Expressions of Emotion in Text, Master’s thesis, University of Ottawa, Ottawa, Canada (2007) 4. Arora, S., Mayfield, E., Penstein-Ros, C., Nyberg, E.: Sentiment Classification using Automatically Extracted Subgraph Features. In: Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text (2010) 5. Ekman, P., Friesen, W.V.: Facial action coding system: Investigator’s guide. Consulting Psychologists Press, Palo Alto (1978) 6. Izard, C.E.: The Face of Emotion. Appleton-Century-Crofts, New York (1971) 7. Melville, P., Gryc, W., Lawrence, R.: Sentiment Analysis of Blogs by Combining Lexical Knowledge with Text Classification. In: Proc. of KDD, pp. 1275–1284 (2009) 8. Neviarouskaya, A., Prendinger, H., Ishizuka, M.: @AM: Textual Attitude Analysis Model. In: Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, Los Angeles, USA (2010) 9. Neviarouskaya, A., Prendinger, H., Ishizuka, M.: Compositionality Principle in Recognition of Fine-Grained Emotions from Text. In: Proceedings of the International Conference on Weblogs and Social Media. AAAI, San Jose (2009) 10. Paulmann, S., Pell, M.D.: Dynamic emotion processing in Parkinson’s disease as a function of channel availability. Journal of Clinical and Experimental Neuropsychology 32(8), 822–835 (2010) 11. Picard, R.W.: Affective Computing. MIT Press, Cambridge (1997) 12. Strapparava, C., Mihalcea, R.: Semeval- 2007 task 14: Affective text. In: Proceedings of the 4th International Workshop on the SemEval 2007, Prague (2007) 13. Strapparava, C., Valitutti, A., Stock, O.: The affective weight of lexicon. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation, Genoa, Italy (2006) 14. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
Using Semantic Information to Answer Complex Questions Yllias Chali, Sadid A. Hasan, and Kaisar Imam University of Lethbridge Lethbridge, AB, Canada {chali,hasan,imam}@cs.uleth.ca
Abstract. In this paper, we propose the use of semantic information for the task of answering complex questions. We use the Extended String Subsequence Kernel (ESSK) to perform similarity measures between sentences in a graph-based random walk framework where semantic information is incorporated by exploiting the word senses. Experimental results on the DUC benchmark datasets prove the effectiveness of our approach. Keywords: Complex Question Answering, Graph-based Random Walk Method, Extended String Subsequence Kernel.
1
Introduction
Resolving complex information needs is not possible by simply extracting named entities (persons, organizations, locations, dates, etc.) from documents. Complex questions often seek multiple different types of information simultaneously and do not presuppose that one single answer can meet all of its information needs. For example, with a factoid question like: “What is the magnitude of the earthquake in Haiti?”, it can be safely assumed that the submitter of the question is looking for a number. However, the wider focus of a complex question like: “How is Haiti affected by the earthquake?” suggests that the user may not have a single or well-defined information need and therefore may be amenable to receiving additional supporting information relevant to some (as yet) undefined informational goal [6]. This type of questions require inferencing and synthesizing information from multiple documents. This information synthesis in Natural Language Processing (NLP) can be seen as a kind of topic-oriented, informative multi-document summarization, where the goal is to produce a single text as a compressed version of a set of documents with a minimum loss of relevant information [1]. So, in this paper, given a complex question and a set of related data, we generate a summary in order to use it as an answer to the complex question. The graph-based methods (such as LexRank [4], TextRank [10]) are applied successfully to generic, multi-document summarization. In topic-sensitive LexRank [11], a sentence is mapped to a vector in which each element represents the occurrence frequency (TF–IDF1 ) of a word. However, for the task like answering 1
The TF–IDF (term frequency-inverse document frequency) is a statistical measure used to evaluate how important a word is to a document in a collection or corpus.
C. Butz and P. Lingras (Eds.): Canadian AI 2011, LNAI 6657, pp. 68–73, 2011. c Springer-Verlag Berlin Heidelberg 2011
Using Semantic Information to Answer Complex Questions
69
complex questions that requires the use of more complex semantic analysis, the approaches with only TF–IDF are often inadequate to perform fine-level textual analysis. In this paper, we extensively study the impact of using semantic information in the random walk framework for answering complex questions. We apply the Extended String Subsequence Kernel (ESSK) [8] to include semantic information by incorporating disambiguated word senses. We run all experiments on the DUC2 2007 data. Evaluation results show the effectiveness of our approach.
2
Background and Proposed Framework
2.1
Graph-Based Random Walk
In [4], the concept of graph-based centrality is used to rank a set of sentences, in producing generic multi-document summaries. A similarity graph is produced for the sentences in the document collection. In the graph, each node represents a sentence. The edges between nodes measure the cosine similarity between the respective pair of sentences where each sentence is represented as a vector of term specific weights. The term specific weights in the sentence vectors are products of local and global parameters. The model is known as term frequency-inverse document frequency (TF–IDF) model. To apply the LexRank in a query-focused context, a topic-sensitive version of LexRank is proposed in [11]. The score of a sentence is determined by a mixture model of the relevance of the sentence to the query and the similarity of the sentence to other high-scoring sentences. The relevance of a sentence s to the question q is computed by: rel(s|q) = log (tfw,s + 1) × log (tfw,q + 1) × idfw w∈q
where, tfw,s and tfw,q are the number of times w appears in s and q, respectively. A sentence that is similar to the high scoring sentences in the cluster should also have a high score. For instance, if a sentence that gets high score based on the question relevance model is likely to contain an answer to the question, then a related sentence, which may not be similar to the question itself, is also likely to contain an answer. This idea is captured by the following mixture model [11]: p(s|q) = d ×
rel(s|q) sim(s, v) + (1 − d) × × p(v|q) z∈C rel(z|q) z∈C sim(z, v)
(1)
v∈C
2.2
Our Approach
We claim that for a complex task like answering complex questions where the relatedness between the query sentences and the document sentences is an important factor, the graph-based method of ranking sentences would perform better 2
Document Understanding Conference– http://duc.nist.gov/
70
Y. Chali, S.A. Hasan, and K. Imam
if we could encode the semantic information instead of just the TF–IDF information in calculating the similarity between sentences. Thus, our mixture model for answering complex questions is: p(s|q) = d × SEM SIM (s, q) + (1 − d) × SEM SIM (s, v) × p(v|q) (2) v∈C
where, SEMSIM(s,q) is the normalized semantic similarity between the query (q) and the document sentence (s) and C is the set of all sentences in the collection. In this paper, we encode semantic information using ESSK [7] and calculate the similarity between sentences. We reimplemented ESSK considering each word in a sentence as an “alphabet”, and the alternative as its disambiguated sense [3] that we find using our Word Sense Disambiguation (WSD) System [2]. We use a dictionary based disambiguation approach assuming one sense per discourse. We use WordNet [5] to find the semantic relations among the words in a text. We assign weights to the semantic relations. Our WSD technique can be decomposed into two steps: (1) building a representation of all possible senses of the words and (2) disambiguating the words based on the highest score. We use an intermediate representation (disambiguation graph) to perform the WSD. We sum the weight of all edges leaving the nodes under their different senses. The one sense with the highest score is considered the most probable sense. In case of tie between two or more senses, we select the sense that comes first in WordNet, since WordNet orders the senses of a word by decreasing order of their frequency.
3 3.1
Evaluation and Analysis Task Definition
In this research, we consider the main task of DUC 2007 to run our experiments. The task was: “Given a complex question (topic description) and a collection of relevant documents, the task is to synthesize a fluent, well-organized 250-word summary of the documents that answers the question(s) in the topic”. We choose 35 topics randomly from the given dataset and generate summaries for each of them according to the task guidelines. 3.2
Automatic Evaluation
We carried out the automatic evaluation of our summaries using ROUGE [9] toolkit (i.e. ROUGE-1.5.5 in this study). The comparison between the TF–IDF system and the ESSK system is presented in Table 1. To compare our systems’ performance with the state-of-the-art systems, we also list the ROUGE scores of the NIST baseline system (defined in DUC-2007) and the best system in DUC-2007. The NIST baseline system generated the summaries by returning all the leading sentences (up to 250 words) in the T EXT field of the most recent document(s). Analysis of the results show that the ESSK system improves the ROUGE-1 and ROUGE-SU scores over the TF–IDF system by 0.26%, and 1.48%, respectively whereas the ESSK system performs closely to the best system besides beating the baseline system by a considerable margin.
Using Semantic Information to Answer Complex Questions
71
Table 1. ROUGE F-scores for all systems Systems ROUGE-1 ROUGE-SU TF–IDF 0.379 0.135 ESSK 0.380 0.137 NIST Baseline 0.334 0.112 Best System 0.438 0.174
3.3
Manual Evaluation
Even if the ROUGE scores had significant improvement, it is possible to make bad summaries that get state-of-the-art ROUGE scores [12]. So, we conduct an extensive manual evaluation in order to analyze the effectiveness of our approach. Each summary is manually evaluated for a Pyramid-based evaluation of contents and also a user evaluation is conducted to get the assessment of readability (i.e. fluency) and overall responsiveness according to the TAC 2010 summary evaluation guidelines3 . Content Evaluation. In the DUC 2007 main task, 23 topics were selected for the optional community-based pyramid evaluation. Volunteers from 16 different sites created pyramids and annotated the peer summaries for the DUC main task using the given guidelines4 . 8 sites among them created the pyramids. We used these pyramids to annotate a randomly chosen 5 peer summaries for each of our system to compute the modified pyramid scores. Table 2 shows the modified pyramid scores of all the systems including the NIST baseline system and the best system of DUC-2007. From these results we see that all the systems perform better than the baseline system and ESSK performs the best. Table 2. Modified pyramid scores for all systems Systems Modified Pyramid Scores NIST Baseline 0.139 Best System 0.349 TF–IDF 0.512 ESSK 0.547
User Evaluation. Some university graduate students judged all the system generated summaries (70 summaries in total) for readability (fluency) and overall responsiveness. The readability score reflects the fluency and readability of the summary (independently of whether it contains any relevant information) and is based on factors such as the summary’s grammaticality, non-redundancy, referential clarity, focus, and structure and coherence. The overall responsiveness 3 4
http://www.nist.gov/tac/2010/Summarization/Guided-Summ.2010.guidelines. html http://www1.cs.columbia.edu/~ becky/DUC2006/2006-pyramid-guidelines.html
72
Y. Chali, S.A. Hasan, and K. Imam
score is based on both content (coverage of all required aspects) and readability. The readability and overall responsiveness is each judged on a 5-point scale between 1 (very poor) and 5 (very good). Table 3 presents the average readability and overall responsive scores of all the systems. Again, the NIST–generated baseline system’s scores and the best DUC-2007 system’s scores are given for meaningful comparison. The results show that the ESSK system improves the readability and overall responsiveness scores over the TF–IDF system by 30.61%, and 42.17%, respectively while it performs closely to the best system’s scores besides beating the baseline system’s overall responsiveness score by a significant margin. Table 3. Readability and overall responsiveness scores for all systems Systems Readability Overall Responsiveness NIST Baseline 4.24 1.80 Best System 4.11 3.40 TF–IDF 2.45 2.30 ESSK 3.20 3.27
4
Conclusion
In this paper, we used semantic information and showed its impact in measuring the similarity between the sentences in the random walk framework for answering complex questions. We used Extended String Subsequence Kernel (ESSK) to include semantic information by applying disambiguated word senses. We evaluated the systems automatically using ROUGE and reported an extensive manual evaluation to further analyze the performance of the systems. Comparisons with the state-of-the-art systems showed effectiveness of our proposed approach.
Acknowledgments The research reported in this paper was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada – discovery grant and the University of Lethbridge.
References 1. Amigo, E., Gonzalo, J., Peinado, V., Peinado, A., Verdejo, F.: An Empirical Study of Information Synthesis Tasks. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, pp. 207–214 (2004) 2. Chali, Y., Joty, S.R.: Word Sense Disambiguation Using Lexical Cohesion. In: Proceedings of the 4th International Conference on Semantic Evaluations, pp. 476– 479. ACL, Prague (2007)
Using Semantic Information to Answer Complex Questions
73
3. Chali, Y., Hasan, S.A., Joty, S.R.: Improving Graph-based Random Walks for Complex Question Answering Using Syntactic, Shallow Semantic and Extended String Subsequence Kernels. Information Processing and Management (2010) (in Press, Corrected Proof), http://www.sciencedirect.com/science/article/ B6VC8-51H5SB4-1/2/4f5355410ba21d61d3ad9f0ec881e740 4. Erkan, G., Radev, D.R.: LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research 22, 457–479 (2004) 5. Fellbaum, C.: WordNet - An Electronic Lexical Database. MIT Press, Cambridge (1998) 6. Harabagiu, S., Lacatusu, F., Hickl, A.: Answering complex questions with random walk models. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 220–227. ACM, New York (2006) 7. Hirao, T., Suzuki, J., Isozaki, H., Maeda, E.: Dependency-based Sentence Alignment for Multiple Document Summarization. In: Proceedings of COLING 2004, pp. 446–452. COLING, Geneva (2004) 8. Hirao, T., Suzuki, J., Isozaki, H., Maeda, E.: NTT’s Multiple Document Summarization System for DUC2003. In: Proceedings of the Document Understanding Conference (2003) 9. Lin, C.Y.: ROUGE: A Package for Automatic Evaluation of Summaries. In: Proceedings of Workshop on Text Summarization Branches Out, Post-Conference Workshop of Association for Computational Linguistics, Barcelona, Spain, pp. 74– 81 (2004) 10. Mihalcea, R., Tarau, P.: TextRank: Bringing Order into Texts. In: Proceedings of the Conference of Empirical Methods in Natural Language Processing, Barcelona, Spain (2004) 11. Otterbacher, J., Erkan, G., Radev, D.R.: Using Random Walks for Questionfocused Sentence Retrieval. In: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, Canada, pp. 915–922 (2005) 12. Sj¨ obergh, J.: Older Versions of the ROUGEeval Summarization Evaluation System Were Easier to Fool. Information Processing and Management 43, 1500–1505 (2007)
Automatic Semantic Web Annotation of Named Entities Eric Charton, Michel Gagnon, and Benoit Ozell ´ Ecole Polytechnique de Montr´eal, Montr´eal, H3T 1J4, Qu´ebec, Canada {eric.charton,michel.gagnon,benoit.ozell}@polymtl.ca
Abstract. This paper describes a method to perform automated semantic annotation of named entities contained in large corpora. The semantic annotation is made in the context of the Semantic Web. The method is based on an algorithm that compares the set of words that appear before and after the name entity with the content of Wikipedia articles, and identifies the more relevant one by means of a similarity measure. It then uses the link that exists between the selected Wikipedia entry and the corresponding RDF description in the Linked Data project to establish a connection between the named entity and some URI in the Semantic Web. We present our system, discuss its architecture, and describe an algorithm dedicated to ontological disambiguation of named entities contained in large-scale corpora. We evaluate the algorithm, and present our results.
1
Introduction
Semantic Web is a web of data. This web of data is constructed with documents that are, unlike HTML files, RDF1 assertions establishing links between facts and things. RDF documents, like HTML documents, are accessible through URI2 . A set of best practices for publishing and connecting RDF semantic data on the Web is referred by the term Linked Data. An increasing number of data providers have delivered Linked Data documents over the last three years, leading to the creation of a global data space containing billions of RDF assertions. For the usability of the Semantic Web, a new breed of smarter applications must become available. To encourage the emergence of such innovative softwares, we need NLP solutions that can effectively establish a link between documents and Semantic Web data. In this paper, we propose a general schema of automatic annotation, using disambiguation resources and algorithms, to establish relations between named entities in a text and the ontological standardized semantic content of the Linked Data network. 1 2
Resource Description Framework, is an official W3C Semantic Web specification for metadata models. Uniform Resource Identifier (URI) is the name of the string of characters used to identify a resource on the Internet.
C. Butz and P. Lingras (Eds.): Canadian AI 2011, LNAI 6657, pp. 74–85, 2011. c Springer-Verlag Berlin Heidelberg 2011
Automatic Semantic Web Annotation of Named Entities
75
This article is structured as follows: section 2 investigates the annotation task problem from a broad perspective and describes the features of semantic annotation task in the context of Semantic Web; section 3 describes the proposed system architecture and its implementation. In section 4 we present the experiment and corpora on which the evaluation has been done. Finally, section 5 comments the results obtained by our system. We conclude and give some perspectives in section 6.
2
Problem Description
The basic principle of annotations is to add information to a source text. In a computer perspective, annotations can take various forms, but their function is always the same: to introduce complementary information and knowledge into a document. Two main kinds of information can be attributed to a word or a group of words by an annotation process : a fixed class label defined by a taxonomy standard or a link to some external knowledge. A class description can be assigned to a word or a group of words called a Named Entity (NE). By class, we mean a label describing the nature of the object expressed by the words. This object can be, for example, a person, an organization, a product, or a location. Attribution of such class is the Named Entity Recognition (NER) task, widely investigated ([2,1,11]).The granularity of classes contained in a NE taxonomy can be highly variable ([13]) but strictly, NER task is a classification task, whose purpose is to assign to a sequence of words a unique class label. Label will be for example PERS to describe a person, or ORG for an organization, and so on. This means that NER task is unable to introduce any more complementary information into the text. It is possible to introduce an upper level of granularity in the NE taxonomy model (for example, we can distinguish two kinds of places, LOC.ADMI for a city and LOC.GEO for a National Park) but with strong limitations. Thus, there is no way to introduce data like birth date of a person or ground surface of a city. To achieve this task of associating properties to NE, an upper level of annotation is needed, expressed by a relation between NE and an external knowledge. It consists in assigning to an identified NE a link to a structured external knowledge base, like the one delivered on the Semantic Web. This is the Semantic Annotation (SA) task, previously investigated by ([10,7]). 2.1
Entity Labeling versus Semantic Labeling
The example in Figures 2 and Table 2 illustrates the difference between SA and NER and its implication on knowledge management. Let’s consider a sample text to annotate, as presented in Table 1. The first level of ambiguity encountered by the NER task is related to the words polysemy. To illustrate this we show in Figure 1 the numerous possible concept-class values available for the Paris word. The main objective of the NER task is to manage this first level of disambiguation, generally through statistical
76
E. Charton, M. Gagnon, and B. Ozell Table 1. A sample document to label with various named entities contained in
Paris is a town in Oneida County, New York, USA. The town is in the southeast part of the county and is south of Utica. The population was 4,609 at the 2000 census. The town was named after an early benefactor, Colonel Isaac Paris.
Fig. 1. Ambiguity of a class label for a named entity like Paris. It can be a city, and asteroid, a movie, a music album or a boat.
methods ([2], [8]). The NER task results in a text where NE are labeled by classes, as presented in Table 2. But despite the NE labeling process, we can show that a level of ambiguity is still present. Paris is correctly annotated with the LOC (locality) class label, but this class is not sufficient to determine precisely which locality it is, according to the numerous existing cities that are also named Paris (Figure 2). Table 2. Sample of word with standard NE labels in the document Paris{LOC} is a town in Oneida County{LOC}, New York{LOC}, USA{LOC}. The town is in the southeast part of the county and is south of Utica{LOC}. The population was 4,609{AMOUNT} at the 2000 census{DATE}. The town was named after an early benefactor, Colonel Isaac Paris{PERS}.
Paris as a LOC
France
Kentucky
Idaho
Ontario
Maine
Tenessee
Fig. 2. Ambiguity of entity for a same NE class label: the Paris word, even with its Location class, is still ambiguous
2.2
Previous Semantic Labeling Propositions
The task of SA has received an increasing attention in the last few years. A general survey of all the semantic annotation techniques have been proposed by
Automatic Semantic Web Annotation of Named Entities
77
([15]). None of the described systems have been integrated in the general schema of Semantic Web. They are all related to specific and proprietary or non-standard ontological representations. The KIM platform ([10]) provides a two-step labeling process including a NER step to attribute NE labels to words before establishing the semantic link. The semantic descriptions of entities and relations between them are kept in a knowledge base encoded in the KIM ontology and resides in the same “semantic repository”. SemTag ([5]) is another example of a tool that focuses only on automatic mark-up. It is based on IBM’s text analysis platform Seeker and uses similarity functions to recognize entities that occur in contexts similar to marked up examples. The key problem with large-scale automatic mark-up is ambiguity. A Taxonomy Based Disambiguation (TBD) algorithm is proposed to tackle this problem. SemTag can be viewed as a bootstrapping solution to get a semantically tagged collection off the ground. Recently, ([9]) presented Moat, a proposition to bridge the gap between tagging and Linked Data. Its goal is to provide a simple and collaborative way to annotate content thanks to existing URI with as little effort as possible and by keeping freetagging habits. However, Moat does not provide an automatic generic solution to establish a link between text and an entry point in the Linked Data Network. 2.3
The Word Sense Disambiguation Problem
The problem with those previous propositions is related to the Word Sense Disambiguation (WSD). WSD consists in determining which sense of a word is used when it appears in a particular context. KIM and Semtag, when they establish a link between a labeled NE and an ontology instance, need a complementary knowledge resource to deal with the homonymic NEs of a same class. For the NER task, this resource can be generic and generative: a labeled corpus used to train a statistical labeling tool (CRF, SVM, HMM). This statistical NER tool will be able to infer a class proposition through its training from a limited set of contexts. But this generative approach is not applicable to the SA task, as each NE to link to a semantic description has a specific word context, marker of its exact identity. Many propositions have been done to solve this problem. Recently, ([16]) suggest to use the LSA3 techniques mixed with cosine similarity measure to disambiguate terms in the perspective of establishing a semantic link. The Kim system ([10]) re-uses the Gate platform and its NLP components and apply rules to establish a disambiguated link. Semtag uses two kinds of similarity functions: bayesian, and cosinus. But the remaining problem for all those propositions is the lack of access to an exhaustive and wide knowledge of contextual information related to the identity of the NE. For our previous Paris example, those systems could establish a disambiguated link between any Paris NE and its exact Linked Data representation only if they have access to
3
Latent Semantic Analysis is a technique of analyzing relationships between a set of documents and terms using term-document matrix built from Singular Value Decomposition.
78
E. Charton, M. Gagnon, and B. Ozell
Semantic Disambiguation Algorithm (SDA) Best Cosine Score Cosine Similarity mesure (Words.TF.IDF,{town.Oneida;County;New York ...}) Linked Data Interface (LDI)
Metadata containers E)
Surface forms
(E.r)
Words:TF.IDF
(E.c)
LinkedData
(E.rdf)
Paris, Paris New York York:69,Cassvile:58,Oneida:52 ... http://dbpedia.org/data/Paris,_New_York.rdf Paris, Paname, Lutece France:342;Seine:210;Eiffel:53 ... http://dbpedia.org/data/Paris.rdf Paris Kentucky:140,Varden:53,Bourbon:37 http://dbpedia.org/data/Paris,_Kentucky.rdf
Semantic Link Linked Data
Fig. 3. Architecture of the system with metadata used as Linked Data Interface (LDI) and Semantic Disambiguation Algorithm (SDI)
an individual usual word contextual modelized resource. Unfortunately, such a knowledge is not present in RDF triples of the LinkedData network, neither in standard exhaustive ontologies like DBPedia.
3
Our Proposition: A Linked Data Interface
To solve this problem, we propose a SA system that uses an intermediate structure to determine the exact semantic relation between a NE and its ontological representation on the Linked Data network. In this structure, called Linked Data Interface (LDI), there is an abstract representation for every Wikipedia article. Each one of these abstract representations contains a pointer to the Linked Data document that provides an RDF description of the entity. The disambiguation task is achieved by identifying the item in the LDI that is most similiar to the context of the named entity (the context is represented by the set of words that appear before and after the NE). This algorithm is called Semantic Disambiguation Algorithm (SDA). The architecture of this semantic labeling system is presented in Figure 3. 3.1
The Linked Data Interface (LDI)
To each entity that is described by an entry in Wikipedia, we associate some metadata, composed of three elements: (i) a set of surface forms, (ii) the set of words that are contained in the entity description, where each word is accompanied by its tf.idf weight ([12]), and (iii) an URI that points to some entity in
Automatic Semantic Web Annotation of Named Entities Graph structure extracted from Wikipedia
metadata surface forms
E.r
79
[
Mirage Mirage Jet Mirage aircraft Dassault Mirage F1C Dassault Mirage
Fig. 4. All possible surface forms are collected from multiple linguistic editions of Wikipedia and transferred into a set E.r. Here two complementary surface forms for a plane name are collected from the German edition.
the Linked Data Network. The tf.idf value associated to a word is its frequency in the Wikipedia document, multiplied by a factor that is inversely proportional to the number of Wikipedia documents in which the word occurs (the exact formula is given below). The set of surface forms for an entity is obtained by taking every Wikipedia entry that points to it by a redirection link, every entry that corresponds to its description in another language and, finally, in every disambiguation page that points to this entity, the term in the page that is associated to this pointer. As an example, the surface form set for the NE Paris (France) contains 39 elements, (eg. Ville Lumi`ere, Ville de Paris, Paname, Capitale de la France, D´epartement de Paris). In our application, the surface forms are collected from five linguistic editions of Wikipedia (English, German, Italian, Spanish and French). We use such cross-linguistic resource because in some cases, a surface form may appear only in a language edition of Wikipedia that is not the one of the source text. A good example of this is given by the Figure 4. In this example, we see that the surface form Dassaut Mirage is not available in the English Wikipedia but can be collected from the German edition of Wikipedia. The structure of Wikipedia and the sequential process to build metadata like ours, has been described previously ([3,4]). We will now define more formally the LDI. Let C be the Wikipedia corpus. C is partitioned into subsets C l representing linguistic editions of Wikipedia (i.e fr.wikipedia.org or en.wikipedia.org, which are independent language sub-corpus of the whole Wikipedia). Let D be a Wikipedia article. Each D ∈ C l is represented by a triple (D.t, D.c, D.l), where D.t is the title of the article, made of a unique word sequence, D.c is a collection of terms w contained in the article, D.l is a set of links between D and other Wikipedia pages of C. Any link in D.l can be an internal redirection inside C l (a link from a redirection page or a disambiguation page) or in another document in C (in this case, a link to the same article in another language).
80
E. Charton, M. Gagnon, and B. Ozell
The LDI may now be described the following way. Let E ∈ LDI be a metadata container that corresponds to some D ∈ C. E is a tuple (E.t, E.c, E.r, E.rdf ). We consider that E and D are in relation if and only if E.t = D.t. We say that E represents D, which will be noted E → D. E.c contains pairs built with all words w of D.c associated with their tf.idf value calculated from C l . The tf.idf weight for a term wi that appears in document dj is the product of the two values tf and idf which are calculated as shown in equations 1 and 2. In the definition of idf , the denominator |{d : d ∈ C l , wi ∈ d}| is the number of documents where the term wi appears. tf is expressed by equation 2, where wi,j is the number of occurrences of the term wi in document dj , and the denominator is the sum of number of occurrences of all terms in document dj . |C l | (1) |{d : d ∈ C l , wi ∈ d}| wi,j tfi,j = (2) k wk,j The E.c part of a metadata container must be trained for each language. In our LDI the three following langages have been considered: English, French and Spanish. The amount of representations collected can potentially elaborate semantic links for 745 k different persons or 305 k organizations in English, 232 k persons, and 183 k products in French. The set of all surface forms related to a document D is built by taking all the titles of special documents (i.e redirection or disambiguation pages) targeted by the links contained in D.l, and stored in E.r. The E.rdf part of the metadata container must contain a link to one or more entry points of the Linked Data network. An entry point is an URI, pointing to an RDF document that describes the entity represented by E. As an example, http://dbpedia.org/data/Spain.rdf is the entry point of the DBpedia instance related to Spain inside the Linked Data network. The special interest of DBpedia for our application is that the ontology is a mirror of Wikipedia. Any English article of Wikipedia (and most French and Spanish ones) is supposed to have an entry in DBpedia. DBpedia delivers also correspondence files between others entry point in the Linked Data Network and Wikipedia records4 : for example, another entry point for Spain in the Linked Data Network is on the CIA Factbook RDF collection5 . We use those table files to create E.rdf . For our experiments, we included in E.rdf only the link to the DBPedia entry point in the Linked Data Network. idfi = log
3.2
Semantic Disambiguation Algorithm (SDA)
To identify a named entity, we compare it with every metadata container Ei ∈ LDI. Each Ei that contains at least one surface form that corresponds to the 4 5
See on http://wiki.DBpedia.org/Downloads34 files named Links to Wikipedia articles. http://www4.wiwiss.fu-berlin.de/factbook/resource/Spain
Automatic Semantic Web Annotation of Named Entities
81
named entity surface form in the text is added into the candidate set. Now, for each candidate, its set of words Ei.c is used to calculate a similarity measure with the set of words that forms the context of the named entity in the text. In our application, the context consists of the n words that come immediately before and after the NE. The tf.idf is used to calculate this similarity measure. The Ei that gets the highest similarity score is selected, and its URI pointer Ei.rdf is used to identify the entity in Linked Data that corresponds to the NE in the text. Regarding the candidate set CS that has been found for the NE to be disambiguated, three situations can occur:
1. CS = ∅: there is no metadata container for the NE.
2. |CS| = 1: there is only one metadata container available to establish a semantic link between the NE and an entity in the Linked Data network.
3. |CS| > 1: there is more than one possible relevant metadata container, among which at most one must be selected.
Case 1 is trivial (no semantic link available). For cases 2 and 3, a cosine similarity measure (see equation 3) is applied to the NE context S.w and E.c^{tf.idf} for every metadata container E ∈ CS. As usual, the vectors are formed by considering each word as a dimension. If a word appears in the NE context, we put the value 1 in its position in the vector space, 0 otherwise. For E.c, we put the tf.idf values in the vector. The similarity values are used to rank every E ∈ CS:

    cosinus(S, E) = (S.w · E.c^{tf.idf}) / (||S.w|| ||E.c^{tf.idf}||)    (3)
Finally, the best candidate EΩ according to the similarity ranking is chosen if its similarity value is higher than the threshold value α, as described in equation 4. The algorithm derived from this method is presented in Table 3.

    Eω = argmax_{Ei ∈ CS} cosinus(S, Ei);   EΩ = ∅ if score(Eω) ≤ α, and EΩ = Eω otherwise    (4)

4 Experiments
There is no standard evaluation schema for applications like the one described in this paper. There are many metrics (precision, recall, word error rates) and annotated corpora for the NER task, but none of them includes a Gold Standard for Semantic Web annotation. We therefore evaluated our system with an improved standard NER test corpus: we associate with each NE of such a corpus a standard Linked Data URI coming from DBpedia. This proposal has the following advantage.
Table 3. Pseudo code of Semantic Disambiguation Algorithm (SDA)
SDA Function: rdf = SDA(sf, S[])
Input:  sf  = surface form of the detected NE to link
        S[] = contextual words of the NE
Output: rdf = URI link between the NE and a Linked Data entry point
Local variables: E[] = metadata; CS[] = candidate set of metadata; α = threshold value
Algorithm:
  (1) CS[] = search all E[] where E[].c matches sf
  (2) if (CS[] == null) return null
  (3) for x = all CS[]
    (3.1) CS[x].score = cosinus(CS[x].w:TF.idf[], S[])
  (4) order CS[] by descending CS[].score
  (5) if (CS[0].score > α) return CS[0].rdf
    (5.1) else return null
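For illustration, the SDA of Table 3 can be sketched in a few lines of Python. This is our own sketch, not the authors' implementation; it assumes each metadata container is a dictionary holding the surface forms E.r, the weighted words E.c and the entry point E.rdf described above:

    import math

    def cosinus(context_words, weighted_words):
        # binary context vector vs. tf.idf-weighted vector (equation 3)
        dot = sum(weighted_words.get(w, 0.0) for w in set(context_words))
        norm_ctx = math.sqrt(len(set(context_words)))
        norm_w = math.sqrt(sum(v * v for v in weighted_words.values()))
        return dot / (norm_ctx * norm_w) if norm_ctx and norm_w else 0.0

    def sda(surface_form, context_words, ldi, alpha):
        # steps (1)-(2): collect candidates whose surface forms contain the detected form
        candidates = [e for e in ldi if surface_form in e["r"]]
        if not candidates:
            return None
        # steps (3)-(4): score and rank candidates by cosine similarity
        scored = sorted(candidates,
                        key=lambda e: cosinus(context_words, e["c"]),
                        reverse=True)
        best = scored[0]
        # step (5): apply the threshold α before returning a Linked Data entry point
        return best["rdf"] if cosinus(context_words, best["c"]) > alpha else None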
DBpedia is now one of the best-known and most accurate RDF resources. Because of this, DBpedia has evolved into a reference interlinking resource for the Linked Data semantic network (see http://wiki.dbpedia.org/Interlinking; DBpedia now interlinks the CIA World Fact Book, US Census, Wikicompany, RDF WordNet and more). The NER corpora used to build the semantically annotated corpora are described below.
Test Corpora. The base corpus for the French semantic annotation evaluation is derived from the French ESTER 2 corpus [6]. The named entity (NE) detection task on French in ESTER 2 was proposed as a standard one. The original NE tag set consists of 7 main categories (persons, locations, organizations, human products, amounts, time and functions) and 38 sub-categories. We only use the PERS, ORG, LOC, and PROD tags for our experiments, because cardinal amounts and temporal values are specific entities involving semantic content different from named entities. The English evaluation corpus is the Wall Street Journal (WSJ) version from the CoNLL Shared Task 2008 [14]. NE categories of the WSJ corpus include: Person, Organization, Location, Geo-Political Entities, Facility, Money, Percent, Time and Date, based on the definitions of these categories in the MUC and ACE tasks.
4.1 Gold Standard Annotation Method
To build the test corpora, we used a semi-automatic method. We first applied our semantic annotator and then manually removed or corrected the wrong semantic
Table 4. Not every NE contained in a text document necessarily has a corresponding representation in the LDI. This table shows the coverage of the metadata built in the LDI with respect to the NEs contained in the French ESTER 2 test corpus and in the English WSJ CoNLL 2008 test corpus.

            ESTER 2 2009 (French)                       WSJ CoNLL 2008 (English)
Labels      Entities in   Equivalent        Coverage    Entities in   Equivalent        Coverage
            test corpus   entities in LDI   (%)         test corpus   entities in LDI   (%)
PERS            1096           483           44%            612            380           62%
ORG             1204           764           63%           1698           1129           66%
LOC             1218          1017           83%            739            709           96%
PROD/GPE          59            23           39%             61             60           98%
Total           3577          2287           64%           3110           2278           73%
links. For some NEs, the Linked Data Interface does not provide semantic links. This is the problem of coverage, managed by the use of the α threshold value. The level of coverage for the two test corpora in French and English is given in Table 4.
5 Results
To evaluate the performance of the SA we applied it to the evaluation corpora with only Word, POS and NE annotations. Two experiments have been done. First, we verify the annotation process under the scope of quality of disambiguation: we apply the SA only to NEs which have corresponding entries in the LDI. This means we do not consider uncovered NEs (as presented in Table 4) in the labeling experiment; we only try to label the 2287 French and 2278 English covered NEs. Those results are given in the [no α] section of Table 5. Then, we verify the capacity of the SA to annotate a text with potentially no entry in the LDI for a given NE. This means we try to label the full set of NEs (3577 in French and 3110 in English) and to assign the NORDF label when no entry is available in the LDI. We use the threshold value α (a cosine threshold selected empirically, set to 0.10 for French and 0.25 for English in this experiment) as a confidence score to assign as annotation either a URI link or the NORDF label. Those results are given in the [α] section of Table 5. We used the recall measure (as in equation 5) to evaluate the amount of correctly annotated NEs according to the Gold Standard:

    Recall = (Total of correct annotations → NE) / (NE total)    (5)

Our results indicate a good level of performance for our system in both languages, with recall over 0.90 in French and 0.86 in English. The lower performance on the English task can be explained by the structural difference of the metadata in the two languages: nearly 0.7 million metadata containers are available in French and more than 3 million in English (according to each local Wikipedia size).
Table 5. Results of the semantic labeler applied on the ESTER 2 and WSJ CoNLL 2008 test corpus

         French tests                      English tests
NE       [no α]   Recall   [α]    Recall   [no α]   Recall   [α]    Recall
PERS       483     0.96    1096    0.91      380     0.93     612    0.94
ORG        764     0.91    1204    0.90     1129     0.85    1608    0.86
LOC       1017     0.94    1218    0.92      709     0.84     739    0.82
PROD        23     0.60      59    0.50       60     0.85      61    0.85
Total     2287     0.93    3577    0.90     2278     0.86    3020    0.86
A larger number of metadata containers also means more synonymous word propositions for a specific NE and a higher risk of bad disambiguation by the cosine algorithm. A way to solve this specific problem could be to weight the tf.idf according to the number of available metadata containers. The slight improvement of recall in the English [α] experiment is attributed to the better detection of NORDF NEs, due to the difference in NE class representation between the French and the English corpora.
6 Conclusions and Perspectives
In this paper, we presented a system to semantically annotate any named entity contained in a text, using a URI link. The URI resource used is a standard one, compatible with the Linked Data Semantic Web network. We have introduced the concept of the Linked Data Interface, an exhaustive statistical resource containing contextual and nature descriptions of potential semantic objects to label. The Linked Data Interface gives a possible answer to the problem of ambiguity resolution for an exhaustive semantic annotation process. This system is a functional proposition, available now, to automatically establish a relation between the vast number of entry points available on the Linked Data network and the named entities contained in an open text. We have shown that a large and expandable Linked Data Interface of high quality, containing millions of contextual descriptions of potential semantic entities in various languages, can be derived from Wikipedia and DBpedia. We proposed an evaluation schema for semantic annotators, using standard corpora improved with DBpedia URI annotations. As our evaluation shows, our system can establish semantic relations automatically, and can be introduced into a complete annotation pipeline after a NER tool.
References
1. Bikel, D., Schwartz, R., Weischedel, R.: An algorithm that learns what's in a name. Machine Learning 7 (1999)
2. Borthwick, A., Sterling, J., Agichtein, E., Grishman, R.: Exploiting diverse knowledge sources via maximum entropy in named entity recognition. In: Proc. of the Sixth Workshop on Very Large Corpora, pp. 152–160 (1998)
3. Bunescu, R., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: Proceedings of EACL, vol. 6 (2006) 4. Charton, E., Torres-Moreno, J.: NLGbAse: a free linguistic resource for Natural Language Processing systems. In: LREC 2010: Proceedings of LREC 2010, Malta, vol. (1) (2010) 5. Dill, S., Eiron, N., Gibson, D., Gruhl, D., Guha, R., Jhingran, A., Kanungo, T., Rajagopalan, S., Tomkins, A., Tomlin, J., et al.: SemTag and Seeker: Bootstrapping the semantic web via automated semantic annotation. In: Proceedings of the 12th International Conference on World Wide Web, p. 186. ACM, New York (2003) 6. Galliano, S., Gravier, G., Chaubard, L.: The ESTER 2 Evaluation Campaign for the Rich Transcription of French Radio Broadcasts. In: International Speech Communication Association Conference 2009, pp. 2583–2586 (2009); Interspeech 2010 7. Kiryakov, A., Popov, B., Terziev, I., Manov, D., Ognyanoff, D.: Semantic annotation, indexing, and retrieval. Web Semantics: Science, Services and Agents on the World Wide Web 2(1), 49–79 (2004) 8. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, pp. 282–289, Citeseer (2001) 9. Passant, A., Laublet, P.: Meaning of a tag: A collaborative approach to bridge the gap between tagging and linked data. In: WWW 2008 Workshop Linked Data on the Web (2008) 10. Popov, B., Kiryakov, A., Kirilov, A., Manov, D., Ognyanoff, D., Goranov, M.: KIM – semantic annotation platform. In: Fensel, D., Sycara, K., Mylopoulos, J. (eds.) ISWC 2003. LNCS, vol. 2870, pp. 834–849. Springer, Heidelberg (2003) 11. Ratinov, L., Roth, D.: Design Challenges and Misconceptions in Named Entity Recognition. In: Proceedings of the Thirteenth Conference on Computational Natural Language Learning. International Conference On Computational Linguistics (2009) 12. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval* 1. Information Processing & Management (1988) 13. Sekine, S., Sudo, K., Nobata, C.: Extended named entity hierarchy. In: Proceedings of the LREC-2002 Conference, pp. 1818–1824, Citeseer (2002) 14. Surdeanu, M., Johansson, R., Meyers, A.L.: The CoNLL-2008 shared task on joint parsing of syntactic and semantic dependencies. In: Proceedings of the CoNLL, p. 159 (2008) 15. Uren, V., Cimiano, P., Iria, J., Handschuh, S., Vargasvera, M., Motta, E., Ciravegna, F.: Semantic annotation for knowledge management: Requirements and a survey of the state of the art. Web Semantics: Science, Services and Agents on the World Wide Web 4(1), 14–28 (2006) 16. Zelaia, A., Arregi, O., Sierra, B.: A multiclassifier based approach for word sense disambiguation using Singular Value Decomposition. In: Proceedings of the Eighth International Conference on Computational Semantics - IWCS-8 2009, p. 248 (January 2009)
Learning Dialogue POMDP Models from Data Hamid R. Chinaei and Brahim Chaib-draa Computer Science and Software Engineering Department, Laval University, Quebec, Canada [email protected], [email protected]
Abstract. In this paper, we learn the components of dialogue POMDP models from data. In particular, we learn the states, observations, as well as transition and observation functions based on a Bayesian latent topic model using unannotated human-human dialogues. As a matter of fact, we use the Bayesian latent topic model in order to learn the intentions behind user’s utterances. Similar to recent dialogue POMDPs, we use the discovered user’s intentions as the states of dialogue POMDPs. However, as opposed to previous works, instead of using some keywords as POMDP observations, we use some meta observations based on the learned user’s intentions. As the number of meta observations is much less than the actual observations, i.e. the number of words in the dialogue set, the POMDP learning and planning becomes tractable. The experimental results on real dialogues show that the quality of the learned models increases by increasing the number of dialogues as training data. Moreover, the experiments based on simulation show that the introduced method is robust to the ASR noise level.
1 Introduction
Consider the following example taken from the dialogue set SACTI-2 [6], where SACTI stands for Simulated ASR-Channel Tourist Information:
  U1  Is there a good restaurant we can go to tonight
  U'1 [Is there a good restaurant week an hour tonight]
  M1  Would you like an expensive restaurant
  U2  No I think we'd like a medium priced restaurant
  U'2 [No I think late like uh museum price restaurant]
  M2  Cheapest restaurant is eight pounds per person
The first line shows the first user utterance, U1. Because of Automatic Speech Recognition (ASR), this utterance is corrupted and is received by the system as U'1, shown in brackets on the following line. M1 in the next line shows the system's response to the user. For each dialogue utterance, the system's goal is first to capture the user's intention and then to perform the best action which satisfies that intention. For instance, for the second received user utterance, U'2 [No I think late like uh museum price restaurant], the system has difficulty in finding the user's intention. In fact, for U'2, the system is required to understand that the user is looking for a restaurant, though this utterance is
Fig. 1. Intentions learned by HTMM for SACTI-1, with their 20-top words and their probabilities
highly corrupted. Specifically, it contains misleading words such as museum that can be strong observations for another user’s intention, i.e. user’s intention for museums. Recently, there has been a great interest for modelling the dialogue manager (DM) of spoken dialogue systems (SDS) using Partially Observable Markov Decision Processes (POMDPs) [8]. However, in POMDPs, similar to many other machine learning frameworks, estimating the environment dynamics is a significant issue; as it has been argued previously, for instance in [4]. In other words, the POMDP models highly impact the planned strategies. Nevertheless, a good learned model can result in desired strategies. Moreover, it can be used as a prior model in all Bayesian approaches so that the model be further updated and enhanced. As such, in this work we are interested in learning proper POMDP models for dialogue POMDPs based on human-human dialogues. In this paper, we present a method for learning the components of dialogue POMDP models using unannotated data available in SDSs. In fact, using an unsupervised method based on Dirichlet distribution, one can learn states and observations as well as transition and observation POMDP functions. In addition, we develop a simple idea for reducing the number of observations while learning the model, and define a small practical set of observations for the designed dialogue POMDP.
2 Capturing Dialogue POMDP Model for SACTI-1
This section describes the method for learning the POMDP transition and observation functions. For background about POMDPs, the reader is referred to [5]. We used the Hidden Topic Markov Model (HTMM) [3] to design a dialogue POMDP for the SACTI-1 dialogues [7], publicly available at http://mi.eng.cam.ac.uk/projects/sacti/corpora/. There are about 144 dialogues between 36 users and 12 experts who play the role of a DM, for 24 total tasks, on this data set. Similar to SACTI-2, the utterances here are also first confused using a speech recognition error simulator, and then sent to the human experts. For an application of HTMM on dialogues, in particular for learning the states of the domain, the reader is referred to [1]. Figure 1 shows 3 captured user intentions and their top 20 words with their probabilities learned by HTMM. For each intention, we have highlighted the keywords which best distinguish the intention. These intentions correspond to the user's requests for information about visiting places, transportation, and food places, respectively.
Without loss of generality, we can consider the user's intention as the system's state [2]. Based on the above captured intentions, we defined 3 primary states for the SACTI-1 DM as follows: visits (v), transports (t), and foods (f). Moreover, we defined two absorb states, Success (S) and Failure (F), for dialogues which end successfully and unsuccessfully, respectively. The notion of a successful or unsuccessful dialogue is defined by the user: after finishing each dialogue, the user assigns the level of precision and recall. This is the only explicit feedback which we require from the user in order to define the absorb states of the dialogue POMDP. A dialogue is successful if its precision and recall are above a predefined threshold. The set of actions comes directly from the SACTI-1 dialogue set and includes: GreetingFarewell, Inform, StateInterp, IncompleteUnknown, Request, ReqRepeat, RespondAffirm, RespondNegate, ExplAck, ReqAck, etc. For instance, GreetingFarewell is used for initiating or ending a dialogue, Inform is for giving information for a user's intention, ReqAck is the DM's request for the user's acknowledgement, and StateInterp interprets the intentions of the user and can be considered as implicit confirmation. The transition function is calculated using maximum likelihood with add-one smoothing to make a more robust transition model:

    T(s1, a1, s2) = (Count(s1, a1, s2) + 1) / (Count(s1, a1) + K)
where K = |S|²|A|, S is the state set, and |S| equals the number of intentions N, which is 5 in our example. For each utterance U, its corresponding state is the intention with the highest probability. For the choice of observation function, we assumed 5 observations, each one specific to one state, i.e., one user's hidden intention. We use the notation O = {VO, TO, FO, SuccessO, FailureO} for the meta observations for visits, transports, foods, Success, and Failure, respectively. For each user's intention, one can capture POMDP observations given each utterance W = {w1, ..., w|W|} using the vector β. Notice that βwz is the learned probability of each word w given each user's intention z [3]. Then, in dialogue POMDP interaction, given any arbitrary user's utterance, the POMDP observation o is captured as:

    o = argmax_z ∏_i β_{wi z}
Then, the observation function is estimated by taking average over belief of states given each action and state. For the choice of reward model, similar to previous works we penalized each action in primary states by −1, i.e. -1 reward for each dialogue turn [8]. Moreover, actions in Success state get +50 as reward, and those which lead to Failure state get −50 reward.
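As an illustration of the model construction described in this section (our own sketch, with our own names and data layout; it is not the authors' code), the smoothed transition estimate and the meta-observation mapping could look like this in Python:

    from collections import Counter

    def estimate_transitions(triples, states, actions):
        # triples: list of (s1, a1, s2) tuples extracted from the annotated dialogues
        triple_counts = Counter(triples)
        pair_counts = Counter((s1, a1) for s1, a1, _ in triples)
        K = len(states) ** 2 * len(actions)          # K = |S|^2 |A|
        return {(s1, a1, s2): (triple_counts[(s1, a1, s2)] + 1.0) /
                              (pair_counts[(s1, a1)] + K)
                for s1 in states for a1 in actions for s2 in states}

    def meta_observation(utterance_words, beta):
        # beta[z][w]: probability of word w under intention z, as learned by HTMM
        def score(z):
            p = 1.0
            for w in utterance_words:
                p *= beta[z].get(w, 1e-9)            # small floor for unseen words
            return p
        return max(beta, key=score)                  # o = argmax_z prod_i beta_{wi z}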
3 Experiments
We generated dialogue POMDP models as described in the previous section for SACTI-1. The automatically generated dialogue POMDP models consist of 5 states, 14 actions and
Fig. 2. (a): Comparison of performance of dialogue POMDPs vs. experts with respect to the number of expert training dialogues (x-axis: 24, 48, 72, 96 dialogues; y-axis: expected rewards; series: POMDP, Expert). (b): Comparison of dialogue POMDPs vs. experts with respect to the ASR noise level (x-axis: none, low, med, high; y-axis: expected rewards; series: POMDP, Expert).
5 meta observations (each of which is for one state) which are drawn by HTMM using 817 primitive observations (words). We solved our POMDP models, using ZMDP software available online at: http:// www.cs.cmu.edu/˜trey/zmdp/. We set a uniform distribution on 3 primary states (visits, transports, and foods), and set discount factor to 90%. Based on simulation, we evaluated the performance of dialogue POMDP by increasing the number of expert dialogues based on the gathered rewards. Figure 2 (a) shows that by increasing expert dialogues the dialogue POMDP models perform better. In other words, by increasing data the introduced method learns better dialogue POMDP models. The only exception is when we use 48 dialogues where the dialogue POMDP performance decreases compared to when 24 dialogues were used, and it has average performance worse than performance of experts in corresponding 48 dialogues. The reason could be use of EM for learning the model which is depended on priors α and η [3]. Moreover, EM is prone to local optima. In this work, we set the priors based on heuristic given in [3], and our trial and error experiments, which is indeed a drawback for use of parametric models in real applications. Furthermore, based on our simulations, we evaluated the robustness of generated POMDP models to ASR noise. There are four levels of ASR noise: no noise, low noise, medium noise, and high noise. For each noise level, we randomly took 24 expert dialogues and made a dialogue POMDP model. Then, for each POMDP we performed 24 simulations and gathered their expected rewards, and compared to corresponding expert dialogues. Figure 2 (b) shows the results of these experiments. As the figure shows the dialogue POMDP models are more robust to ASR noise levels compared to expert dialogues. The only exception is with the presence of no noise, where the experts perform better. This also might be because of use of EM for learning model, where the model can converge in local minima. Nevertheless, our preliminarily results based on simulation shows that dialogue POMDP models are much more robust to higher levels of noise compared to expert performance.
Table 1. Left: Sample results of applying HTMM on SACTI-1 Right: Sample results of simulation for SACTI-1 dialogue POMDP ... U1 yeah hello this is johan schmulka uh and i’m uh searching for a bar in this town can you may be tell me where the cafe blu is U’1 [hello this is now seven four bus and do you tell me where to cafe blu is] o1 FO B1 t:0.000000 v:0.000000 f:1.000000 a1: Inform(foods) M1 cafe blu is on alexander street U2 oh um yeah how can i get to alexander street and where exactly is it i know there a shopping area on alexander street um U’2 [i am yeah i am at the alexander street and where is it was on a the center of alexander street] o2 TO B2 t:0.999992 v:0.000008 f:0.000000 a2: Inform(transports) ...
... a1: GreetingFarewell M1: How can I help you? o2: VO B1: t:0.048145 v:0.912760 f:0.039093 a2: Inform(visits) M2: Here is information about visiting areas o2: TO B2: t:0.967322 v:0.008186 f:0.024490 a3: Inform(transports) M3: Here is information about transportation o3: TO B3: t:0.993852 v:0.000314 f:0.005833 a4: ReqAck(transports) M4: Are you looking for transportation o4: TO B4: t:0.945658 v:0.048333 f:0.006008 a5: Inform(transports) M5: Here is information about transportation
Moreover, the left side of Table 1 shows a sample dialogue from SACTI-1 dialogue set after applying HTMM on dialogues. In fact, this is a sample of data used for learning dialogue POMDP model. The first line of the table shows the first user’s utterance (U 1). Because of ASR this utterance is corrupted which is the following line in braces, U 1. The next line o1 is the observation behind U 1 which is used in the time of dialogue POMDP interaction. Note that it is assumed that each user utterance corresponds to one user’s intention. So, for each system’s observation the values in the following line show the system’s belief over possible hidden intentions (B1). The next line, a1 shows the DM’s action in the form of dialogue acts. For instance, Inform(foods) is the dialogue act for the actual DM’s utterance in the following line, i.e. M1: cafe blu is on alexander street. Furthermore, the right side of Table 1 shows samples of our simulation of dialogue POMDP. In the simulation time, for instance action a1, GreetingFarewell is generated by dialogue POMDP manager, the description of this action is shown in M1, How can I help you?. Then, the observation o2 is generated by environment, VO. For instance, the received user’s utterance could have been something like U’1=I would like a hour there museum first, which easily the intention behind this can be calculated using βws and equation 1. However, notice that these results are only based on dialogue POMDP simulation; where there is no actual user’s utterance, but only simulated meta observations oi . As the table shows, dialogue POMDP performance seems intuitive. For instance, in a4 the dialogue POMDP requests for acknowledgement that the user actually looks for transports, since dialogue POMDP already informed the user about transports in a3 .
Learning Dialogue POMDP Models from Data
91
4 Conclusion and Future Work A common problem in dialogue POMDP frameworks is calculating the dialogue POMDP policy. If we can estimate the POMDP model in particular the transition, observation, and reward functions then we are able to use common dynamic programming approaches for calculating POMDP policies. In this context, [8] used POMDPs for modelling a DM and defined the observation function based on confidence scores which are in turn based on some recognition features. However, the work here is tackled differently. We consider all the words in an utterance and consider the highest intention under the utterance as the meta observation for the POMDP. This makes the work presented here particularly different from [2] where the authors simply used some state keywords together with a few other words for modelling SDS POMDP observations and observation function. However, the evaluation done here is in a rather small domain for real dialogue systems. The number of states needs to be increased and the learned model should be evaluated accordingly. Moreover, the definition of states here is a simple intention state whereas in real dialogue domains the information or dialogue states are more complex. Then, the challenge would be to compare in particular the learned observation function presented here with confidence score based ones such as in in [8], as well as keyword based ones as presented in [2].
References 1. Chinaei, H.R., Chaib-draa, B., Lamontagne, L.: Learning user intentions in spoken dialogue systems. In: Filipe, J., Fred, A., Sharp, B. (eds.) ICAART 2009. CCIS, vol. 67, pp. 107–114. Springer, Heidelberg (2010) 2. Doshi, F., Roy, N.: Spoken language interaction with model uncertainty: an adaptive humanrobot interaction system. Connection Science 20(4), 299–318 (2008) 3. Gruber, A., Rosen-Zvi, M., Weiss, Y.: Hidden topic markov models. In: Artificial Intelligence and Statistics (AISTATS), San Juan, Puerto Rico (2007) 4. Liu, Y., Ji, G., Yang, Z.: Using Learned PSR Model for Planning under Uncertainty. Advances in Artificial Intelligence, 309–314 (2010) 5. Pineau, J., Gordon, G., Thrun, S.: Point-based value iteration: An anytime algorithm for pomdps. In: International Joint Conference on Artificial Intelligence (IJCAI), pp. 1025–1032 (August 2003) 6. Weilhammer, K., Williams, J.D., Young, S.: The SACTI-2 Corpus: Guide for Research Users, Cambridge University. Technical report (2004) 7. Williams, J.D., Young, S.: The SACTI-1 Corpus: Guide for Research Users. Cambridge University Department of Engineering. Technical report (2005) 8. Williams, J.D., Young, S.: Partially observable markov decision processes for spoken dialog systems. Computer Speech and Language 21, 393–422 (2007)
Characterizing a Brain-Based Value-Function Approximator Patrick Connor and Thomas Trappenberg Department of Computer Science, Dalhousie University [email protected], [email protected]
Abstract. The field of Reinforcement Learning (RL) in machine learning relates significantly to the domains of classical and instrumental conditioning in psychology, which give an understanding of biology’s approach to RL. In recent years, there has been a thrust to correlate some machine learning RL algorithms with brain structure and function, a benefit to both fields. Our focus has been on one such structure, the striatum, from which we have built a general model. In machine learning terms, this model is equivalent to a value-function approximator (VFA) that learns according to Temporal Difference error. In keeping with a biological approach to RL, the present work1 seeks to evaluate the robustness of this striatum-based VFA using biological criteria. We selected five classical conditioning tests to expose the learning accuracy and efficiency of the VFA for simple state-value associations. Manually setting the VFA’s many parameters to reasonable values, we characterize it by varying each parameter independently and repeatedly running the tests. The results show that this VFA is both capable of performing the selected tests and is quite robust to changes in parameters. Test results also reveal aspects of how this VFA encodes reward value. Keywords: Reinforcement learning, value-function approximation, classical conditioning, striatum.
1
Introduction
Over the last several decades, our understanding of RL has been advanced by psychology and neuroscience through classical/instrumental conditioning experiments and brain signal recording studies (fMRI, electrophysiological recording, etc.). Over the same period, the machine learning field has been investigating potential RL algorithms. There has been some convergence of these fields, notably the discovery that the activity of a group of dopamine neurons in the brain resembles the Temporal Difference (TD) error in TD learning [1]. One research focus in machine learning RL is the mapping of expected future reward value to states (state-value mapping) from as little experience (state-value sampling) as possible. Living things clearly grapple with this problem, continually updating 1
Funding for this work was supported in part by the Walter C. Sumner Foundation, CIHR, and NSERC.
C. Butz and P. Lingras (Eds.): Canadian AI 2011, LNAI 6657, pp. 92–103, 2011. c Springer-Verlag Berlin Heidelberg 2011
Characterizing a Brain-Based Value-Function Approximator
93
their beliefs about expected rewards from their limited experience. Indeed, the field of classical conditioning, which relies heavily on animal behavioural experiments, has explored a variety of reward-learning scenarios. The obvious need to acquire value for a rewarding state and the need to generalize this to similar circumstances is well recognized by both psychology and machine learning. What is interesting, however, is that there appear to be other useful rewardlearning strategies expressed in classical conditioning phenomena that have not yet translated into machine learning RL. Just as generalization improves learning efficiency by spreading learned value to nearby states, the classical conditioning phenomena of "latent inhibition" and "unovershadowing" appear to improve learning efficiency in their own right. At the heart of classical conditioning experiments is the presentation of a stimulus (eg. a light, tone, etc.) or combination of stimuli followed by a reward outcome (reward, punishment, or none). When a stimulus is repeatedly presented and there is no change in the reward outcome, latent inhibition [2] sets in, reducing the associability of the stimulus when the change in reward outcome eventually occurs. This promotes association to novel stimuli, which seems appropriate since novel stimuli are more likely to predict a new outcome than familiar stimuli. Latent inhibition saves the additional experience otherwise needed to make this distinction clear. Recovery from overshadowing, or "unovershadowing" [3] is one of a family of similar strategies. First, overshadowing is the process of presenting a compound stimulus followed by, say, reward (SAB → R). Although the compound will learn the full reward value, its constituent stimuli (SA and SB ) tested separately will also increase in value, where the most salient stimulus (say SB ) gains the most value. In unovershadowing, the most salient stimulus is presented but not rewarded (SB → 0) and will naturally lose some of its value. What is surprising, however, is that the absent stimulus (SA ) concurrently increases in value. This allows the animal to not only learn that SB is less rewarding than it predicted but, by process of elimination, learns that SA is more rewarding than it predicted. Unovershadowing saves the need to present and reward SA explicitly to increase its value, taking advantage of implicit logic. Whether it is generalization, latent inhibition, or unovershadowing, learning the value-function from fewer experiences will assist the animal in making rewarding choices sooner. These and other RL strategies are found in classical conditioning experiments, where subjects maintain an internal value-function, indicating rewardvalue based on the rate of their response (eg. lever presses). Since these biological strategies appear beneficial, a machine learning RL system based on RL structures in the brain may prove effective. After a brief outline of our brain-based model [4] that does value-function approximation, the present work characterizes this VFA to determine its robustness and effectiveness in several classical conditioning tests that are especially relevant to VFAs, whether artificial or biological.
94
2
P. Connor and T. Trappenberg
Striatal Model
The striatum, the input stage of the basal ganglia (BG) brain structure, is a key candidate region on which to base a VFA. The striatum is a convergence point for inputs from all over the brain (specifically, the neocortex [5]), spanning signals of sensation to abstract thought. The majority of striatal neurons project to one another (via axon collaterals) and to other BG nuclei. The synaptic strengths (i.e. weights) of these projection neurons are modulated by dopamine signals [6] (or the lack thereof), where dopamine neuron activity has been linked to the teaching signal of TD learning [1] mentioned earlier. In addition, several neural recording studies suggest that reward-value is encoded in the striatum [7][8][9], although it is not the only area of the brain that has been implicated in the representation of reward-value [10][11][12]. Our striatal model [4] is shown in Figure 1. The excitatory external input represents a real-world feature (eg. colour wavelength, tonal pitch, etc.) by providing a Gaussian activation profile surrounding a specific feature value (eg. Green, 530 nm). This emulates the "tuning curve" input to the striatum from the neocortex. The model is composed as a one-layer, one-dimensional neural network of striatal projection neurons, each excited by a subset of the external inputs and inhibited by a subset of the other projection neurons, as is the case in the striatum (see [5] and [13]). Each neuron is part of either the direct or indirect pathway, the main information processing routes through the BG, where D1 and D2 are their dominant dopamine receptor subtypes respectively. These pathways tend to behave in an opposite sense, where one increases BG output activity while the other decreases it. The output of the model, V (S), becomes the expected value of an external input (state/stimuli), computed as the sum of the direct pathway neuron activity minus the sum of the indirect pathway neuron activity. Finally, the teaching signal can be formulated in the same way as TD error, but, for the simple one-step prediction tasks used in this work, it is only necessary to use the reward prediction error (RPE), the actual reward minus the expected reward (RP E = R − V (S)). A more formal description of the model is provided in Appendix A. An important novel element in our model is the inclusion of modifiable lateral inhibitory connections. Because of these, the neurons compete, partially suppressing one another. Given an arbitrary combination of external inputs, an associated subset of neurons will become more active than the others because their external input weights correlate most with the external input. Many neurons will also be inactive, suppressed below their base activation threshold.
3
Tests, Measures, and Variables
Conventionally, to evaluate a VFA, one might seek to prove that the VFA’s state-values converge for arbitrary state-value maps or seek to test performance on a particular RL task (eg. random walk). Instead, we seek to know how effectively this striatum-based VFA employs certain RL strategies found in classical
Characterizing a Brain-Based Value-Function Approximator
95
Fig. 1. Diagram of the striatal model. External input is shaped as a Gaussian activity profile surrounding a feature value. Probabilistic inputs (finely dashed lines) from external and lateral sources are excitatory (green) and inhibitory (red) respectively, while the modulatory RPE signal can be either (blue). The direct and indirect pathways are expressed in the two populations of neurons, D1 and D2 respectively, whose activities are accumulated to compute the expected value of the input state/stimulus, V (S).
conditioning to update the value-function. This approach puts value-function update strategy first, after which agent actions can be included and convergence proofs and specific RL task comparisons pursued. Also, using classical conditioning tests helps to ascertain whether or not the striatum is responsible for this behaviour. There are a great variety of classical conditioning tests, but to be practical, we limit this to five: two to evaluate state-value mapping accuracy and three to evaluate state-value learning efficiency. The striatum-based VFA was integrated into simulations of these tests, providing results in terms of measures that are defined for each, as described below. During a test, many trials are run, where one trial consists of presenting a state/stimulus (external input) together with its expected reward-value. The entry level test for a VFA is the acquisition of a state-value. What is also important, however, is that other state-values outside of a reasonable generalization window (eg. Yellow in Fig. 1) are relatively unaffected. The acquisition test, then, pairs a state with a reward value, and compares the state-value to a sample of other state-values. We define the acquisition effectiveness measure as EA (S) =
V (S) −
1 ΣM V M i=1
V (S)
(Si )
(1)
and consider that acquisition is observed when the state-value, V (S), is twice that of the other sampled state-values, V (Si ), or EA (S) > 0.5. Twenty trials are run for each acquisition test. Six V (Si ) samples are used for the comparison.
96
P. Connor and T. Trappenberg
Secondly, it is important that a VFA be able to represent a variety of statevalue mappings. Negative patterning is the classical conditioning equivalent of the non-trivial "exclusive-OR" problem, where the subject learns to increase the value of two stimuli, SA and SB , while learning zero-value for the compound stimulus SAB . Here, we will define the negative patterning effectiveness as the difference between the average constituent value and the compound value, normalized by the average constituent value, which can be expressed as EN P (SA , SB , SAB ) =
V (SA ) + V (SB ) − 2V (SAB ) . V (SA ) + V (SB )
(2)
Negative patterning is observed while EN P (SA , SB , SAB ) > 0, that is, while the constituents have a higher value than the compound. One-hundred trials of interleaved presentation of the stimuli and their associated rewards are run for each test. In practical situations, no two experiences are identical, making it critical to generalize state-value learning. Generalization also contributes significantly to learning efficiency, spreading learned value to nearby states under the assumption that similar states are likely to have similar expected reward value. This strategy reduces the amount of state-value sampling necessary to achieve reasonable accuracy. For this test, acquisition is performed for a single feature-value and the reward value computed for 500 equally spaced feature-values. Generalization effectiveness will describe the spread of the value as a weighted standard deviation, where feature values are weighted by their associated reward values, N kV (Sk ) N Σi=1 V (Si )(i − Σk=1 N V (S ) ) Σj=1 j EG (S) = . (3) N Σj=1 V (Sj ) Generalization will be considered observed when the spread of value is at least 10% of the width of the tuning curve input. To further enhance learning efficiency, we consider the phenomena of latent inhibition described earlier. Latent inhibition’s reduction of associability can be achieved by lowering the input salience of the familiar stimulus. This is done manually, here, for our testing purposes but represents a process that lowers salience as a stimulus is repeatedly presented without a change in reward outcome. Then, when the reduced salience stimulus is combined with a fully salient, novel stimulus and followed by reward, overshadowing will result. Thus, our test of latent inhibition becomes a test of overshadowing, where the novel stimulus (SA ) overshadows the reduced (half) salience stimulus (SB ). We define the latent inhibition effectiveness measure as ELI (SA , SB ) =
V (SA ) − V (SB ) , V (SA ) + V (SB )
(4)
where the effect is observed when ELI (SA , SB ) > 0. Thirty trials are run for each test.
Characterizing a Brain-Based Value-Function Approximator
97
Finally, unovershadowing appears to improve learning efficiency by process of elimination as described previously. There are other similar phenomena (eg. backward blocking) that raise or lower the value of the absent stimulus, depending on the scenario. The unovershadowing effectiveness is defined as EUO (SA , SB ) = −
ΔV (SA ) , ΔV (SB )
(5)
where ΔV (SX ) is the change of value of stimulus SX from one trial to the next and observability occurs when EUO (SA , SB ) > 0. Here, unovershadowing is simulated by first performing the process of overshadowing (see above) with equally salient stimuli, followed by 100 trials of SB presentation without reward. Ultimately, we seek to determine the robustness of the simulation of these five tests to changes in the VFA’s parameters. Because the parameter space is very large and a full search unnecessary for our purposes, we found initial values where all tests were observed and varied the parameters independently through their valid ranges. This process characterizes the VFA, showing the conditions under which the tests break down. Besides parameters associated directly with the VFA there are others acknowledged here that are better associated with the particular RL task to be solved. To simulate input noise, Gaussian noise is added to the external input and rectified, where its standard deviation is the parameter varied in the tests. Since the intensity of stimuli and rewards may vary, the salience of inputs and rewards are multiplied by parameters varied between 0.01 and 1.
4
Results
Figures 2 and 3 represent the results for all five tests over 17 parameters. Since the VFA connectivity and initial weights are randomly initialized, each test and parameter combination was run 20 times to provide uncertainty estimates. The observability curve (upper panel) for each parameter is a summary of the more detailed effectiveness curves (lower panel). For each parameter, the observability curves from the five tests are multiplied together, giving an "intersection" of observability. So, wherever observability is zero, it means that at least one test is not observed for that parameter setting and when observability is one, all tests are observed. For example, once the lateral learning rate, β, becomes negative, unovershadowing effectiveness disappears (goes negative) and unovershadowing is no longer observed. So, the summary observability curve is zero for β < 0 because not all of the tests were observed in this range. In constrast, when β > 0, all tests are observed. The effectiveness curves, whose vertical bars denote standard deviation, are colour coded: acquisition (blue), negative patterning (green), generalization (red), latent inhibition (cyan), and unovershadowing (violet). Note that only the effectiveness of observed cases are given. Also, when part of an effectiveness curve is missing in the graph, this indicates that there
98
P. Connor and T. Trappenberg
were no cases of the associated parameter values where the effect was observed. In the effectiveness graphs, a black dotted vertical line indicates that parameter’s setting while the other parameters were independently varied. Like any other parameter, the external input learning rate, α, was meant to remain static while other parameters were varied. However, the true effects of the tuning curve width, input noise, activation exponent (θ2 in Appendix A), and the number of inputs per feature were not readily observable through their valid feature range unless α was adjusted so that the system activity was not too small nor too great. For each of these parameters, a function, α = f (param), was chosen such that the acquisition test would acquire full reward value at around the 20 trial mark.
5
Discussion
The results show that this VFA is generally robust to changes in feature values. There are, however, regions where observability disappears within its valid parameter range. From the causes of low observability and key trends in effectiveness curves given in the results, the structural and functional details necessary to successfully reproduce these five effects are described. Acquisition was prevented in only three cases2 : high activation threshold (θ0 in Appendix A), low input salience, and low mean input weight. As the activation threshold increases, fewer neurons are active because fewer have internal activations that exceed it. Likewise, internal activations are weak when either the input salience or the mean input weight is too low. Since learning only occurs in neurons that are active (see equations 8 and 9, Appendix A), neither acquisition nor any other test will learn when neurons are silent. Acquisition is otherwise robust to varying the VFA parameters. This is not surprising since the RescorlaWagner model [14] of classical conditioning acquires reward value in much the same way, the key ingredient being that they both learn in proportion to RPE and input salience. As different inputs are presented to the system it becomes clear that the subset of active neurons is input specific, enabling inputs to be represented by separate populations of neurons. A lateral inhibitory network put forth by Rabinovich et al. [15] similarly showed that asymmetric lateral connectivity (implemented in the striatum-based VFA by low lateral connection probability) led to similar input-specific patterns of activation as well. This form of activity also resembles that of sparse coarse coding [16], another value-function approximation technique that uses state-specific subsets of elements to represent state-value. This value-encoding strategy is critical for negative patterning because it allows a compound stimulus (SAB ) and its constituent stimuli (SA and SB ) to be represented in different (although overlapping) populations. Then SA and SB can have a strong positive value while SAB holds zero value. In the results we see 2
Note that for input noise, tuning curve width, and the activation exponent, testing was terminated when parameter values led to instability in the model despite the custom tuning of the learning rate, α, to avoid this.
Fig. 2. Intersection of observability curves (top) with effectiveness curves (bottom), where error bars represent the standard deviation of effectiveness. Effectiveness curves are coloured according to test: acquisition (blue), negative patterning (green), generalization (red), latent inhibition (cyan), and unovershadowing (violet). See the electronic version (www.springerlink.com) for coloured plots.
Fig. 3. Intersection of observability curves (top) with effectiveness curves (bottom), where error bars represent the standard deviation of effectiveness. Effectiveness curves are coloured according to test: acquisition (blue), negative patterning (green), generalization (red), latent inhibition (cyan), and unovershadowing (violet). See the electronic version (www.springerlink.com) for coloured plots.
negative patterning sometimes failing for high lateral connection probabilities. In this scenario, we find that it becomes difficult to separately represent the constituent and compound stimuli because there is too much overlap between their active subsets. Generalization, like acquisition, is robust, not being eliminated except when all neurons are silent. In the effectiveness curves, the generalization is always greater than or equal to the tuning curve width. As the tuning curve width is increased, a proportional increase in generalization effectiveness can be seen as well. When the generalization effectiveness is greater than the tuning curve width, closer examination reveals it to be either noise or an average increase/decrease in the state-values outside a reasonable generalization window. So, the actual generalization present in the VFA is due to the activity profile of the input rather than anything in the VFA per se. The VFA does support this means generalization, however, in that the amount of subset overlap between two feature values is proportional to the overlap between their activity profiles. This, too, accords with the approach taken by sparse coarse coding. Again, the practical benefit of latent inhibition is its ability to reduce association of familiar, ineffectual stimuli with reward outcome. We implemented this as a test of overshadowing, where the familiar stimulus was given half the salience of a novel stimulus. If reward associations were simply made in proportion to a stimulus’ input salience, as is the case for the Rescorla-Wagner model (not shown), our tests should return latent inhibition effectiveness values of ∼0.6. However, we see effectiveness values typically between 0.85 and 0.95, which seems to suggest that the novel stimulus really dominates the association and the familiar stimulus receives disproportionately little association. As mentioned earlier, however, this lateral inhibitory model of the striatum has competitive properties. It appears that this makes up the difference in the effectiveness measure, where the familiar (less salient) stimulus is not very competitive and is overwhelmed by background activity when presented alone. Unovershadowing is especially affected by the lateral learning rate. A sharp increase in unovershadowing observability occurs as the lateral learning rate becomes positive. In agreement with equations 8 and 9 (Appendix A), this suggests that for unovershadowing to be observed, a neuron’s lateral weights must be strengthened when its input weights are strengthened, and be weakened when they are weakened. This is unusual since, if gradient descent had been used to derive the lateral weight update equation as was done for the input weight update equation, the lateral weights would have learned in the opposite sense (i.e. would have been strengthened when input weights were weakened, etc.).
6 Conclusions
Systematically varying the VFA parameters led to both assessing the model’s degree of robustness and helping to determine how the VFA is capable of successfully performing the tests. This striatum-based VFA has shown to effectively express the chosen classical conditioning tests over a breadth of parameter space, supporting the notion that the striatum may be the seat of general purpose
reward-value encoding in the brain. The VFA’s ability to effectively demonstrate unovershadowing and support latent inhibition is especially worthy of note, as emergent properties of the competitive nature and lateral learning in the VFA.
7 Future Work: Application to RL Tasks
We have characterized a brain-based VFA in terms of classical conditioning tests that represent RL strategies for accurate and efficient value-function updates. This approach is not limited to brain-based VFAs, but may be applied to others with the assumption that these tests represent RL strategies worth emulating. How might this striatal model be applied to RL tasks? We propose that the striatal model be employed within the actor-critic [17] framework. Unchanged, the model would implement the critic, receiving all sensory inputs (i.e. features such as X and Y position in a Grid World task). The actor, taught by the critic, would be composed of a number of striatal models, one per action (eg. North, South, East, West). Given that biological reinforcement systems are effective beyond simple (eg. Grid World) reinforcement tasks, our approach may support the completion of complex tasks, warranting further investigation.
References 1. Schultz, W.: Predictive Reward Signal of Dopamine Neurons. J. Neurophysiol. 80(1), 1–27 (1998) 2. Lubow, R.E.: Latent inhibition. Psychological Bulletin 79, 398–407 (1973) 3. Matzel, L.D., Schachtman, T.R., Miller, R.R.: Recovery of an overshadowed association achieved by extinction of the overshadowing stimulus. Learning and Motivation 16(4), 398–412 (1985) 4. Connor, P.C., Trappenberg, T.: Classical conditioning through a lateral inhibitory model of the striatum (2011) (in preparation) 5. Wilson, C.J.: Basal Ganglia, 5th edn., pp. 361–413. Oxford University Press, Inc., Oxford (2004) 6. Wickens, J.R., Begg, A.J., Arbuthnott, G.W.: Dopamine reverses the depression of rat corticostriatal synapses which normally follows high-frequency stimulation of cortex in vitro. Neuroscience 70, 1–5 (1996) 7. Hori, Y., Minamimoto, T., Kimura, M.: Neuronal encoding of reward value and direction of actions in the primate putamen. Journal of Neurophysiology 102(6), 3530–3543 (2009) 8. Lau, B., Glimcher, P.W.: Value representations in the primate striatum during matching behavior. Neuron 58(3), 451–463 (2008) 9. Samejima, K.: Representation of Action-Specific reward values in the striatum. Science 310(5752), 1337–1340 (2005) 10. Bromberg-Martin, E.S., Hikosaka, O., Nakamura, K.: Coding of task reward value in the dorsal raphe nucleus. Journal of Neuroscience 30(18), 6262–6272 (2010) 11. Gottfried, J.A.: Encoding predictive reward value in human amygdala and orbitofrontal cortex. Science 301(5636), 1104–1107 (2003) 12. Roesch, M.R.: Neuronal activity related to reward value and motivation in primate frontal cortex. Science 304(5668), 307–310 (2004)
13. Wickens, J.R., Arbuthnott, G.W., Shindou, T.: Simulation of GABA function in the basal ganglia: computational models of GABAergic mechanisms in basal ganglia function. In: Progress in Brain Research, vol. 160, pp. 313–329. Elsevier, Amsterdam (2007) 14. Rescorla, R.A., Wagner, A.R.: A theory of pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement. In: Black, A.H., Prokasy, W.F. (eds.) Classical Conditioning II. Appleton-Century-Crofts, New York (1972) 15. Rabinovich, M.I., Huerta, R., Volkovskii, A., Abarbanel, H.D.I., Stopfer, M., Laurent, G.: Dynamical coding of sensory information with competitive networks. J. Physiol. (Paris) 94, 465–471 (2000) 16. Sutton, R.S.: Generalization in reinforcement learning: Successful examples using sparse coarse coding. In: Advances in Neural Information Processing Systems, vol. 8, pp. 1038–1044. MIT Press, Cambridge (1996) 17. Houk, J., Adams, J., Barto, A.: A Model of How the Basal Ganglia Generate and Use Neural Signals that Predict Reinforcement, pp. 249–270. The MIT Press, Cambridge (1995)
Appendix A
Formally, the striatum-based neural network can be represented as

    τ du(x,t)/dt = −u(x,t) + ∫ wI(x,y) I(y,t) dy − ∫ wL(x,z) r(u(z,t)) dz    (6)

    r(u) = θ1 (u − θ0)^θ2 if u > θ0, and r(u) = 0 otherwise    (7)

where wI and wL are the synaptic weights connecting external input (I(y,t)) and lateral inputs from other neurons respectively. The activation function, r(u), transforms the internal state (average membrane potential) into an instantaneous population firing rate. Parameter θ0 is the x-intercept, θ1 is the slope multiplier, and θ2 is the exponent (r(u) is a threshold-linear activation function when θ2 = 1). Neurons only activate if their internal state is greater than the threshold. Learning in the model happens in two ways. Weights receiving external inputs learn according to gradient descent, minimizing the squared RPE (J = (1/2) RPE²), resulting in

    wI(x,y) = wI(x,y) + α D(x) RPE θ2 θ1 (u(x,t) − θ0)^(θ2−1) I(y,t)    (8)

where α is the learning rate and D(x) = 1 for direct pathway neurons and −1 for indirect pathway neurons. The weights receiving lateral inputs learn in a way that opposes the gradient,

    wL(x,z) = wL(x,z) + α β D(x) RPE θ2 θ1 (u(x,t) − θ0)^(θ2−1) Q(u(z,t))    (9)

where β is the relative learning rate for the lateral input connections, and Q(u) = 1 for u > θ0 and 0 otherwise. Just as for r(u), there is no weight change for either of these learning equations when u(x,t) < θ0.
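For concreteness, a rough Euler-step sketch of equations 6-9 follows (our own discretization, variable names and placeholder parameter values; not the authors' code):

    import numpy as np

    def r(u, theta0=0.0, theta1=1.0, theta2=1.0):
        # activation function of equation 7
        return theta1 * np.maximum(u - theta0, 0.0) ** theta2

    def step(u, I, wI, wL, rpe, D, dt=0.1, tau=1.0, alpha=0.01, beta=0.5,
             theta0=0.0, theta1=1.0, theta2=1.0):
        # one Euler step of equation 6, then the weight updates of equations 8 and 9
        rates = r(u, theta0, theta1, theta2)
        u = u + dt * (-u + wI @ I - wL @ rates) / tau
        active = (u > theta0).astype(float)          # no weight change below threshold
        gain = alpha * D * rpe * theta2 * theta1 * np.maximum(u - theta0, 0.0) ** (theta2 - 1)
        wI = wI + active[:, None] * np.outer(gain, I)                # equation 8
        wL = wL + beta * active[:, None] * np.outer(gain, active)    # equation 9, Q(u) term
        return u, wI, wL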
Answer Set Programming for Stream Reasoning Thang M. Do, Seng W. Loke, and Fei Liu Dept. of CSCE, La Trobe University, Australia [email protected], {S.Loke,F.Liu}@latrobe.edu.au
Abstract. This paper explores Answer Set Programming (ASP) for stream reasoning with data retrieved continuously from sensors. We describe a proof-of-concept with an example of using declarative models to recognize car on-road situations. Keywords: ASP, stream reasoning, semantic sensor based applications.
1 Introduction
A new concept of "stream reasoning" has been proposed in [8]. Recently, dlvhex, an extension of ASP, has been introduced as one candidate for rule-based reasoning for the Semantic Web [6]. dlvhex uses the semantic reasoning approach, which makes it fully declarative and always terminating. dlvhex can deal with uncertain data (by using disjunctive rules to generate multiple answer sets), interoperate with arbitrary knowledge bases (to query data) and with different reasoning frameworks (e.g., higher-order logic, for more reasoning power). However, to our knowledge, using dlvhex and ASP for stream reasoning is new. Our research has three aims: i) to introduce a prototype of dlvhex stream reasoning, ii) to formalize ASP for building stream reasoning systems, and iii) to further apply Semantic Web techniques (OWL) for sensor-based applications. The contribution is to propose a framework and theoretical formulation for building ASP-based stream reasoning systems with a focus on sensor stream applications. There has been research applying ASP in wireless sensor network applications such as home-based health care services for aged people [7] or dealing with ambiguous situations [4], but the concept of stream reasoning was not introduced. To implement declarative stream processing, the logic framework of LarKC [5] is based on the aggregate functions (rather than the logic programming semantics) of the C-SPARQL [3] language. The logic framework in [1] reasons with a stream of pairs of RDF triples and timestamps, but the ability to deal with unstable data was not mentioned. Therefore, we see the necessity to have a foundation for ASP-based stream reasoning and to investigate its feasibility.
2 ASP-Based Stream Reasoning: A Conceptual Model
In this section, we describe a conceptual model that formalizes ASP-based stream reasoning that processes streams of data into answer sets. We describe: i) a
general abstract architecture of a stream reasoning system, ii) a formal model of data streams, and iii) a formalization of an ASP-based stream reasoner. Abstract Architecture. A stream reasoning system has three main components: the sensor system, the data stream management system (DSMS) [2], and the stream reasoner, as illustrated in Figure 1 (SSN stands for Semantic Sensor Network).
Fig. 1. Simple stream reasoning system
Notation. We introduce the notation which is used in the next two sections.
- dr: the time period between the starting time and the finishing time of a reasoning process, which always terminates.
- ds: the time period between the starting time and the finishing time of a sensor taking a data sample (usually very small).
- Δs: the time period between the two start times of taking two consecutive data samples of a sensor. The sample rate fs is: fs = 1/Δs.
- Δr: the time period between the two start times of two consecutive reasoning processes of the reasoner. The reasoning rate fr is: fr = 1/Δr.
There are two communication strategies between the DSMS and the stream reasoner: push and pull. In the pull method, when the reasoner needs sensor data sample(s), it sends a query to the DSMS, which performs the query and returns the data sample(s) to the reasoner. In the push method, the reasoner registers with the DSMS the sensor name from which it wants to have the data sample. The DSMS returns to the reasoner the data sample whenever it is available. We use the pull method in our prototype to discover the maximum reasoning speed of the reasoner when continuously running as fast as possible. Data Stream Formalization. This section introduces the formalization of the data stream provided to the stream reasoner. The time when a sample is taken is assumed to be very close to the time when that sample is available for reasoning; otherwise the reasoner will give its results with a consistent delay. Definition 1 (Data Stream). A data stream DS is a sequence of sensor data samples di ordered by timestamps. DS = {(t1, dt1), (t2, dt2), . . . , (ti, dti), . . .} where dti is the sensor data sample taken at time ti, and t1 < t2 < . . . < ti < . . .. Definition 2 (Data Window). A data window available at time t, Wt, is a finite subsequence of a data stream DS and has the latest data sample taken at time t. The number of data samples, |Wt|, of this subset is the size of the window. For Wt ⊆ DS, and ts = t: Wt = {(t1, dt1), (t2, dt2), . . . , (ts, dts)} where Wt is the data window at time t, s = |Wt| is the size of the window, t1 < t2 < . . . < ts, ts is the time when the latest sample of the data window is taken, and dti (1 ≤ i ≤ s) is the sensor data sample taken at time ti.
The data window can also be defined by a time period, for example, a data window that includes all data samples taken in the last 10 seconds. Definition 3 (Window Slide Samples). Window slide samples l is the number of samples counted from the latest sample (inclusive) of one data window to the latest sample (exclusive) of the next data window. Definition 4 (Window Slide Time). Given two continuous data windows Wt1 at time t1 and Wt2 at time t2 (t2 ≥ t1), the time period between t1 and t2 is called the window slide time Δw, i.e. we have Δw = t2 − t1. From Definition 3 we can calculate the window slide time with the formula: Δw = l ∗ Δs. When we use the term "window slide", it means window slide samples or window slide time depending on context. Definition 5 (Data Window Stream). Given a data stream DS, a data window stream WS is a sequence of data windows W in time order. WS = {(t1, Wt1), (t2, Wt2), . . . , (ti, Wti), . . .} where Wti is a data window at time ti, t1 < t2 < . . . < ti < . . ., and Wti ⊆ DS. In dlvhex, we use &qW to represent a predicate which queries a data window from a DSMS. This predicate is extended from the external atoms of the dlvhex dl-plugin (http://www.kr.tuwien.ac.at/research/systems/dlvhex/download.html): &qW[|W|, URI, sn](X, V) where &qW is an external predicate that queries a data window from the DSMS, |W| (input) is the window size, URI (input) is a Unique Resource Identifier or the file path of the OWL ontology data source, sn (input) is the name of the sensor providing the data sample, X (output) is the name of the returned instance of the ontology class that describes the sensor, and V (output) is the data sample value returned. ASP-Based Stream Reasoner. This section introduces a formalization of the stream reasoner of a system model that has one data stream and one reasoner. This is easily extendible to models that have: one data stream providing data for multiple reasoners, one reasoner using data from multiple data streams, and many reasoners using data from multiple data streams. Definition 6 (Data Window Reasoner). An ASP-based data window reasoner AWR is a function that maps every data window W ⊆ DS to a set SA of answer sets. AWR : WS → 2^Σ where AWR denotes an ASP-based data window reasoner, WS is the set of all data windows from data stream DS, Σ denotes the set of all possible answer sets S for any input, and 2^Σ is the power set of Σ. The reasoner AWR has input data window W and gives a set SA of answer sets: AWR(W) = SA, where SA = {S1, S2, . . . , Sn}, and Si ∈ Σ (1 ≤ i ≤ n). When using pull communication, the reasoner AWR runs continuously with an interval of Δr, queries input data (not waiting for the data to arrive, as in the push method) from a data window stream WS, and gives a stream of sets of
answer sets SA: &aSR(AWR, WS, Δr) = {SAt1, SAt2, . . . , SAti, . . .} where &aSR is the meta operator (or external predicate in dlvhex) that triggers the reasoner AWR to run continuously, Δr is the interval at which the meta operator &aSR repeatedly executes AWR, and SAti is the set of answer sets output at time ti. From Definition 6, we have AWR(Wt′i) = SAti, where ti is the time when the reasoner gives the output, and t′i is the time when the input data becomes available. The input data was available before the reasoning process, so: t′i < ti − dr. The reasoner uses the latest data window, so: t′i = max(tl), t0 ≤ tl < (ti − dr), where tl (l ≥ 0) is the time when a data window Wtl becomes available.
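As an illustration of the pull strategy and the &aSR meta operator, a minimal Python sketch is given below. It is not the paper's implementation (the paper realizes &aSR with a Unix shell script around dlvhex); run_reasoner is a hypothetical placeholder for the ASP-based window reasoner AWR, and the DSMS is reduced to an in-memory buffer.

import time
from collections import deque

class SimpleDSMS:
    # Toy DSMS: stores timestamped samples and answers pull queries (Definitions 1 and 2).
    def __init__(self, max_samples=1000):
        self.stream = deque(maxlen=max_samples)   # [(t_i, d_ti), ...] in time order

    def push(self, t, sample):
        self.stream.append((t, sample))

    def query_window(self, size):
        # Return the most recent data window W_t with |W_t| = size samples.
        return list(self.stream)[-size:]

def run_reasoner(window):
    # Hypothetical placeholder for the ASP-based window reasoner AWR
    # (in the prototype this is a dlvhex program); returns a set of answer sets.
    total = sum(value for _, value in window)
    return [{"doingRightTurn"}] if total > 277 else [set()]

def asr_loop(dsms, window_size, delta_r, iterations):
    # Pull-based realization of &aSR: every delta_r seconds, pull the latest
    # window and run the reasoner, yielding a stream of sets of answer sets SA.
    for _ in range(iterations):
        window = dsms.query_window(window_size)
        if len(window) == window_size:
            yield run_reasoner(window)
        time.sleep(delta_r)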
3 Prototype Implementation and Experimentation
Prototype Implementation. As a proof-of-concept of the model introduced in Section 2, we built a prototype to detect driving situations of a car travelling in public traffic conditions. The system's main components are illustrated in Figure 2; it uses models of car situations (e.g., turning left or right) as constraints on sensor data values, defined declaratively as dlvhex programs.
Fig. 2. Prototype design
The sensor system is built using the SunSPOT tool kit version 4.0 (http://www.sunspotworld.com). We build a simple ontology called Sensor Ontology in the OWL language. Sensor data is placed in a queue in an OWL file which is fed to RacerPro version 1.9.0 (http://www.racer-systems.com). This setup simulates a DSMS as mentioned in Section 2. The reasoner AWR is built from a dlvhex program (http://www.kr.tuwien.ac.at/research/systems/dlvhex/download.html) and we use a Unix shell script to realize the meta predicate &aSR. The prototype is installed in Ubuntu 9.10, which runs on a Sun VirtualBox version 3.1.6 on our Windows Vista Fujitsu Lifebook T4220. Experimental Setup. We attach the sensor kit in the middle of a car; the sampling rate of the accelerometer was 0.3 s/sample, matching the maximum reasoning speed of the reasoner. AcceX and AcceZ are the horizontal and vertical directions respectively; AcceY is along the forward direction. The AcceX and AcceY values are mapped to the scale [0, 100]. An AcceX data sample of a data window is created as below:
55
To query for a data window, we use the dlvhex atom &dlDR[URI, a, b, c, d, Q](X, Y):

url("../ontology/driving.owl").
acceX1(X,Y) :- &dlDR[U,a,b,c,d,"acceValue"](X,Y), X="sensorOnto:AcceX1", url(U).
We have to query every single data sample in the data window. With our proposed formula (&qW[|W|, URI, sn](X, V)) we only use one rule to obtain the most recent data window (of size five):

acceX(X, Y) :- &qW[5, U, "sensorOnto:AcceX"](X, Y).
The code below, which is a declarative model of a right turn situation, reasons to recognize "right turn" situations. It implements an ASP-based window reasoner defined conceptually earlier.

% right turn:
doingRightTurn :- acceX1(X1,Y1), acceX2(X2,Y2), acceX3(X3,Y3), acceX4(X4,Y4), acceX5(X5,Y5),
                  #int(S1), #int(S2), #int(S3), #int(S4),
                  #int(Y1), #int(Y2), #int(Y3), #int(Y4), #int(Y5),
                  S1=Y1+Y2, S2=S1+Y3, S3=S2+Y4, S4=S3+Y5, S4>277.
The bounds (e.g., 277) were obtained via experiments. Similar rules model other car on-road situations. Because we used a UNIX shell script to trigger the reasoner continuously, the Operating System has to repeatedly load dlvhex, run it, and then unload it. This is resource consuming and can reduce reasoning speed, but it provided a fast, though crude, implementation of the meta-predicate, adequate for reasoning about car on-road behaviours. In dlvhex, using rules with disjunctive heads can give several possible answer sets representing several possible situations given the same sensor data readings.

doingRightTurn v doingLeftTurn :- acceX1(X1,Y1), acceX2(X2,Y2), ....
Results and Evaluation. The maximum reasoning speed of the system is nine (three) times/s without (with) querying ontology data. We tested the system in two running states (normal and delayed) when our car's speed range is 25-50 km/h for straight driving and 25-40 km/h for turning, with turning angles approximately greater than 30°, and with three data window sizes (one, two and five). The system recognizes turning situations at higher accuracy with higher speed. With window size five, the system detected 15 left turns and 15 right turns with no error. With window size two, the system detected 10 left turns and 10 right turns with no error. With window size one, the system is very sensitive and often mis-recognizes because the accelerometer sensor's data fluctuates. When the reasoner ran with a small deliberate delay, with window size five, the system detected 10 left turns and 10 right turns with six errors. With window size two, the system detected eight left turns and eight right turns with seven errors. This result means that using smaller data window sizes makes the system more sensitive and reduces accuracy, but it can more quickly recognize fine-grained situations, as well as different fine-grained car manoeuvres such as start turning, doing
turning, and finish turning. When using larger data window sizes, the system returns more accurate results and deals better with unstable sensor data. Overall, our system detects turning, stopping, going straight and going over a ramp with high accuracy. The bound values in the rules can be adjusted to change the sensitivity of the system. We could use machine learning to process such sensor data, but our aim is to illustrate a simple proof-of-concept of ASP-based stream reasoning where the stream comprises a sequence of time-stamped OWL objects. This prototype suggests the potential for applications that require up to three reasoning processes per second, such as a driving assistant.
4 Conclusion and Future Work
This paper has provided a conceptual model of ASP-based stream reasoning, and showed the feasibility of stream reasoning with dlvhex for semantic sensor-based applications. This project successfully used OWL objects to represent sensor data (which is more general than even time-stamped RDF triples) and utilized dlvhex to reason with this data. Our future work will: (i) implement repeated reasoning within dlvhex programs themselves to improve the system's performance, and (ii) research hybrid ASP-machine learning for stream reasoning.
References 1. Barbieri, D., Braga, D., Ceri, S., Della Valle, E., Grossniklaus, M.: Incremental reasoning on streams and rich background knowledge. In: Aroyo, L., Antoniou, G., Hyv¨ onen, E., ten Teije, A., Stuckenschmidt, H., Cabral, L., Tudorache, T. (eds.) ESWC 2010. LNCS, vol. 6088, pp. 1–15. Springer, Heidelberg (2010) 2. Barbieri, D., Braga, D., Ceri, S., Valle, E.D., Huang, Y., Tresp, V., Rettinger, A., Wermser, H.: Deductive and inductive stream reasoning for semantic social media analytics. IEEE Intelligent Systems 25(6), 32–41 (2010) 3. Barbieri, D.F., Braga, D., Ceri, S., Valle, E.D., Grossniklaus, M.: C-sparql: Sparql for continuous querying. In: Proceedings of the 18th International Conference on World Wide Web, pp. 1061–1062. ACM, New York (2009) 4. Buccafurri, F., Caminiti, G., Rosaci, D.: Perception-dependent reasoning and answer sets (2005), http://www.ing.unife.it/eventi/rcra05/articoli/BuccafurriEtAl.pdf 5. Della Valle, E., Ceri, S., Barbieri, D.F., Braga, D., Campi, A.: A first step towards stream reasoning. In: Domingue, J., Fensel, D., Traverso, P. (eds.) FIS 2008. LNCS, vol. 5468, pp. 72–81. Springer, Heidelberg (2009) 6. Eiter, T., Ianni, G., Schindlauer, R., Tompits, H.: dlvhex: A prover for semantic-web reasoning under the answer-set semantics. In: IEEE / WIC / ACM International Conference on Web Intelligence, pp. 1073–1074 (2006) 7. Mileo, A., Merico, D., Pinardi, S., Bisiani, R.: A logical approach to home healthcare with intelligent sensor-network support. Comput. J. 53, 1257–1276 (2010) 8. Della Valle, E., Ceri, S., van Harmelen, F., Fensel, D.: It’s a streaming world! reasoning upon rapidly changing information. IEEE Intelligent Systems 24(6), 83–89 (2009)
A Markov Decision Process Model for Strategic Decision Making in Sailboat Racing Daniel S. Ferguson and Pantelis Elinas The University of Sydney Australian Centre for Field Robotics Sydney, Australia [email protected]
Abstract. We consider the problem of strategic decision-making for inshore sailboat racing. This sequential decision-making problem is complicated by the yacht's dynamics which prevent it from sailing directly into the wind but allow it to sail close to the wind following a zigzag trajectory towards an upwind race marker. A skipper is faced with the problem of sailing the most direct route to this marker whilst minimizing the number of steering manoeuvres that slow down the boat. In this paper, we present a Decision Theoretic model for this decision-making process assuming a fully observable environment and uncertain boat dynamics. We develop a numerical Velocity Prediction Program (VPP) which allows us to predict the yacht's speed and direction of sail given the wind's strength and direction as well as the yacht's angle of attack with respect to the wind. We specify and solve a Markov Decision Process (MDP) using our VPP to estimate the rewards and transition probabilities. We also present a method for modelling the wind flow around landmasses allowing for the computation of strategies in realistic situations. Finally, we evaluate our approach in simulation showing that we can estimate optimal routes for different kinds of yachts and crew performance.
1 Introduction
Sailing is both a recreational activity enjoyed by many as well as a competitive, team sport. In this paper, we focus on the latter considering the problem of making strategic decisions for inshore yacht racing. Figure 1 shows an example of the course sailed by a yacht (J24 one-design keelboat) in Sydney harbor during a competitive race consisting of upwind and downwind legs. The data was collected using an off-the-shelf GPS device. We can see that for the downwind leg, the yacht can sail on a straight line between the markers. For the upwind leg, the yacht follows a zigzag course because it is constrained in its ability to sail directly into the wind. The closest a yacht can sail into the wind varies depending on its design and it can be as close as 30 degrees. In order to reach the windward mark, the yacht’s skipper must perform a maneuver known as tacking which turns the bow of the boat through the wind
Corresponding author.
Fig. 1. Example trajectory sailed for a race in Sydney harbor. Data collected using a consumer-level GPS sensor.
slowing it down as a result. The goal of a racing boat is to traverse the length of the course as quickly as possible (more accurately, faster than all the other boats) so a skipper must sail where the wind is strongest while minimizing the number of tacking manoeuvres which slow down the boat. On average, the wind's strength and direction are expected to remain constant during the race with temporary fluctuations around a given mean. Wind gusts, i.e., local and temporary wind fluctuations, are common and either help the boat advance forward faster or slow it down; similarly, landmasses affect the wind by changing its direction and strength. In such an environment, skippers are faced with a difficult sequential decision-making problem for which we provide a decision theoretic solution as described in this paper. The rest of this paper is structured as follows. In Section 2, we review previous work on weather routing. In Section 3, we introduce the basics of the physics of sailing and our implementation of a Velocity Prediction Programme (VPP) essential for our MDP model described in Section 4. In Section 5, we introduce a method for modelling the effects of landforms on the wind flow. We evaluate our method by simulating different scenarios in Section 6. We conclude and discuss future work in Section 7.
2 Previous Work
In the past, researchers have placed much emphasis in understanding the physics of sailing and the creation of numerical Velocity Prediction Programs along with weather routing algorithms for offshore sailing. The physics of sailing today are well understood using the theory of fluid mechanics [1]. Based on these physics, over the years several researchers have developed methods for real-time, numerical yacht performance prediction differing only in the initial assumptions and modelling of different yacht parameters [7] [9].
These numerical VPPs have been used to develop race modelling programs (RMPs) for predicting the results of races. For example, Fernandez et. al. [4] present the development of one such RMP that determines routes using graph search. Philpott et. al. [6] develop a model for predicting the outcome of a yacht match race between two yachts. They assume that a fixed strategy is given and focus on the stochastic modelling of wind phenomena and the accurate simulation of yacht dynamics. Another similar approach for match prediction is presented by Roncin et. al. [8]. Stelzer et. al. [10] present a method for short term weather routing for a small robotic boat assuming fixed wind conditions and focused on reaching a given target position with no guarantees for trajectory optimality with respect to minimizing the time to reach the goal as desired in competitive sailing. The work closest to our own is that of Philpott et. al. [5] who also model the stochastic, sequential decision-making problem as a Markov Decision Process but they assume that the boat dynamics are deterministic and focus entirely on modeling wind phenomena without, however, taking into account the effect of landmasses in and around the course.
3 Numerical Velocity Prediction Program (VPP)
A sailboat moves under the influence of wind striking its sails and the water current striking its keel. The physics of sailing are understood well enough to allow the prediction of a yacht's velocity given the wind conditions, i.e., strength and direction, and a considerably large set of yacht parameters including sail area, boat length and shape, keel size and shape. The main principles of lift and drag that explain how a plane flies apply to a sailboat. In part (a) of Figure 2, we show a simplified diagram of the aerodynamic and hydrodynamic forces acting upon a yacht. In this document, we will not provide a detailed derivation of these forces but the reader can find the details in [3],[11]. Briefly, the main aerodynamic forces are given by the following equations. The aerodynamic lift is given by

SFA = (1/2) ρ vAW² Cl    (1)
where ρ is the air density; vAW is the apparent wind speed; and Cl is the lift coefficient which is a function of the angle of attack, heel angle, sail trim and area. The aerodynamic drag is given by the same equation but with Cl replaced by Cd which is the drag coefficient. The induced aerodynamic resistance due to lift is given by

Ri = (SFA / cos φ)² / ((1/2) ρ vAW² A π Re)    (2)
where ρ is the air density; SFA is the aerodynamic side force; φ is the heel angle; vAW is the apparent wind speed; A is the area of the sail; and Re is the effective aspect ratio.
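Purely as an illustration, Equations (1) and (2) translate directly into code; the air density constant below is an assumption, not a value from the paper.

import math

RHO_AIR = 1.225   # kg/m^3; assumed standard air density

def aero_side_force(v_aw, cl):
    # Aerodynamic lift (side force), Eq. (1): 0.5 * rho * v_AW^2 * Cl.
    return 0.5 * RHO_AIR * v_aw ** 2 * cl

def induced_resistance(sf_a, heel_rad, v_aw, sail_area, effective_aspect_ratio):
    # Induced aerodynamic resistance due to lift, Eq. (2).
    numerator = (sf_a / math.cos(heel_rad)) ** 2
    denominator = 0.5 * RHO_AIR * v_aw ** 2 * sail_area * math.pi * effective_aspect_ratio
    return numerator / denominator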
[Figure 2(b) flow chart: yacht parameters and wind conditions feed the aerodynamic and hydrodynamic force calculations; yacht parameters are adjusted until equilibrium is reached; outputs are the yacht velocity and leeway angle.]
Fig. 2. (a) The different forces acting on the yacht including SFH , SFA the hydrodynamic and aerodynamic sideways forces; β the leeway angle; βAW the apparent wind angle; βT W the true wind angle; vAW the boat velocity in the direction of the apparent wind; vT W the boat velocity in the direction of the true wind, and (b) the flow chart for numerical VPP calculations.
There are similar equations for the hydrodynamic forces acting on the yacht’s keel and these can be found in [3],[11]. We note that for the work presented in this paper, we ignore a number of other forces such as wave resistance as well as the effect of other boats sailing in close proximity. Part (b) of Figure 2 shows the basic structure of our numerical VPP. The VPP takes as input the yacht parameters and wind data necessary for evaluating the aerodynamic and hydrodynamic forces given by the equations listed above. Using an iterative procedure, the VPP searches for those values for the yacht speed and leeway angle that perfectly balance all the forces. These estimated values are the VPP output. We emphasize that the output of this program is the maximum speed of the boat for the given conditions; we do not take into account the fact that the boat needs to slowly accelerate to this speed.
4 Markov Decision Process Model
We tackle our sequential decision making problem using a Markov Decision Process (MDP) [2] with discrete states and actions. An MDP is a tuple {S, A, Pr, R}, where S is a finite set of states and A is a finite set of actions. Actions induce stochastic state transitions, with Pr(s, a, s′) denoting the probability with which state s′ is reached when action a is executed at state s. R(s) is a real-valued reward function, associating with each state s its immediate utility R(s). Solving an MDP is finding a mapping from states to actions, π(s). Solutions are evaluated based on an optimality criterion such as the expected total reward.
An optimal solution π∗(s) is one that achieves the maximum over the optimality measure, while an approximate solution comes to within some bound of the maximum. We use the value iteration algorithm to compute an optimal, infinite-horizon policy, with expected total discounted reward as the optimality criterion. In the following sections, we give the details of our sailing MDP model.
4.1 States
For our problem, we have discretized the state space considering 3 features which are the boat's x, y position and tack (port or starboard). We can vary the resolution of the grid to compute policies at different levels of detail trading off between accuracy and computation time.
4.2 Actions
There is only one action available to the skipper and that is changing the boat's tack from port to starboard and vice versa. So, for our MDP model there are 2 actions available at any state, A = {donothing, tack}. Executing a tack action is associated with a penalty because it forces the boat to turn through the wind and slow down.
4.3 Transition Function
During the race, the yacht is always making progress towards the goal (it would make no sense to do otherwise) but its dynamics are uncertain. The MDP transition function capturing the stochastic nature of the yacht's motion in time is the hardest component of our model that we must specify. There are 3 main reasons why the location of the yacht in the future cannot be predicted exactly:
1. The yacht does not move in a straight line in the direction its bow is pointing because of the net sideways force induced by the wind and water current. An approximate value of the leeway angle is one of the two outputs of our VPP described in Section 3.
2. The wind is only on average constant. Over the course, wind gusts affect the boat's ultimate location.
3. We assume that the boat is always sailing at an angle closest to the wind. Realistically, even the most experienced skippers and crews cannot maintain such an accurate heading.
So, we need a way to compute the probability of the boat landing in any one of the squares in the next column in the grid satisfying the above requirements. One might be tempted to use a Gaussian distribution centered around the predicted boat location taking the leeway angle into account. However, this will be incorrect considering that unless a tacking action is performed, the yacht is less likely (because of basic physics) to reach any of the cells on the windward side compared to those on the leeward side of the boat. As a result, we find that the Gamma distribution for estimating transition probabilities is a better choice:
g(τ | α, δ) = (1 / (δ^α Γ(α))) τ^(α−1) e^(−τ/δ)    (3)
where Γ (α) = (α − 1)! and the two parameters α (shape) and δ (scale) have values greater than 0. We can estimate the probability of landing within each grid cell via the cumulative distribution function.
[Plot: Gamma probability density over displacement (m), for leeway angles of 1, 5, 8 and 12 degrees.]
Fig. 3. (a) The Gamma distribution for different values of the leeway angle, and (b) example of the transition probabilities calculated for a leeway angle of 0o and the wind coming from the Southeast at a 45o angle to the boat.
However, in order for us to use this distribution for specifying the transition probabilities of our MDP, we must give functions for determining its two parameters α and δ. The skew of the PDF is determined by the leeway angle of the yacht, β. The shape parameter α relates to the skewness γ as follows:

γ = 1/√α ⇔ α = 1/γ²    (4)

For small leeway angles we wish to have small skew, and to achieve this a large shape variable is required. We would also like skewness to roughly increase linearly with the leeway angle. A formulation that achieves both of these requirements is α = 3 + 1/β². The scale parameter δ is important as it determines the scaling of the PDF on the x and y axis. We require that the distribution has an increasingly large tail for increasingly large leeway angles. The scale parameter δ is a function of the distribution variance σ and the shape parameter α, given by

σ = αδ² ⇔ δ = √(σ/α)    (5)

For our work, we decided on the value of the variance σ empirically, given by

σ = σset + 10 β σset    (6)
where σset = 5 is the variance of the distribution for a zero leeway angle. Examples of the gamma distribution used to compute the transition model for our MDP for different values of the leeway angle are shown in part (a) of Figure 3. Part (b) of Figure 3 shows an example of the estimated transition probabilities for a 0° leeway angle and the wind blowing from the Southeast at a 45° angle to the boat.
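Putting Equations (3)-(6) together, per-cell transition probabilities can be computed from the Gamma cumulative distribution function. The sketch below uses SciPy's gamma distribution; it assumes the leeway angle β is expressed in radians and omits the offset that would centre the distribution on the deterministically predicted cell, both of which are assumptions made only for illustration.

import numpy as np
from scipy.stats import gamma

SIGMA_SET = 5.0   # variance for a zero leeway angle, as in Eq. (6)

def gamma_params(beta):
    # Shape and scale from the leeway angle beta:
    # alpha = 3 + 1/beta^2, sigma = sigma_set + 10*beta*sigma_set, delta = sqrt(sigma/alpha).
    alpha = 3.0 + 1.0 / max(beta, 1e-6) ** 2
    sigma = SIGMA_SET + 10.0 * beta * SIGMA_SET
    delta = np.sqrt(sigma / alpha)
    return alpha, delta

def cell_transition_probs(beta, cell_edges):
    # Probability of landing in each lateral cell, from differences of the Gamma CDF
    # of Eq. (3); cell_edges are increasing displacement boundaries in metres.
    alpha, delta = gamma_params(beta)
    cdf = gamma.cdf(np.asarray(cell_edges, dtype=float), a=alpha, scale=delta)
    probs = np.diff(cdf)
    return probs / probs.sum()   # renormalize over the finite grid

# Usage: probabilities over 25 m cells spanning 0-200 m of lateral displacement.
p = cell_transition_probs(beta=0.05, cell_edges=np.arange(0.0, 225.0, 25.0))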
4.4 Rewards
We set the reward for the goal state to a large positive value. In addition, we specify the reward for each state as a function of the boat's velocity determined using the VPP developed in Section 3 for the given wind conditions. The immediate reward for the boat at location (x, y) and a given tack is given by

R(s) = (vb(s) − vbmax) / vbmax    (7)

where vb(s) gives the boat's predicted speed for state s, and vbmax is an upper limit on the boat speed. This upper limit exists for all displacement boats due to the physics of sailing and it is known as hull speed [1]. We note that the reward function is scaled to give values in the range [−1, 0] such that maximum reward is assigned for those states where the yacht sails the fastest, i.e., closest to vbmax.
5 Modelling the Effects of Landmass on the Wind
The phenomena involved in wind patterns and in modelling these are extremely complex and well beyond the scope of this paper. In general, it is difficult to find wind models for the small scales we are dealing with in inshore sailing, as the majority of weather modelling research goes into large offshore wind patterns. Having said that, we will examine the results of a wind pattern survey around a bay in California [12] and extrapolate some effects from them. Carefully observing the wind data shown in [12], we see that a landmass can be adequately modelled as an ellipse. In addition, there are four important aspects of note in terms of how wind flows around it. These are:
– The wind flows in a continuous manner around the landform, and somewhat mimics its shape.
– The effects of the landmass decrease the further away from it.
– In the region behind the landmass the wind is greatly reduced.
– In the region in front of the landmass wind velocity increases (this effect is not modelled in this paper).
Based on the above observations, we represent landmasses using ellipses, allowing us to model different size landmasses with ease while keeping the calculations of wind flow around them relatively simple. So, a landmass is defined as L = [lx, ly, sx, sy] where (lx, ly) denote the center of the ellipse and (sx, sy) the lengths along the x and y axis respectively.
Fig. 4. Elliptical model of landmass and wind flow around it
First, the region over which the effect of the landmass is seen is set at twice the size of the landmass, that is, sx^eff = 2 sx and sy^eff = 2 sy. Outside this region there is no effect on the wind. Now that the region that is affected by the land is known, we must consider the features of the wind pattern that have been noted, i.e., continuous flow that mimics the shape of the landmass, and a decreasing effect of the landmass on the wind the further away from the land. In order to mimic the landform (i.e. flow around it) we first set the wind in the x direction to arbitrarily be xwind = 1. Now the wind component in the y direction is required, and this is dependent on the shape of the land, as we wish the curvature of the wind field to follow that of the land. As can be seen in Figure 4, the larger the size of the land mass in the y direction (for the same size along x) the larger the y component of wind velocity that is needed. Mathematically, it is ywind ∝ sy/sx. The next feature required is the dependence on distance from the land mass. The further away from the landform, the less it will be affected by it, so the wind in the y direction is related to the distance as ywind ∝ 1.0/d, where d is the radial difference between the current location and the edge of the landform. In order to calculate the difference, we require an equation for the radius of the landform, which is the standard equation for the radius of an ellipse:

rland(ω) = sx sy / √((sx sin ω)² + (sy cos ω)²)    (8)

where ω is the angle between the radius and the x axis. We can now define d(x, y, ω) = √((x − lx)² + (y − ly)²) − rland(ω). Combining all the above relationships, we can derive an equation for estimating the wind strength in the y direction for the affected region, i.e., lx − sx^eff < x < lx + sx^eff, as

ywind(x, y, sx, sy, ω) = (sy/sx) sin(π (x − sx^eff)/sx^eff) / d(x, y, ω)    (9)

and the wind direction is given by

αwind = tan⁻¹(ywind / xwind)    (10)

Figure 6 shows an example of the wind flow around 4 landmasses estimated using our elliptical model.
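Equations (8)-(10) can be coded directly. The following Python sketch mirrors the reconstruction above; in particular, the argument of the sine term and the handling of points outside the affected region follow the simplifications stated in the text and should be treated as illustrative assumptions rather than the authors' exact implementation.

import math

def r_land(sx, sy, omega):
    # Radius of the elliptical landmass at angle omega, Eq. (8).
    return (sx * sy) / math.sqrt((sx * math.sin(omega)) ** 2 + (sy * math.cos(omega)) ** 2)

def wind_direction(x, y, lx, ly, sx, sy):
    # Wind direction alpha_wind (radians from the x axis) at point (x, y) near a
    # landmass centred at (lx, ly) with semi-axes (sx, sy), Eqs. (9) and (10).
    x_wind = 1.0
    sx_eff, sy_eff = 2.0 * sx, 2.0 * sy          # affected region is twice the landmass size
    if not (lx - sx_eff < x < lx + sx_eff):
        return math.atan2(0.0, x_wind)           # outside the affected region: unperturbed
    omega = math.atan2(y - ly, x - lx)
    d = math.hypot(x - lx, y - ly) - r_land(sx, sy, omega)
    if d <= 0.0:
        return 0.0                               # point lies on or inside the landmass
    y_wind = (sy / sx) * math.sin(math.pi * (x - sx_eff) / sx_eff) / d   # Eq. (9)
    return math.atan2(y_wind, x_wind)            # Eq. (10)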
6 Experimental Evaluation
We present simulation results supporting our approach for strategic decision making in yacht racing in Figures 5 and 6 as well as Table 1. Given our model, we know that the policies computed using the value iteration algorithm will be optimal (we used the Matlab MDP toolbox for computing policies, downloaded from http://www.inra.fr/internet/Departements/MIA/T/MDPtoolbox/), so in this section we qualitatively and quantitatively evaluate the computed policies in simulation and discuss the performance of different boats.
[Plot: simulated trajectories on a 30 × 30 grid for tack penalties 0.0, 0.2 and 0.5, with the goal marked and reward shading from 0 down to −1.]
Fig. 5. Example trajectories sailed in simulation for 3 different tack penalties for the same boat. The color of each square indicates the reward for the the boat sailing in that area such that 0 is the highest reward and −1 the lowest as described in Section 4.4. For these experiments the wind is blowing from the East at 18 ± 4 meters per second.
Figure 5 shows the trajectories of a single simulation for the same boat but 3 different penalties for executing the tacking manoeuvre. The results shown were computed for a course size 3000 × 3000 meters at a resolution of 100 × 100 meters per grid cell; for all experiments, policy computation requires less than 10 seconds of compute time on a laptop computer. We generated a random wind pattern with wind strength of 18 ± 4 meters per second from the East. We can see from the figure that when the tacking penalty is large, the optimal policy avoids executing the tacking manoeuvre unless absolutely necessary. For a smaller tacking penalty, the policy tries to take the boat through as many high wind locations as possible in order to minimize the time travelled towards the goal; we note that the boat will sail the fastest where the breeze is the strongest considering that we assume the boat is always pointing at the optimum angle to the wind. We see that when using a reasonable value for penalizing the tacking
manoeuvre, the trajectory sailed resembles that shown earlier in Figure 1, which is based on real data. Table 1 shows the average and standard deviation of the time and number of tacks it takes in each case to traverse the distance to the goal, computed for 1000 simulations over the same course but for a higher grid resolution of 25 × 25 meters. We also consider 2 different boats that can sail at different angles closest to the wind, one being 40° and the other 30°. We can easily compute the time to goal since using our VPP we have an estimate for the boat's top speed as it passes through each of the grid cells.

Table 1. Timing results over 1000 simulations for the same boat and wind data but using 5 different tack penalty values and 2 different angles of attack

                          α = 40°                       α = 30°
Tack penalty    Mean Time (s)   Mean Tacks    Mean Time (s)   Mean Tacks
0               858 ± 8         51 ± 5        806 ± 12        50 ± 6
0.2             886 ± 10        21 ± 4        831 ± 13        14 ± 4
0.5             933 ± 11        8 ± 3         864 ± 14        5 ± 4
0.7             959 ± 14        6 ± 3         868 ± 15        3 ± 2
1.0             995 ± 18        2 ± 1         878 ± 57        2 ± 1
Fig. 6. Simulation result with landforms showing advantageous wind formation due to landform presence allowing for a more direct route towards the goal at (20, 10). The course is 2000 × 2000 meters large with grid resolution of 100 × 100.
As we would expect, the closer a boat can sail to the wind and the more efficiently it can change tacks, the faster it can traverse the length of the course. However, we can also see from the second and fourth rows of Table 1 that
a boat that is not capable of sailing close to the wind but can perform steering manoeuvres efficiently can be as competitive as a boat that can sail much closer to the wind but is inefficient in its steering. Finally, in Figure 6 we show a trajectory sailed by a boat for an experiment involving landmasses. The size of this course is 2000 × 2000 meters and the resolution of the grid is 100 × 100 meters per grid cell. The yacht can sail at 40° closest to the wind. We want to point out that the landmass warps the wind such that the yacht can sail a more direct route to the goal, as can be clearly seen. Most experienced sailors know how land affects the direction and strength of the prevailing wind and try to take advantage during the race; we see that our automated system, with proper modelling of the effect of land on the wind, can do the same, at least at a basic level.
7 Conclusions and Future Work
In this paper, we presented our method for strategic decision making for inshore yacht racing. We started with an introduction of the physics of sailing and the development of a numerical Velocity Prediction Program (VPP). We also presented a way for modelling wind flow around landforms. Finally, we modeled the sequential decision-making problem using a Markov Decision Process which we solved using value iteration and evaluated in simulation for different yacht parameters. Our model allowed us to compare boats that can sail at different angles closest to the wind as well as different crew performances in the execution of the most important steering manoeuvre known as tacking. We have also shown that when we have knowledge of how the wind flows around a landmass, our model can correctly decide on routes that take advantage of the situation. In future work we would like to develop a more accurate method for velocity prediction perhaps not based on numerical estimation but derived from real data collected by sailing a yacht in different weather conditions. Our strategic decision making approach would be useful to crews for making plans before the race, but since the weather conditions can vary during the race, it would be desirable to augment our system with the ability to revise plans accordingly using data gathered during the competition. Lastly, we would like to test our approach in a real yacht race and also extend it to offshore sailing in which case we would have to put much emphasis in more accurately modelling large wind patterns over time.
Acknowledgements The authors would like to thank the reviewers for their many valuable comments and suggestions. This work is supported by the ARC Centre of Excellence programme funded by the Australian Research Council (ARC) and the New South Wales Government.
References 1. Anderson, B.D.: The physics of sailing. Physics Today 61(2), 38–43 (2008) 2. Boutilier, C., Dean, T., Hanks, S.: Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research 11, 1–94 (1999) 3. Ferguson, D.S.: Strategic decision making in yacht racing. Master’s thesis, The University of Sydney, Sydney, Australia (2010) 4. Fernandez, A., Valls, A., Garcia-Espinosa, J.: Stochastic optimization of IACC yacht performance. In: International Symposium on Yacht Design and Production, pp. 69–78 (2004) 5. Philpott, A., Mason, A.: Optimising yacht routes under uncertainty. In: In Proc. of the 15th Chesapeake Sailing Yacht Symposium, Annapolis, MD (2001) 6. Philpott, A.B., Henderson, S.G., Teirney, D.: A simulation model for predicting yacht match race outcomes. Oper. Res. 52, 1–16 (2004) 7. Philpott, A.B., Sullivan, R.M., Jackson, P.S.: Yacht velocity prediction using mathematical programming. European Journal of Operational Research 67(1), 13–24 (1993) 8. Roncin, K., Kobus, J.: Dynamic simulation of two sailing boats in match racing. Sports Engineering 7, 139–152 (2004), 10.1007/BF02844052 9. Roux, Y., Huberson, S., Hauville, F., Boin, J., Guilbaud, M., Ba, M.: Yacht performance prediction: Towards a numerical VPP. In: High Performance Yacht Design Conference, Auckland, New Zealand (December 2002) 10. Stelzer, R., Pr¨ oll, T.: Autonomous sailboat navigation for short course racing. Robot. Auton. Syst. 56, 604–614 (2008) 11. van Oossanen, P.: Predicting the speed of sailing yachts. SNAME Transactions 101, 337–397 (1993) 12. Vesecky, J.F., Drake, J., Laws, K., Ludwig, F.L., Teague, C.C., Paduan, J.D., Sinton, D.: Measurements of eddies in the ocean surface wind field by a mix of single and multiple-frequency hf radars on monterey bay california. In: IEEE Int. Geoscience and Remote Sensing Symposium, pp. 3269–3272 (July 2007)
Exploiting Conversational Features to Detect High-Quality Blog Comments Nicholas FitzGerald, Giuseppe Carenini, Gabriel Murray, and Shafiq Joty University of British Columbia {nfitz,carenini,gabrielm,rjoty}@cs.ubc.ca
Abstract. In this work, we present a method for classifying the quality of blog comments using Linear-Chain Conditional Random Fields (CRFs). This approach is found to yield high accuracy on binary classification of high-quality comments, with conversational features contributing strongly to the accuracy. We also present a new corpus of blog data in conversational form, complete with user-generated quality moderation labels from the science and technology news blog Slashdot.
1 Introduction and Background
As the amount of content available on the Internet continues to increase exponentially, the need for tools which can analyze and summarize large amounts of text has become increasingly pronounced. Traditionally, most work on automatic summarization has focused on extractive methods, where representative sentences are chosen from the input corpus ([5]). In contrast, recent work (e.g. [10], [2]) has taken an abstractive approach, where information is first extracted from the input corpus, and then expressed through novel sentences created with Natural Language Generation techniques. This approach, though more difficult, has been shown to produce superior summaries in terms of readability and coherence. Several recent works have focused on summarization of multi-participant conversations ([9], [10]). [10] describes an abstractive summarization system for face-to-face meeting transcripts. The approach is to use a series of classifiers to identify different types of messages in the transcripts; for example, utterances expressing a decision being made, or a positive opinion being expressed. The summarizer then selects a set of messages which maximize a function encompassing information about the sentences in which messages appear, and passes these messages to the NLG system. In this paper, we present our work on detecting high-quality comments in blogs using CRFs. In future work, this will be combined with classification on other axes, for instance that of the message's rhetorical role (i.e. Question, Response, Criticism etc.), to provide the messages for an abstractive summarization system. CRFs ([7]) are a discriminative probabilistic model which have gained much popularity in Natural Language Processing and Bio-informatics applications.
One benefit of using linear chain CRFs over more traditional linear classification algorithms is that the sequence of labels is considered. Several works have shown the effectiveness of CRFs on similar Natural Language Processing tasks which involve sequential dependencies ([1], [4]). [11] uses Linear-Chain CRFs to classify summary sentences to create extractive summaries of news articles, showing their effectiveness on this task. [6] test CRFs against two other classifiers (Support Vector Machines and Naive-Bayes) on the task of classifying dialogue acts in livechat conversations. They also show the usefulness of structural features, which are similar to our conversational features (see Sect. 2.3).
2 Automatic Comment Rating System
2.1 The Slashdot Corpus
We compiled a new corpus comprised of articles and their subsequent user comments from the science and technology news aggregation website Slashdot (http://slashdot.org). This site was chosen for several reasons. Comments on Slashdot are moderated by users of the site, meaning that each comment has a score from -1 to +5 indicating the total score of moderations assigned, with each moderator able to modify the score of a given comment by +1 or -1. Furthermore, each moderation assigns a classification to the comment: for good comments, the classes are: Interesting, Insightful, Informative and Funny. For bad comments, the classes are: Flamebait, Troll, Off-Topic and Redundant. Since the goal of this work was to identify high-quality comments, most of our experiments were conducted with comments grouped into GOOD and BAD. Slashdot comments are displayed in a threaded conversation-tree type layout. Users can directly reply to a given comment, and their reply will be placed underneath that comment in a nested structure. This conversational structure allows us to use Conversational Features in our classification approach (see Sect. 2.3). Some comments were not successfully crawled, which meant that some comments in the corpus referred to parent comments which had not been collected. In order to prevent this, comments whose parents were missing were excluded from the corpus. After this cleanup, the collection totalled 425,853 comments on 4320 articles.
2.2 Transformation into Sequences
As mentioned above, Slashdot commenters can reply directly to other comments, forming several tree-like conversations for each article. This creates a problem for our use of Linear-Chain CRFs, which require linear sequences. In order to solve this problem, each conversation tree is transformed into multiple Threads, one for each leaf-comment in the tree. The Thread is the sequence of comments from the root comment to the leaf comment. Each Thread
is then treated as a separate sequence by the classifier. One consequence of this is that any comment with more than one reply will occur multiple times in the training or testing set. This makes some intuitive sense for training, as comments higher in the conversation tree are likely more important to the conversation as a whole: the earlier a comment appears in the thread, the greater effect it has on the course of conversation down-thread. We describe the process of re-merging these comment threads, and investigate the effect this has on accuracy, in Sect. 3.3.
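The decomposition of a conversation tree into root-to-leaf Threads can be written as a short traversal; the Comment class below is a hypothetical stand-in for the corpus representation.

class Comment:
    def __init__(self, cid, text, label=None):
        self.cid, self.text, self.label = cid, text, label
        self.replies = []        # child comments, i.e. direct replies

def threads(root):
    # Return every root-to-leaf Thread of a conversation tree as a list of comments.
    if not root.replies:
        return [[root]]
    result = []
    for child in root.replies:
        for tail in threads(child):
            result.append([root] + tail)
    return result

# Usage on a tiny synthetic tree.
a = Comment(1, "top-level comment")
b, c = Comment(2, "reply"), Comment(3, "another reply")
a.replies = [b, c]
b.replies = [Comment(4, "nested reply")]
print([[cm.cid for cm in th] for th in threads(a)])   # [[1, 2, 4], [1, 3]]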
2.3 Features
Each comment in a given sequence was represented as a series of features. In addition to simple unigram (bag-of-words) features, we experimented with two other classes of features: lexical similarity and conversational features. These are described below.
Similarity Features. Three features were used which capture the lexical similarity between two comments: TF-IDF, LSA ([5]) and Lexical Cohesion ([3]). For each comment, each of these three scores was calculated for both the preceding and following comment (0 if there was no comment before or after), giving a total of six similarity features. These features were previously shown in [12] to be useful in the task of topic-modelling in email conversations. However, in contrast to [12], where similarity was calculated between sentences, these metrics were adapted to calculate similarity between entire comments.
Conversational Features. The conversational features capture information about how the comment is situated in the conversation as a whole. The list is as follows:
- ThreadIndex: the index of the comment in the current thread (starting at 0).
- NumReplies: the number of child comments replying to this comment.
- WordLength and SentenceLength: the length of this comment in words and sentences, respectively.
- AvgReplyWordLength and AvgReplySentLength: the average length of replies to this comment in words and sentences.
- TotalReplyWordLength and TotalReplySentLength: the total length of all replies to this comment in words and sentences.
2.4 Training
The popular Natural Language Machine Learning toolkit MALLET (http://mallet.cs.umass.edu/index.php) was used to train the CRF model. A 1000-article subset of the entire Slashdot corpus was divided 90%-10% between the training and testing set. The training set consisted of 93,841 Threads from 900 articles, while the testing set consisted of 10,053 Threads from 100 articles.
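The paper trains its model with MALLET. Purely as an illustration of the same setup in Python, a linear-chain CRF over Thread sequences could be trained with the sklearn-crfsuite package (a different toolkit); the feature names and the tiny synthetic Threads below are placeholders, not the corpus format.

import sklearn_crfsuite
from sklearn_crfsuite import metrics

def comment_features(comment):
    # Hypothetical per-comment feature dictionary (cf. Sect. 2.3).
    return {"thread_index": comment["thread_index"],
            "num_replies": comment["num_replies"],
            "word_length": comment["word_length"]}

def make_thread(labels):
    # Tiny synthetic Thread; real features would come from the Slashdot corpus.
    return [{"thread_index": i, "num_replies": len(labels) - i - 1,
             "word_length": 20 + 5 * i, "label": lab} for i, lab in enumerate(labels)]

train_threads = [make_thread(["GOOD", "GOOD", "BAD"]), make_thread(["BAD", "GOOD"])]
test_threads = [make_thread(["GOOD", "BAD"])]

def to_sequences(threads):
    X = [[comment_features(c) for c in thread] for thread in threads]
    y = [[c["label"] for c in thread] for thread in threads]
    return X, y

X_train, y_train = to_sequences(train_threads)
X_test, y_test = to_sequences(test_threads)

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(metrics.flat_classification_report(y_test, crf.predict(X_test), labels=["GOOD", "BAD"]))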
Table 1. (a) Confusion matrix for binary classification of comment threads. (b) Results of feature analysis on the 3 feature classes. (c) Confusion matrix for re-merged comment threads.

(a)
           BAD    GOOD
  BAD     5991    1965
  GOOD    1426    8814
  P: 0.818   R: 0.861   F: 0.839

(b)
  Features        P       R       F
  all good        0.563   1.000   0.720
  uni             0.708   0.699   0.703
  sim             0.802   0.900   0.848
  conv            0.818   0.855   0.836
  uni sim         0.780   0.847   0.812
  uni conv        0.818   0.855   0.836
  sim conv        0.818   0.855   0.836
  uni sim conv    0.818   0.855   0.836

(c)
           BAD    GOOD
  BAD     4160     467
  GOOD     862    1090
  P: 0.700   R: 0.558   F: 0.621
3 Experimental Results
3.1 Classification
Experiment 1 was to train the CRF using data where the full set of moderation labels had been grouped into GOOD comments and BAD. The Conditional Random Field classifier was trained on the full set of features presented in Sect. 2.3. The confusion matrix for this experiment is presented in Table 1a. We can see that the CRF performs well on this formulation of the task, with a precision of 0.818 and recall of 0.839. This compares very favourably to a baseline of assigning GOOD to all comments, which yields a precision score of 0.563. The CRF result also performs favourably against a non-sequential Support Vector Machine classifier (P = .799, R = .773), which confirms the existence of sequential dependencies in this problem.
3.2 Feature Analysis
To investigate the relative importance of the 3 types of features (unigrams, similarity, and conversational) we experiment with training the classifier with different groupings of features. The results of this feature analysis are presented in Table 1b. All three sets of features can provide relatively good results by themselves, but the similarity and conversational features greatly out-perform the unigram features. Similarity features have a slight edge in terms of recall and f-score, while the Conversational features provide the edge in precision, seeming to dominate Similarity features when both are used. In fact, the results of this analysis seem to show that whenever the conversational features are used, they dominate the effect of the other features, since all sets of features which include Conversational features have the same results as using the Conversational features alone (these results were not identical, though close enough that precision, recall, and f-score were identical to the third decimal point). This would seem to indicate that the most relevant factors in deciding the quality of a given comment are conversational in nature, including the number of replies it receives and the nature of those replies. This effect could be reinforced by the fact that comments which have previously been moderated as GOOD are more likely to be read by future readers, which will naturally increase the number of comments they receive in reply. However, since the unigram- and, more notably, similarity-features can still perform quite well without use of the conversational features, our method is not overly-dependent on this effect.
3.3 Re-merging Conversation Trees
As described in Sect. 2.2, conversation trees were decomposed into multiple threads in order to cast the problem in the form of sequence labelling. The result of this is that after classification, each non-leaf comment has been classified multiple times, equal to the number of sub-comments of that comment. These different classifications need not be the same, i.e. a given comment might well have been classified as GOOD in one sequence and BAD in another. We next recombined these sequences, such that there is only one classification per comment. Comments which appeared in multiple sequences, and thus received multiple classifications, were marked GOOD if they were classified as GOOD at least once (GOOD if |{ci ∈ C : ci = good}| ≥ 1), where C is the set of classifications of comment i. (This was compared to similar metrics such as a majority-vote metric, GOOD if |{ci ∈ C : ci = good}| ≥ |{ci ∈ C : ci = bad}|, and performed the best, though the difference was negligible.) There are two ways to evaluate the merged classifications. The first way is to reassign the newly-merged classifications back onto the thread sequences. This preserves the proportions of observations in the original experiments, which allows us to determine whether merging has affected the accuracy of classification. Doing so showed that there was no significant effect on the performance of the classifier; precision and recall remained .818 and .861, respectively. The other method is to look at the comment-level accuracy. This removes duplicates from the data, and gives the overall accuracy for determining the classification of a given comment. The results of this are given in Table 1c. The precision and recall in this measure are significantly lower than in the thread-based measure, which indicates that the classification of "leaf" comments tended to be less accurate than that of non-leaf comments which subsequently appeared in more than one thread. The precision of .700 is still much greater than the baseline of assigning GOOD to all comments, which would yield a precision of .297. This indicates that our approach can successfully identify good comments.
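The merging rule can be stated in a few lines; the per-thread prediction format used here is a hypothetical stand-in.

from collections import defaultdict

def merge_thread_predictions(thread_predictions):
    # Collapse multiple per-thread labels into one label per comment: a comment is
    # GOOD if it was classified GOOD in at least one thread it appears in.
    votes = defaultdict(list)
    for thread in thread_predictions:                 # each thread: [(comment_id, label), ...]
        for comment_id, label in thread:
            votes[comment_id].append(label)
    return {cid: ("GOOD" if "GOOD" in labels else "BAD") for cid, labels in votes.items()}

# Usage.
predicted = [[(1, "GOOD"), (2, "BAD"), (4, "GOOD")], [(1, "BAD"), (3, "BAD")]]
print(merge_thread_predictions(predicted))   # {1: 'GOOD', 2: 'BAD', 4: 'GOOD', 3: 'BAD'}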
4 Conclusion and Future Work
In this work, we have presented an approach to identifying high-quality comments in blog comment conversations. By casting the problem as one of binary classification, and applying sequence tagging by way of a Linear-Chain Conditional Random Field, we were able to achieve high accuracy. Also presented was a new corpus of blog comments, which will be useful for future research. Future work will focus on refining our ability to classify comments, and incorporating this into an abstractive summarization system. In order to be useful for this task, it would be preferable to have finer-grained classification than just GOOD and BAD. Applying our current method to the full range of Slashdot moderation classes yielded low accuracy5. Future work will attempt to address these issues.
References 1. Chung, G.: Sentence retrieval for abstracts of randomized controlled trials. In: BMC Medical Informatics and Decision Making, vol. 9, p. 10 (2009) 2. FitzGerald, N., Carenini, G., Ng, R.: ASSESS: Abstractive Summarization System for Evaluative Statement Summarization (extended abstract), The Pacific Northwest Regional NLP Workshop (NW-NLP), Redmond (2010) 3. Galley, M., McKeown, K., Fosler-Lussier, E., Jing, H.: Discourse segmentation of multi-party conversation. In: 41st Annual Meeting on Association for Computational Linguistics, Stroudsburg, vol. 1 (2003) 4. Hirohata, K., Okazaki, N., Ananiadou, S., Ishizuka, M.: Identifying Sections in Scientific Abstracts using Conditional Random Fields. In: Third International Joint Conference on Natural Language Processing, Hyderabad, pp. 381–388 (2008) 5. Jurafsky, D., Martin, J.: Speech and Language Processing: an Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Prentice Hall, Upper Saddle River (2009) 6. Kim, S., Cavedon, L., Baldwin, T.: Classifying dialogue acts in one-on-one live chats. In: 2010 Conference on Empirical Methods in Natural Language Processing Cambridge (2010) 7. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proc. 18th International Conf. on Machine Learning, pp. 282–289. Morgan Kaufmann, San Francisco (2001) 8. McCallum, A.: MALLET: A Machine Learning for Language Toolkit, http://mallet.cs.umass.edu 9. Murray, G., Carenini, G.: Summarizing Spoken and Written Conversations. In: 2008 Conference on Empirical Methods in Natural Language Processing, Waikiki (2008) 10. Murray, G., Carenini, G., Ng, R.: Generating Abstracts of Meeting Conversations: A User Study. In: International Conference on Natural Language Generation (2010) 11. Shen, D., Sun, J., Li, H., Yang, Q., Chen, Z.: Document Summarization using Conditional Random Fields. In: International Joint Conferences on Artificial Intelligence (2007) 12. Joty, S., Carenini, G., Murray, G., Ng, R.: Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails. In: The Conference on Empirical Methods in Natural Language Processing, Cambridge (2010) 5
A longer version of this paper with the report of this experiment is available from the first author’s website.
Consolidation Using Context-Sensitive Multiple Task Learning Ben Fowler and Daniel L. Silver Jodrey School of Computer Science Acadia University Wolfville, NS, Canada B4P 2R6 [email protected]
Abstract. Machine lifelong learning (ML3) is concerned with machines capable of learning and retaining knowledge over time, and exploiting this knowledge to assist new learning. An ML3 system must accurately retain knowledge of prior tasks while consolidating in knowledge of new tasks, overcoming the stability-plasticity problem. A system is presented using a context-sensitive multiple task learning (csMTL) neural network. csMTL uses a single output and additional context inputs for associating examples with tasks. A csMTL-based ML3 system is analyzed empirically using synthetic and real domains. The experiments focus on the effective retention and consolidation of task knowledge using both functional and representational transfer. The results indicate that combining the two methods of transfer serves best to retain prior knowledge, but at the cost of less effective new task consolidation.
1
Introduction
Machine lifelong learning, or ML3, a relatively new area of machine learning research, is concerned with the persistent and cumulative nature of learning [12]. Lifelong learning considers situations in which a learner faces a series of different tasks and develops methods of retaining and using prior knowledge to improve the effectiveness (more accurate hypotheses) and efficiency (shorter training times) of learning. We focus on the learning of concept tasks, where the target value for each example is either zero or one. An ML3 system requires a method of using prior knowledge to learn models for new tasks as efficiently and effectively as possible, and a method of consolidating new task knowledge after it has been learned. Consolidation is the act of saving knowledge, in one data store, in an integrated form such that it can be indexed efficiently and effectively. The challenge for an ML3 system is consolidating the knowledge of a new task while retaining and possibly improving knowledge of prior tasks. This challenge is generally known in machine learning, cognitive science and psychology as the stability-plasticity problem[5]. This paper addresses the problem of knowledge consolidation and, therefore the stability-plasticity problem, within a machine lifelong-learning system using a modified multiple-task learning (MTL) neural network. The system uses C. Butz and P. Lingras (Eds.): Canadian AI 2011, LNAI 6657, pp. 128–139, 2011. c Springer-Verlag Berlin Heidelberg 2011
a context-sensitive multiple task learning (csMTL) network as a consolidated domain knowledge (CDK) store. This research extends the work presented in [11], in which a csMTL network is demonstrated as an effective method of using prior knowledge to learn models for new tasks. The goal is to demonstrate that a csMTL network can retain previous task knowledge when consolidating a sequence of tasks from a domain. This will establish the usefulness of csMTL as part of an ML3. This requires showing that an appropriate network structure and set of learning parameters can be found to allow the method to scale with increasing numbers of tasks. The paper has four remaining sections. Section 2 provides the necessary background information on MTL, csMTL and ML3. Section 3 presents a ML3 system based on a csMTL network and discusses its benefits and limitations. Section 4 provides the results of empirical studies that test the proposed system. Finally, Section 5 concludes and summarizes the paper.
2
Background
2.1
Inductive Transfer
Typically in machine learning, when a new task is learned, we ignore any previously acquired and related knowledge, and instead start learning from a fresh set of examples. Using previously acquired knowledge is called inductive transfer [2]. Humans generally make use of previously learned tasks to help them learn new ones. Similarly, previously acquired knowledge should be used to bias learning, in order to make machine learning more efficient or more effective. There are two basic forms of inductive transfer: functional transfer and representational transfer [8]. Functional transfer is when information from previously learned tasks is transferred through implicit pressures from training examples. Representational transfer is when information from previously learned tasks is transferred directly through explicit assignment of task representation (such as neural network weights). 2.2
Limitations of MTL for Machine Lifelong Learning
Multiple task learning (MTL) neural networks are one of the better documented methods of inductive transfer of task knowledge [2,8]. An MTL network is a feed-forward multi-layer network with an output for each task that is to be learned. The standard back-propagation of error learning algorithm is used to train all tasks in parallel. Consequently, MTL training examples are composed of a set of input attributes and a target output for each task. The sharing of internal representation is the method by which inductive bias occurs within an MTL network [1]. The more that tasks are related, the more they will share representation and create positive inductive bias. Let X be a set on ℝⁿ (the reals), Y the set {0, 1}, and error a function that measures the difference between the expected target output and the actual output of the network for an example. MTL can be defined as learning a set of target concepts f = {f1, f2, . . . , fk} such that each fi : X → Y with a probability distribution Pi over X × Y. We assume that the environment delivers each fi based on a probability distribution Q over all Pi. Q is meant to capture regularity in the environment that constrains the number of tasks that the learning algorithm will encounter. Q therefore characterizes the domain of tasks to be learned. An example for MTL is of the form (x, f(x)), where x is the same as defined for STL and f(x) = {fi(x)}, a set of target outputs. A training set S_MTL consists of all available examples, S_MTL = {(x, f(x))}. The objective of the MTL algorithm is to find a set of hypotheses h = {h1, h2, . . . , hk} within its hypothesis space H_MTL that minimizes the objective function Σ_{x∈S_MTL} Σ_{i=1..k} error[fi(x), hi(x)]. The assumption is that H_MTL contains sufficiently accurate hi for each fi being learned. Typically |H_MTL| > |H_STL| in order to represent the multiple hypotheses. Previously, we have investigated the use of MTL networks as a basis for an ML3 system and have found them to have several limitations related to the multiple outputs of the network [9,10]. First, MTL requires that training examples contain corresponding target values for each task; this is impractical for lifelong learning systems as examples of each task are acquired at different times and with unique combinations of input values. Second, with MTL, shared representation and therefore transfer is limited to the hidden node layers and not the output nodes. Third, there is the practical problem of how an MTL-based ML3 system would know to associate an example with a particular task. Clearly, the learning environment should provide the contextual cues; however, this suggests additional inputs, not outputs. Finally, a lifelong learning system should be capable of practising two or more tasks and improving its models for each as new examples become available. It is unclear how redundant task outputs for the same task could be avoided using an ML3 system based on MTL. In response to these problems, we developed context-sensitive MTL, or csMTL [11]. csMTL is based on MTL with two major differences: only one output is used for all tasks and additional inputs are used to indicate the example context, such as the task to which it is associated. 2.3
csMTL
Figure 1 presents the csMTL network. It is a feed-forward network architecture of input, hidden and output nodes that uses the back-propagation of error training algorithm. The csMTL network requires only one output node for learning multiple concept tasks. Similar to standard MTL neural networks, there are one or more layers of hidden nodes that act as feature detectors. The input layer can be divided into two parts: a set of primary input variables for the tasks and a set of inputs that provide the network with the context of each training example. The context inputs can simply be a set of task identifiers that associate each training example to a particular task. Related work on context-sensitive machine learning can be found in [13]. Formally, let C be a set on ℝⁿ representing the context of the primary inputs from X as described for MTL. Let c be a particular example of this set, where c is a vector containing the values c1, c2, . . . , ck, where ci = 1 indicates that the example is associated with function fi. csMTL can be defined as learning a target concept f : C × X → Y, with a probability distribution P on C × X × Y, where P is constrained by the probability distributions Pi and Q discussed in the previous section for MTL. An example for csMTL takes the form (c, x, f(c, x)), where f(c, x) = fi(x) when ci = 1 and fi(x) is the target output for task fi. A training set S_csMTL consists of all available examples for all tasks, S_csMTL = {(c, x, f(c, x))}. The objective of the csMTL algorithm is to find a hypothesis h within its hypothesis space H_csMTL that minimizes the objective function Σ_{(c,x)∈S_csMTL} error[f(c, x), h(c, x)]. The assumption is that H_csMTL ⊂ {f | f : C × X → Y} contains a sufficiently accurate h. Typically, |H_csMTL| = |H_MTL| for the same set of tasks because the number of additional context inputs under csMTL matches the number of additional task outputs under MTL. With csMTL, the entire representation of the network is used to develop hypotheses for all tasks of the domain. The focus shifts from learning a subset of shared representation for multiple tasks to learning a completely shared representation for the same tasks. This presents a more continuous sense of domain knowledge and the objective becomes that of learning internal representations that are helpful to predicting the output of similar combinations of the primary and context input values. Once f is learned, if x is held constant, c indexes over the hypothesis base H_csMTL. If c is a vector of real-valued inputs from the environment, it provides a grounded sense of task relatedness. If c is a set of task identifiers, it differentiates between otherwise conflicting examples and selects internal representation used by related tasks. In the following section we propose how csMTL can be used to overcome the limitations of MTL for construction of an ML3 system. The proposed ML3 is described so as to provide motivation for and useful characteristics of csMTL.
Fig. 1. Proposed system: csMTL
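To make the two encodings concrete, the following is a small illustrative sketch (our own, not taken from the system described here) of how a single observation is represented: under MTL one example carries a target for every task, while under csMTL it is expanded into one single-target example per task, with a one-hot task-identifier context prepended to the primary inputs.

```python
def mtl_example(x, targets):
    """MTL: one example carries a target for every task: (x, [f1(x), ..., fk(x)])."""
    return (x, targets)

def csmtl_examples(x, targets):
    """csMTL: one example per task, with a one-hot context c prepended
    and a single target f(c, x) = fi(x) when ci = 1."""
    k = len(targets)
    examples = []
    for i, y in enumerate(targets):
        c = [1.0 if j == i else 0.0 for j in range(k)]
        examples.append((c + list(x), y))
    return examples

x = [0.7, 0.2, 0.9]                   # primary inputs
targets = [1, 0, 1]                   # targets for tasks f1, f2, f3 on x
print(mtl_example(x, targets))        # ([0.7, 0.2, 0.9], [1, 0, 1])
for ex in csmtl_examples(x, targets): # three single-output examples
    print(ex)
```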
3
Machine Lifelong Learning with csMTL Networks
Figure 2 shows the proposed csMTL ML3 system. It has two components: a temporary short-term learning network and a permanent long-term consolidation
csMTL network. The long-term csMTL network is the location in which domain knowledge is retained over the lifetime of the learning system. The weights of this network are updated only after a new task has been trained to an acceptable level of accuracy in the short-term learning network. The short-term network can be considered a temporary extension of the long-term network that adds representation (several hidden nodes and a output node, fully feed-forward connected) that may be needed to learn the new task. At the start of short-term learning the weights associated with these temporary nodes are initialized to small random weights while the weights of the long-term network are frozen. This allows representational knowledge to be rapidly transferred from related tasks existing in the long-term network without fear of losing prior task accuracies. Once the new task has been learned, the temporary short-term network is used to consolidate knowledge of the task into the permanent long-term csMTL network. This is accomplished by using a form of functional transfer called task rehearsal [9]. The method uses the short-term network to generate virtual examples for the new tasks so as to slowly integrate (via back-propagation) the task’s knowledge into the long-term network. Additionally, virtual examples for the prior tasks are used during consolidation to maintain the existing knowledge of the long-term network. Note that it is the functional knowledge of the prior tasks that must be retained and not their representation; the internal representation of the long-term network will necessarily change to accommodate the consolidation of the new task. The focus of this paper and the experiments in Section 4 is on the long-term network and the challenge of consolidation. The following discusses the benefits and limitations of the csMTL method as a long-term consolidation network for an ML3 system. 3.1
Long-Term Retention of Learned Knowledge
Knowledge retention in a MTL network is the result of consolidation of new and prior task knowledge using task rehearsal [9]. Task rehearsal overcomes the stability-plasticity problem originally posed by [5] taken to the level of learning sets of tasks as opposed to learning sets of examples [6,4]. A plastic network is one that can accommodate new knowledge. A stable network is one that can accurately retain old knowledge. The secret to maintaining a stable and yet plastic network is to use functional transfer from prior tasks to maintain stable function while allowing the underlying representation to slowly change to accommodate the learning of the new task. Prior work has shown that consolidation of new task knowledge, within an MTL network, without loss of existing task knowledge is possible given: sufficient number of training examples, sufficient internal representation for all tasks, slow training using a small learning rate and a method of early stopping to prevent over-fitting and therefore the growth of high magnitude weights [10]. The same is expected with a csMTL network. In the long-term csMTL network there will be an effective and efficient sharing of internal representation between related tasks, without the MTL disadvantage of having redundant outputs for near identical tasks. Over time, practice sessions
Fig. 2. Proposed system: csMTL
for the same task will contribute to the development of a more accurate long-term hypothesis. In fact, the long-term csMTL network can represent a fluid domain of tasks where subtle differences between tasks can be represented by small changes in the context inputs. The csMTL ML3 approach does have its limitations. It suffers from the scaling problems of similar neural network systems. The computational complexity of the standard back-propagation algorithm is O(W³), where W is the number of weights in the network. Long-term consolidation will be computationally more expensive than standard MTL because the additional contextual inputs will increase the number of weights in the network at the same rate as MTL and it may be necessary to add an additional layer of hidden nodes for certain task domains. The rehearsal of each of the existing domain knowledge tasks requires the creation and training of m · k virtual examples, where m is the number of virtual training examples per task and k is the number of tasks. An important benefit from consolidation is an increase in the accuracy of related hypotheses existing in the csMTL network as a new task is integrated.
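The consolidation procedure of Section 3 can be outlined schematically as below (a sketch under the assumption of placeholder predict functions standing in for the trained short-term and long-term networks; the back-propagation training step itself is omitted): virtual targets for the new task come from the short-term network, virtual targets for prior tasks come from the frozen long-term network, and the combined set is used to retrain the long-term network.

```python
import random

def make_virtual_examples(predict, task_id, n_tasks, n_inputs, m):
    """Generate m virtual examples for one task: random primary inputs x,
    a one-hot context c for the task, and a target produced by `predict`."""
    examples = []
    for _ in range(m):
        x = [random.random() for _ in range(n_inputs)]
        c = [1.0 if j == task_id else 0.0 for j in range(n_tasks)]
        examples.append((c + x, predict(c + x)))
    return examples

def consolidate(long_term_predict, short_term_predict, new_task_id,
                n_tasks, n_inputs, m):
    """Build the consolidation training set: virtual examples for every prior
    task come from the long-term network (task rehearsal / functional transfer),
    and virtual examples for the new task come from the short-term network."""
    train_set = []
    for t in range(n_tasks):
        source = short_term_predict if t == new_task_id else long_term_predict
        train_set += make_virtual_examples(source, t, n_tasks, n_inputs, m)
    return train_set  # the long-term network is then retrained on this set
```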
4
Experiments
We empirically investigate the conditions to fulfil the long-term domain knowledge requirements of an ML3 system using a csMTL-based CDK network. Our analysis will focus on the effectiveness of prior task retention and new task consolidation into a csMTL network. More specifically, the experiments examine (1) the retention of prior task knowledge as the number of virtual examples for each task varies; (2) the benefit of combining functional and representational transfer; and (3) the scalability of the method as up to 15 tasks are sequentially consolidated within a csMTL-based CDK.
4.1
Task Domains and General Conventions
The Logic 1 task domain is synthetic, consisting of eight tasks. It has 11 real-valued inputs in the range [0, 1]. A positive example for a task is calculated by a logical conjunction of two disjunctions, each involving two of the real inputs. For example, the first task, C1, is defined as (a > 0.5 ∨ b > 0.5) ∧ (c > 0.5 ∨ d > 0.5). For each new task, the set of inputs shifts one letter to the right in the alphabet. The Logic 2 domain is an extension of the Logic 1 domain, consisting of 15 tasks with 18 real-valued inputs in the range [0, 1]. The Covertype and Dermatology real-world domains were also examined with similar results, but for brevity will not be discussed in this paper [3]. All experiments were performed using the RASL3 ML3 system developed at Acadia University and available at ml3.acadaiu.ca. Preliminary experiments investigated the network structure and learning parameters required to learn up to 15 tasks of the Logic domain to a sufficient level of accuracy (greater than 0.75 for all tasks) using real training examples. The following was determined: one layer of hidden nodes is sufficient provided there are at least 17 hidden nodes, so 30 nodes are used in all of the following experiments; the learning rate must be small - less than 0.001, and the momentum term can remain at 0.9 for all experiments to speed up learning when possible. Multiple runs of experiments are necessary to determine confidence in the results. For each repetition of an experiment, different training and validation sets are used as well as random initial weights. In all experiments, the primary performance statistic is the accuracy on an independent test set. A hypothesis test of statistical significance between test set accuracy is done for each experiment, specifically the two-tailed Student’s t-test. Each sequential learning run (training of up to 15 tasks one after the other) required a lot of time to complete. This limited the number of repetitions to three. It is important to note that examples for the new task being consolidated are duplicated to match the number of virtual examples for each prior task for a run. This is to ensure that all tasks get an equal opportunity to affect the weight updates in the csMTL network. For brevity, only results for certain tasks and the mean of all tasks (including the new task being consolidated) are provided in this paper. A complete set of results from our research can be found in [3]. 4.2
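As an illustration, the Logic 1 examples could be generated as follows (our reading of the stated definition, with the four-input window shifting one position per task; this is not the authors' data generator):

```python
import random

def logic_label(x, task_index):
    """Task C_{task_index+1}: (x[i] > 0.5 or x[i+1] > 0.5) and
    (x[i+2] > 0.5 or x[i+3] > 0.5), with the window shifted by the task index."""
    i = task_index
    return int((x[i] > 0.5 or x[i + 1] > 0.5) and (x[i + 2] > 0.5 or x[i + 3] > 0.5))

def make_logic_examples(n_examples, n_inputs=11, task_index=0):
    data = []
    for _ in range(n_examples):
        x = [random.random() for _ in range(n_inputs)]
        data.append((x, logic_label(x, task_index)))
    return data

# 100 examples of the first Logic 1 task, C1, over inputs a..k.
examples = make_logic_examples(100, n_inputs=11, task_index=0)
```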
Experiment 1: Impact of Varying Number of Virtual Examples for Prior Tasks
This experiment examines the change in generalization accuracy of previously learned tasks as new tasks are consolidated within a csMTL network. The number of virtual examples for prior tasks varies between runs, providing increasingly more task rehearsal. We are interested in how this variation in virtual examples affects the accuracy of the retained models. Methodology: The learning rate is set to 0.001 for all configurations except the 1000 virtual examples per task configuration, where it is lowered to 0.0005, to compensate for the much larger training set. 100 new task examples are used
for each training set (and duplicated as needed to match the number of virtual examples for each prior task) and 100 new task examples are used for each validation set. A combination of functional and representational transfer is used. Three repetitions are made for each configuration, which consists of learning each of the seven tasks in sequence, using six different numbers of virtual examples. Results and Discussion: Figure 3 shows the accuracy for the second task, C2, over six consolidation steps under varying amounts of virtual examples. The labels follow a format of <task>-<number of virtual examples used per task> and then a letter code. The code indicates the type of transfer, either ’F’ for functional, or ’FR’ for functional plus representational. The figure shows that the accuracy of the consolidated network developed with 300 or more virtual examples retains models better for the C2 task than when only 100 virtual examples are used. A difference of means two-tailed T-test between the 1000 and 100 virtual example models in the last consolidation step confirms this with a p-value of 99.9%. However, this result is not consistent for all tasks. More generally, the results indicate that increasing the number of virtual examples for prior tasks slows the loss of knowledge, but not enough to stop it for all tasks, consistently. Some tasks showed signs of plateauing or even improving as more tasks are consolidated when there are sufficient virtual examples, such as C2. However, later tasks often begin at a smaller base accuracy, causing the mean task accuracy to decline as more tasks are consolidated. This is the problem of stability-plasticity. With the current network configuration, using representational transfer and more virtual examples results in a more stable network for prior tasks, but a less plastic one for consolidating in new tasks. 4.3
Experiment 2: Impact of Transfer Type
Consolidation using functional transfer and consolidation using representational transfer both have their merits. Without functional transfer, it is difficult to maintain prior task generalization accuracies while new task knowledge is being consolidated. Without representational transfer, the models of prior tasks must essentially be rebuilt, and in rebuilding the models, prior knowledge not being transferred from functional examples may be lost. The transfer type is either functional or functional plus representational. We observe the change in generalization accuracy of previously learned tasks, as well as the change in generalization accuracy of new tasks, as a function of the method of transfer and validation set examples. Methodology: The learning rate is set to 0.001 for all configurations. 100 unique new task examples (duplicated to 300) and 300 virtual examples for each prior task are included in each training set. The validation set consists of 100 examples for each included task. Three repetitions are made for each configuration, which consists of learning each of the seven tasks in sequence, using two different types of transfer.
Results and Discussion: The results for the second task, C2, and the average for all tasks are shown in Figure 4. The graphs demonstrate the effects of using different types of transfer as sequential consolidation occurs. A hypothesis test for task C2 confirms that the functional plus representational transfer approach is superior to the model using only functional transfer with 97.6% confidence. However, the mean task test set accuracy for the last consolidation step does not differ significantly between the methods. The new task model accuracies for functional transfer are greater than or equal to those of the models developed using functional plus representational transfer for all tasks, except C4. Once again, the results indicate that combining representational and functional transfer provides more effective retention; however, new tasks have an increasingly more difficult time developing accurate consolidated models. This is what causes the mean task accuracy to decline over the sequence. Conversely, the functional transfer method demonstrates better new task consolidation but poor prior task retention. The gains and losses balance out the mean task accuracy as the graph approaches the last consolidation step. 4.4
Experiment 3: Scalability
This experiment examines the performance of the csMTL-based CDK as a large number of Logic 2 domain tasks are learned in sequence. The objective is to test, under optimal conditions, the scalability of the system to retain accurate prior knowledge while consolidating up to 15 tasks, one after the other. Methodology: The combined functional and representational transfer method is used and compared to single task learning (STL) for each of the tasks. A network of one hidden layer and 30 hidden nodes with a learning rate of 0.0001 is used for all runs. The training sets consist of 300 real examples for each task - no virtual examples are used in this experiment so as to ensure the accurate rehearsal of prior tasks and accurate STL models. The validation set consists of 100 examples for the newest task. Three repetitions are made for each configuration, which consists of learning each of the 15 tasks in sequence. Results and Discussion: Results for the second task, C2, shown in Figure 5 demonstrate that the method performs quite well in terms of retention over the 15 tasks, with no significant difference between the STL and the retained models. A two-tailed t-test shows no significant difference in mean accuracy (p-value = 0.77) on the final task. However, the graph does show a small but steady decline in prior task knowledge as more tasks are consolidated, even when using real examples for rehearsing prior tasks. Other tasks behave similarly. As seen in the prior experiments, the accuracy of the new consolidated tasks eventually starts to decline. The new task accuracy remains high and steady until task C7, where there is a significant drop below that of its STL counterpart. The base accuracy for new tasks then continues to fall as more tasks are consolidated into the csMTL network.
Fig. 3. Graph of the C2 test set accuracy over which tasks are consolidated, for varying amounts of virtual examples
Fig. 4. Graph of the C2 and mean test set accuracy for different types of transfer
Fig. 5. Graph of the C2 and mean test set accuracy as many tasks are developed
5
Conclusion
This paper has presented a machine lifelong learning system using a context-sensitive multiple task learning-based consolidated domain knowledge network. The long-term motivation in the design of this system is to create a machine that can effectively retain knowledge of previously learned tasks and use this knowledge to more effectively and efficiently learn new tasks. Short-term real world applications include medical classification problems. An MTL-based consolidated domain knowledge network had been explored in previous work, but was found to have limitations preventing its fulfilment of the requirements of an ML3 system. To address these limitations, we have proposed a new ML3 system based on a csMTL-based CDK network that is capable of sequential knowledge retention and inductive transfer. A csMTL-based system can focus on learning shared representation, more than an MTL-based system, because all weight values in the csMTL network are shared between all tasks. The system is meant to satisfy a number of ML3 requirements, including the effective consolidation of task knowledge into a long-term network using task rehearsal, the accumulation of task knowledge from practice sessions, and effective and efficient inductive transfer during new learning. The experiments have demonstrated that consolidation of new task knowledge within a csMTL network without loss of prior task knowledge is possible but not consistent for all tasks of the test domain. The approach requires a sufficient number of training examples for the new task and an abundance of virtual training examples for rehearsal of the prior tasks. Also required is sufficient internal representation to support all tasks, slow training using a small learning rate and a method of early stopping to prevent over-fitting and therefore the growth of high magnitude weights. Our empirical findings indicate that representational transfer of prior knowledge, in addition to functional transfer through task rehearsal, improves retention of prior task knowledge, but at the cost of less accurate models for newly consolidated tasks. We conclude that the stability-plasticity problem is not resolved by our current csMTL-based ML3 system. The ultimate goal of this avenue of research is the development of a true machine lifelong learning system, a learning system capable of integrating new tasks and retaining old knowledge effectively and efficiently. Possible future research directions include:
1. Explore the decay of model accuracy for new task consolidation by examining greater numbers of hidden nodes, or the injection of random noise to mitigate the build-up of high magnitude network weights. Recent work on transfer learning using deep hierarchies of features suggests that multiple layers of hidden nodes are worth exploring [7].
2. Exploit task domain properties to guide the generation of virtual examples. Results on a real-world task domain indicated that successful learning required exploitation of meta-data to construct virtual examples, to more closely match the input distribution of the real data [3].
Acknowledgments. This research has been funded in part by the Government of Canada through NSERC.
References 1. Baxter, J.: Learning model bias. In: Touretzky, D.S., Mozer, M.C., Hasselmo, M.E. (eds.) Advances in Neural Information Processing Systems, vol. 8, pp. 169–175. The MIT Press, Cambridge (1996) 2. Caruana, R.A.: Multitask learning. Machine Learning 28, 41–75 (1997) 3. Fowler, B.: Context-Sensitive Multiple Task Learning with Consolidated Domain Knowledge. Master’s Thesis Thesis, Jodrey School of Computer Science, Acadia University (2011) 4. French, R.M.: Pseudo-recurrent connectionist networks: An approach to the sensitivity-stability dilemma. Connection Science 9(4), 353–379 (1997) 5. Grossberg, S.: Competitive learning: From interactive activation to adaptive resonance. Cognitive Science 11, 23–64 (1987) 6. Robins, A.V.: Catastrophic forgetting, rehearsal, and pseudorehearsal. Connection Science 7, 123–146 (1995) 7. Salakhutdinov, R., Adams, R., Tenenbaum, J., Ghahramani, Z., Griffiths, T.: Workshop: Transfer Learning Via Rich Generative Models. Neural Information Processing Systems (NIPS) (2010), http://www.mit.edu/~ rsalakhu/workshop_nips2010/index.html 8. Silver, D.L., Mercer, R.E.: The parallel transfer of task knowledge using dynamic learning rates based on a measure of relatedness. Learning to Learn, 213–233 (1997) 9. Silver, D.L., Mercer, R.E.: The task rehearsal method of life-long learning: Overcoming impoverished data. In: Advances in Artificial Intelligence, 15th Conference of the Canadian Society for Computational Studies of Intelligence (AI 2002), pp. 90–101 (2002) 10. Silver, D.L., Poirier, R.: Sequential consolidation of learned task knowledge. In: 17th Conference of the Canadian Society for Computational Studies of Intelligence (AI 2004). LNAI, pp. 217–232 (2004) 11. Silver, D.L., Poirier, R., Currie, D.: Inductive tranfser with context-sensitive neural networks. Machine Learning 73(3), 313–336 (2008) 12. Thrun, S.: Is learning the nth thing any easier than learning the first? In: Advances in Neural Information Processing Systems 8, vol. 8, pp. 640–646 (1996) 13. Turney, P.D.: The identification of context-sensitive features: A formal definition of context for concept learning. In: 13th International Conference on Machine Learning (ICML 1996), Workshop on Learning in Context-Sensitive Domains, Bari, Italy, vol. NRC 39222, pp. 53–59 (1996)
Extracting Relations between Diseases, Treatments, and Tests from Clinical Data Oana Frunza and Diana Inkpen School of Information Technology and Engineering University of Ottawa, Ottawa, ON, Canada, K1N6N5 {ofrunza,diana}@site.uottawa.ca Abstract. This paper describes research methodologies and experimental settings for the task of relation identification and classification between pairs of medical entities, using clinical data. The models that we use represent a combination of lexical and syntactic features, medical semantic information, terms extracted from a vector-space model created using a random projection algorithm, and additional contextual information extracted at sentence-level. The best results are obtained using an SVM classification algorithm with a combination of the above mentioned features, plus a set of additional features that capture the distributional semantic correlation between the concepts and each relation of interest. Keywords: clinical data-mining, relation classification.
1
Introduction
Identifying semantic relations between medical entities can help in the development of medical ontologies, in question-answering systems on medical problems, in the creation of clinical trials — based on patient data new trials for already known treatments can be created to test their therapeutic potential on other diseases, and in identifying better treatments for a particular medical case by looking at other cases that followed a similar clinical path. Moreover, identifying relations between medical entities in clinical data can help in stratifying patients by disease susceptibility and response to therapy, reducing the size, duration, and cost of clinical trials, leading to the development of new treatments, diagnostics, and prevention therapies. While some research has been done on technical data, text extracted from published medical articles, little work has been done on clinical data, mostly because of lack of resources. The data set that we used is the data released in the fourth i2b2-10 shared-task challenges in natural language processing for clinical data1 , the relation identification track in which we participated.
2
Related Work
The relation classification task represents a major focus for the computational linguistic research community. The domains on which this task was deployed
https://www.i2b2.org/NLP/Relations/
vary widely, but the major approaches used to identify the semantic relation between two entities are the following: rule-based methods and templates to match linguistic patterns, co-occurrence analysis, and statistical or machine-learning based approaches. Due to space limitations and the fact that our research is focused on the bioscience domain, we describe relevant previous work done in this domain only using statistical methods. Machine learning (ML) methods are the ones that are most used in the community. They do not require human effort to build rules. The rules are automatically extracted by the learning algorithm when using statistical approaches to solve various tasks [1], [2]. Other researchers combined the bag-of-words features extracted from sentences with other sources of information like part-of-speech [3]. [4] used two sources of information: sentences in which the relation appears and the local context of the entities, and showed that simple representation techniques bring good results. In our previous work presented in [5], we showed that domain-specific knowledge improves the results. Probabilistic models are stable and reliable for tasks performed on short texts in the medical domain. The representation techniques influence the results of the ML algorithms, but more informative representations are the ones that consistently obtain the best results. In the i2b2 shared-task competition [6] the system that performed the best obtained a micro-averaged F-measure value of 73.65%. The mean of the F-measure scores of all the teams that participated in the competition was 59.58%.
3
Data Set
The data set annotated with existing relations between two concepts in a sentence (if any) focused on 8 possible relations. These relations can exist only between medical problems and treatments, medical problems and tests, and medical problems and other medical problems. These annotations are made at sentence level. Sentences that contain these concepts, but without any relation between them, were not annotated. The training data set consisted of 349 records, divided by their type and provenance, while the test set consisted of 477 records. Table 1 presents the class distribution for the relation annotations in the training and the test data. Besides the annotated data, 827 unannotated records were also released. In order to create training data for the Negative class, a class in which a pair of concepts is not annotated with any relation, we considered sentences that had only one pair of concepts in no relation. This choice yielded a data set of 1,823 sentences. In the test data set, 50,336 pairs of concepts were not annotated with a relation. These pairs represent the Negative-class test set. In the entire training data, 6,381 sentences contained more than two concepts. In the test data this number rose to 10,437.
Table 1. The number of sentences of each relation in the training and test data sets

Relation                                                            Training   Test
PIP (medical problem indicates medical problem)                     1239       1,989
TeCP (test conducted to investigate medical problem)                303        588
TeRP (test reveals medical problem)                                 1734       3,033
TrAP (treatment is administered for medical problem)                1423       2,487
TrCP (treatment causes medical problem)                             296        444
TrIP (treatment improves medical problem)                           107        198
TrNAP (treatment is not administered because of medical problem)    106        191
TrWP (treatment worsens medical problem)                            56         143
4
Method Description
Our method uses a supervised machine learning setting with various types of feature representation techniques. 4.1
Data Representation
The features that we extracted for representing the pair of entities and the sentence context use lexical information, information about the type of concept of each medical entity, and additional contextual information about the pair of medical concepts.
The bag-of-words (BOW) feature representation uses single token features with a frequency-based representation.
ConceptType. The second type of features represents semantic information about the type of medical concept of each entity: problem, treatment, and test.
ConText. The third type of feature represents information extracted with the ConText tool [7]. The system is capable of providing three types of contextual information for a medical condition: Negation, Temporality, and Experiencer.
Verb phrases. In order to identify verb phrases, we used the Genia tagger2 tool. The verb-phrases identified by the tagger are considered as features. We removed the following punctuation marks: [ . , ’ ( ) # $ % & + * / = < > [ ] - ], and considered valid features only the lemma-based forms of the identified verb-phrases.
Concepts. In order to make use of the fact that we know what token or sequence of tokens represents the medical concept, we extracted from all the training data a list of all the annotated concepts and considered this list as possible nominal values for the Concept feature.
http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
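A simplified sketch of how such a feature vector could be assembled for one concept pair is given below (illustrative only; the feature names and tokenisation are our own, and the real system relies on the ConText and Genia tools rather than the precomputed arguments assumed here):

```python
def build_features(sentence_tokens, concept1, concept2, concept_types,
                   verb_phrases, context_flags, vocabulary):
    """Combine BOW counts, concept types, ConText-style flags, and
    verb-phrase indicators into one dictionary of features."""
    features = {}
    # Bag-of-words: frequency of each vocabulary token in the sentence.
    for token in sentence_tokens:
        if token in vocabulary:
            features["bow=" + token] = features.get("bow=" + token, 0) + 1
    # Semantic type of each concept (problem / treatment / test).
    features["type1=" + concept_types[concept1]] = 1
    features["type2=" + concept_types[concept2]] = 1
    # Contextual flags (e.g. negation, temporality, experiencer).
    for flag, value in context_flags.items():
        features["ctx=" + flag + "=" + str(value)] = 1
    # Lemmatised verb phrases found in the sentence.
    for vp in verb_phrases:
        features["vp=" + vp] = 1
    # The annotated concepts themselves as nominal features.
    features["concept1=" + concept1] = 1
    features["concept2=" + concept2] = 1
    return features
```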
Semantic vectors. Semantic vector models are models in which concepts are represented by vectors in some high dimensional space. Similarity between concepts is computed using the analogy of similarity or distance between points in this vector space. The main idea behind semantic vector models is that words and concepts are represented by points in a mathematical space, and this representation is learned from text in such a way that concepts with similar or related meanings are near to one another in that space. In order to create these semantic vectors and use them in our experiments we used the Semantic Vectors Package3 [8]. The package uses indexes created by applying a Random Projection algorithm to term-document matrices created using Apache Lucene4 . We used the semantic vectors to extract the top 300 terms correlated with each relation and to determine the semantic distribution of a pair of concepts in the training corpus of all 9 relations.
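As a rough illustration of the idea (not the Semantic Vectors package itself, and with simple random indexing standing in for its Lucene-based random projection), one can give each document a random index vector, derive term vectors from them, and rank terms by similarity to the centroid of the documents annotated with a given relation:

```python
import math, random

def build_term_vectors(docs, dim=100, seed=0):
    """docs: list of dicts with 'tokens' (list of str) and 'relation' (str).
    Each document gets a random +/-1 index vector; a term's vector is the
    sum of the index vectors of the documents it occurs in."""
    rng = random.Random(seed)
    doc_vecs = [[rng.choice((-1.0, 1.0)) for _ in range(dim)] for _ in docs]
    term_vecs = {}
    for doc, dvec in zip(docs, doc_vecs):
        for term in set(doc["tokens"]):
            vec = term_vecs.setdefault(term, [0.0] * dim)
            for i, v in enumerate(dvec):
                vec[i] += v
    return doc_vecs, term_vecs

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def top_terms(docs, relation, k=300, dim=100):
    """Terms most correlated with one relation: highest cosine similarity to
    the centroid of that relation's document vectors."""
    doc_vecs, term_vecs = build_term_vectors(docs, dim)
    rel_vecs = [v for d, v in zip(docs, doc_vecs) if d["relation"] == relation]
    if not rel_vecs:
        return []
    centroid = [sum(vals) / len(rel_vecs) for vals in zip(*rel_vecs)]
    ranked = sorted(term_vecs, key=lambda t: cosine(term_vecs[t], centroid),
                    reverse=True)
    return ranked[:k]
```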
5
Classification Technique
As classification algorithms, we used the SVM implementation with a polynomial kernel from the Weka5 tool. To solve the task, we use a 9-class classification model (the 8 relations of interest and the Negative class), and also a model that uses a voting ensemble of 8 binary classifiers. The ensemble consists of 8 binary classifiers, each focused on one of the relations versus the Negative class. When we use the voting ensemble, we identify the negative test instances as the data points that are classified as Negative by all 8 binary classifiers. Once these negative instances are eliminated, we deploy an 8-class classifier to identify the relations that exist between the remaining instances.
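The two-stage voting scheme can be outlined as follows (a schematic sketch with placeholder classifier objects, not the Weka configuration used in the experiments):

```python
def classify_with_ensemble(instance, binary_classifiers, eight_class_classifier):
    """Stage 1: eight one-vs-Negative classifiers vote; if every one of them
    says 'Negative', the pair of concepts is taken to have no relation.
    Stage 2: otherwise an 8-class classifier assigns one of the relations."""
    votes = [clf.predict(instance) for clf in binary_classifiers]
    if all(v == "Negative" for v in votes):
        return "Negative"
    return eight_class_classifier.predict(instance)
```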
6
Results
In this section, we present the results obtained in the competition and post-competition experimental results. The evaluation metric is micro-averaged F-measure. Table 2 presents our results on the test data, both the competition results and the post-competition ones. More details on the competition experiments can be found in [9]. The post-competition experiments were mostly focused on capturing the semantic correlation between the terms of the pair of concepts and the instances that are contained in each relation. We also tried to capture the verb-phrase overlap between the training and test instances, because these relations revolve around the verbs that are attached to the concept pair. As we can see from
http://code.google.com/p/semanticvectors/ http://lucene.apache.org/java/docs/index.html http://www.cs.waikato.ac.nz/ml/weka/
Table 2. F-measure results in the competition

Competition
  BOW + Concept + ConceptType + ConText               40.88%
  BOW + ConceptType                                    40.98%
  BinaryClassifiers                                    39.34%
Post-competition
  SemVect 300                                          40.49%
  SemVect + VPs + ConceptType                          44.44%
  BOW + SemVect + VPs + ConceptType                    47.05%
  BOW + SemVect + VPs + ConceptType + DistSem          47.53%
  BOW(context) + ConceptType + VPs + DistSem + VBs     86.15%
Table 2, the post-competition results improved the competition results and the best representation technique is the one that uses a combination of BOW, semantic vectors information, type of the concepts, and verb phrases.
7
Discussion and Conclusions
The results obtained in the competition showed that a richer representation better identifies the existing relations. The ensemble of classifiers showed more balance between all the measures. Since the ensemble of classifiers showed promising results in weeding out the negative examples, we ran more experiments using only the 8 relations of interest. With this setting, we obtained the best result of 86.15%. In this experiment, we used additional nominal features for each relation containing verbs that are synonyms to the verbs that describe each relation. The value of these features is the number of verbs overlapping with the context of each pair. The contexts consist of all the words between the two concepts of the pair. The features that we used are presented in Table 2. We believe that the results can be further improved by using classifiers that are trained on the relations that exist between a certain type of concepts, e.g., one classifier that is trained only on the relations that exist between medical problems and treatments, etc. Our post-competition results exceed the mean results in the competition. As future work, we plan to focus more on adding features that are specific for each concept, reduce the context from sentence level to shorter contexts, look into more verb information, and better understand and incorporate additional information for each relation.
References 1. Donaldson, I., Martin, J., de Bruijn, B., Wolting, C., Lay, V., Tuekam, B.: Prebind and textomy: Mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics 4(11), 11–24 (2003) 2. Mitsumori, T., Murata, M., Fukuda, Y., Doi, K., Doi, H.: Extracting protein-protein interaction information from biomedical text with svm. IEICE Transactions on Information and Systems 89(8), 2464–2466 (2006)
3. Bunescu, R., Mooney, R.: shortest path dependency kernel for relation extraction. In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT/EMNLP), pp. 724–731 (2005) 4. Giuliano, C., Lavelli, A., Romano, L.: Exploiting shallow linguistic information for relation extraction from biomedical literature. In: 11th Conference of the European Chapter of the Association for Computational Linguistics, pp. 401–409 (2006) 5. Frunza, O., Inkpen, D., Tran, T.: A machine learning approach for identifying disease-treatment relations in short texts. IEEE Transactions on Knowledge and Data Engineering (2010) (in press) 6. Roberts, K., Rink, B., Harabagiu, S.: Extraction of medical concepts, assertions, and relations from discharge summaries for the fourth i2b2/va shared task (2010) 7. Chapman, W., Chu, D., Dowling, J.N.: Context: an algorithm for identifying contextual features from clinical text. In: ACL 2007 Workshop on Biological, Translational, and Clinical Language Processing (BioNLP 2007), pp. 81–88 (2007) 8. Widdows, D., Ferraro, K.: Semantic vectors: a scalable open source package and online technology management application. In: Calzolari, N., (Conference Chair), Choukri, K., Maegaard, B., Mariani, J., Odjik, J., Piperidis, S., Tapias, D. (eds.) Proceedings of the Sixth International Language Resources and Evaluation (LREC 2008). European Language Resources Association (ELRA), Marrakech (2008), http://www.lrec-conf.org/proceedings/lrec2008/ 9. Frunza, O., Inkpen, D.: Identifying and classifying semantic relations between medical concepts in clinical data, i2b2 challenge (2010)
Compact Features for Sentiment Analysis Lisa Gaudette and Nathalie Japkowicz School of Information Technology & Engineering University of Ottawa Ottawa, Ontario, Canada {lgaud082,njapkow}@uottawa.ca Abstract. This work examines a novel method of developing features to use for machine learning of sentiment analysis and related tasks. This task is frequently approached using a “Bag of Words” representation – one feature for each word encountered in the training data – which can easily involve thousands of features. This paper describes a set of compact features developed by learning scores for words, dividing the range of possible scores into a number of bins, and then generating features based on the distribution of scored words in the document over the bins. This allows for effective learning of sentiment and related tasks with 25 features; in fact, performance was very often slightly better with these features than with a simple bag of words baseline. This vast reduction in the number of features reduces training time considerably on large datasets, and allows for using much larger datasets than previously attempted with bag of words approaches, improving performance.
1
Introduction
Sentiment analysis is the problem of learning opinions from text. On the surface, sentiment analysis appears similar to text categorization by topic, but it is a harder problem for many reasons, as discussed in Pang & Lee’s 2008 survey [11]. First and foremost, with text categorization, it is usually much easier to extract relevant key words, while sentiment can be expressed in many ways without using any words that individually convey sentiment. In topic classification there are undoubtedly red herrings, such as the use of analogies and metaphor, but if a word associated with a given domain is mentioned frequently, it is usually related (although not necessarily the most relevant). However, in sentiment analysis there are many examples of “thwarted expectations” (e.g. “I was expecting this movie to be great, but it was terrible”) and comparison to an entity with opposing sentiment (e.g. “I loved the first movie, but this sequel is terrible”) such that a positive review can easily have many negative words and vice versa [11]. In addition, words that convey positive sentiment in one domain may be irrelevant or negative in another domain, such as the word unpredictable, which is generally positive when referring to the plot of a book or a movie but negative when referring to an electronic device. The Bag of Words (BOW) representation is commonly used for machine learning approaches to text classification problems. This representation involves creating a feature vector consisting of every word seen (perhaps some minimum
number of times) in the training data and learning based on the words that are present in each document, sentence, or other unit of interest. However, this approach leads to a very large, sparse feature space, as words are distributed such that there are a small set of very frequent, not very informative words, and a great deal of individually rarer words that carry most of the information in a sentence. Thousands of features are required to adequately represent the documents for most text classification tasks. Another approach is to learn scores for words, and then use these words to classify documents based on the sum or average of the scores in a document and some threshold. While this type of approach is useful, bag of words based approaches generally perform better, although the two approaches can be combined together through meta classifiers or other approaches to improve on either approach individually. This research proposes a novel method of combining machine learning with word scoring through condensing the sparse features of BOW into a very compact “numeric” representation by using the distribution of the word scores in the document. This approach allows for combining word scores with machine learning in a way that provides much more information than a simple global word score for a document, with substantially fewer features than a BOW approach. Other combination approaches add complexity to the BOW idea, by adding an extra meta-classifier or making the features even more complicated; ours instead uses the results of a word scoring method to make the machine learning step simpler by reducing a feature vector in the thousands to one of about twenty-five.
2
Related Work
This paper combines ideas from two different basic approaches to sentiment analysis. The first is to use a bag of words feature set and train a machine learning classifier such as a Support Vector Machine (SVM) using the BOW features, such as in [10]. The second approach is to learn scores for words and score documents based on those scores, such as in [3]. Some previous attempts to combine these two approaches include combining results from both systems with a meta classifier such as in [8], [6], and [1], and weighting bag of words features by a score, such as the use of TF/IDF in [7]. While there are many techniques to improve on the basic idea of BOW through refining the features or combining it with other approaches, it remains a good basic approach to the problem.
3
Approach to Generating Compact Features
The approach used here involves 3 steps. The first step is to calculate scores for the words, while the second is to represent the documents in terms of the distribution of those word scores. Finally, we run a machine learning algorithm on the features representing the distribution of the word scores. We refer to our features as “Numeric” features.
3.1
Learning Word Scores
The first step to this approach involves learning word scores from the text. We initially considered three different supervised methods of scoring words, and found that Precision performed best. This method was inspired by its use in [12] for extracting potential subjective words. It represents the proportion of occurrences of the word which were in positive documents, but does not account for differences in the number of words in the sets of positive and negative documents. This produces a value between 0 and 1.

precision = wP / (wP + wN)    (1)

wP, wN: the number of occurrences of word w in positive (negative) documents

In order to calculate the precision, we first go through the training data and count the number of positive and negative instances of each word. We then compute the scores for each word. As a word which appears very few times could easily appear strongly in one category by chance, we chose to only use words appearing at least 5 times. This produces a list of word scores customized to the domain of interest. 3.2
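Under our reading of Equation (1), the scoring step can be sketched as follows (assuming simple whitespace tokenisation, which is not necessarily the preprocessing used in the paper):

```python
from collections import Counter

def precision_scores(pos_docs, neg_docs, min_count=5):
    """Score each word by the proportion of its occurrences that fall in
    positive documents; keep only words seen at least `min_count` times."""
    pos_counts = Counter(w for doc in pos_docs for w in doc.split())
    neg_counts = Counter(w for doc in neg_docs for w in doc.split())
    scores = {}
    for word in set(pos_counts) | set(neg_counts):
        wp, wn = pos_counts[word], neg_counts[word]
        if wp + wn >= min_count:
            scores[word] = wp / (wp + wn)
    return scores
```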
Generating Features from Scored Words
In order to generate features from these scored words, we first divide the range of possible scores into a number of “bins”, representing a range of word scores. We then go through each document, look up the score for each word, and increment the count of its corresponding bin. After we have counted the number of words in each bin, we normalize the count by the number of scored words in the document, such that each bin represents the percentage of the words in the document in its range of scores. Example of Generating Features. This section shows an example of scoring a document after we have scored the words, assuming 10 bins and precision word scores ranging from 0 to 1. Figure 1a shows the preprocessed text of the review with word scores, while Figure 1b shows the results of counting the number of words in each bin, and then normalizing those counts based on the number of scored words in the document to generate the features we use for machine learning. After going through this process for a set of documents, we have a set of numeric features based on the distribution of the word scores that is much more compact than the bag of words representation and can be used as input to a machine learning algorithm.
4
Selecting Parameters
There are three main options to this approach – the method for scoring the words, the number of bins to use, and the machine learning algorithm to use.
(a) Review preprocessed and annotated with word scores:

    0.460 i      0.503 have   0.545 always    0.555 been       0.576 very   0.850 pleased
    0.526 with   0.497 the    0.449 sandisk   0.351 products
    0.460 i      0.403 would  0.898 highly    0.568 recommend  0.465 them

(b) Numeric features generated from the review:

    Range      Count  Feature
    0.00-0.10  0      0.000
    0.10-0.20  0      0.000
    0.20-0.30  0      0.000
    0.30-0.40  1      0.067
    0.40-0.50  6      0.400
    0.50-0.60  6      0.400
    0.60-0.70  0      0.000
    0.70-0.80  0      0.000
    0.80-0.90  2      0.133
    0.90-1.00  0      0.000
Fig. 1. Generating Features From a Sample Review
We used two basic datasets to select these options: the reviews of Steve Rhodes from [10] and the 2000-review, balanced, Electronics dataset from [2]. We used these datasets as both ordinal and binary problems, for a total of 4 distinct problems. We examined the effect of three different scoring methods, varying the number of bins, and using a variety of classifiers as implemented in the WEKA machine learning system [13]. While we do not have space to present all of the details here, we found that the precision scoring method performed best on most datasets by a small margin, but that all scoring methods were close. The number of bins only affected performance by a very small amount given enough bins – some datasets performed very well with as few as 10 bins, while others needed 25, and using more bins had no consistent effect on performance beyond that point. The SMO algorithm, WEKA’s implementation of Support Vector Machines (SVM), performed well, while we also found that BayesNet performed nearly as well and was much faster, particularly in the ordinal case.

The experiments using the word score based features all use precision scoring with 25 bins. In the binary case, they use SMO with default settings, except for the option of fitting logistic models to the outputs. For ordinal problems, the BayesNet classifier is used instead. The BOW baseline classifiers are all constructed using SMO with default settings, as SVM has been shown to perform well in previous work. BayesNet did not perform well using the BOW representation.
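To illustrate how the pieces fit together, the sketch below substitutes scikit-learn's SVC for WEKA's SMO (a rough analogue only, not the paper's setup; docs_train, docs_test, y_train, and y_test are placeholder variables, and score_words / numeric_features are the sketches from Section 3):

```python
from sklearn.svm import SVC

scores = score_words(docs_train, y_train, min_count=5)
X_train = [numeric_features(d, scores, n_bins=25) for d in docs_train]
X_test = [numeric_features(d, scores, n_bins=25) for d in docs_test]

clf = SVC(kernel="linear", probability=True)   # probability=True loosely mirrors fitting logistic models to SMO outputs
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```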
5 Experiments
In order to evaluate the feasibility of this method, we test it across a range of datasets. In all cases, we compare the results to an SVM classifier using BOW features. We chose to compare against BOW because it is a widely used basic method and we view our approach as effectively compressing the features given
to a BOW algorithm with a pre-processing step. Where available, we also provide results obtained by the authors who introduced the datasets in order to provide some comparison to a wider range of methods. We evaluate both classifiers using only unigrams that appear at least 5 times in the training data.

Many authors working in this domain have simply reported accuracy as an evaluation metric, which has problems even in the binary case. For the ordinal problem, we will use Mean Squared Error (MSE), as it was shown in [4] to be a good measure for ordinal problems of this type, while for the binary problem we include AUC. We include accuracy in places to compare with previous work. Where multiple runs were feasible we use 10x10 fold cross validation. Times reported include all time taken to read in and process the documents into the respective representations, as well as the time to train and test the classifier, averaged over all folds where multiple runs were performed.

We have selected a range of datasets on which to evaluate this approach. We have both “document” level datasets, representing units that are (at least usually) several sentences or more long, and “sentence” level datasets, representing units of about one sentence (although sometimes a phrase, or two or three sentences). We also have a contrast between sometimes poorly written online user reviews of products and more professionally written movie reviews. We have one set of datasets which contains an order of magnitude more documents than the others, on which to examine the effects of adding more documents. Finally, we have one dataset that is for a slightly different problem than the others – subjectivity detection rather than sentiment analysis.
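As a small illustration of the two measures (using scikit-learn's implementations; the toy values below are not from the paper):

```python
from sklearn.metrics import mean_squared_error, roc_auc_score

# Ordinal case: squared error penalizes predicting 4 stars for a 1-star review
# more heavily than predicting 2 stars.
print(mean_squared_error([1, 3, 5, 4, 2], [2, 3, 4, 4, 1]))   # 0.6

# Binary case: AUC is computed from a ranking score such as P(positive) per document.
print(roc_auc_score([0, 0, 1, 1], [0.2, 0.6, 0.4, 0.9]))      # 0.75
```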
5.1 Small Amazon Reviews
This dataset consists of online user reviews across 4 categories and was used in [2]. As shown in Table 1, the numeric features are always slightly more accurate, with substantially higher AUC, while also being considerably faster than the BOW method, but neither performs as well as the linear predictor method used in [2]. The Electronics portion of this dataset was used for parameter tuning. These datasets were manually balanced such that each class represents 50% of the documents, which is not a very natural distribution.

In this case, we also chose to look at how a classifier trained and tested on all datasets together performed in comparison to the average of all classifiers. We found that both methods can train a classifier using all of the data that is slightly better than the average performance of the individual classifiers; however, training this classifier is very slow for the BOW method – so much slower that we only performed 2x10 fold cross validation on a different, faster computer, rather than 10x10 cross validation as in the other cases, and it still took many times longer to train. On the other hand, the numeric method trains a combined classifier slightly faster than the sum of the individual numeric classifiers; these two methods clearly scale up very differently as we add more documents.
Table 1. Amazon reviews, BOW vs. Numeric, Accuracy, AUC, and Time

Dataset               Type     Accuracy        AUC             Time (mm:ss)
Electronics (0.844)a  Numeric  0.801 ± 0.005   0.874 ± 0.004   0:01.0
                      BOW      0.791 ± 0.005   0.791 ± 0.005   0:22.2
DVD (0.824)           Numeric  0.797 ± 0.005   0.865 ± 0.005   0:01.4
                      BOW      0.775 ± 0.006   0.776 ± 0.006   0:37.8
Book (0.804)          Numeric  0.768 ± 0.005   0.839 ± 0.005   0:01.4
                      BOW      0.754 ± 0.006   0.754 ± 0.006   0:42.9
Kitchen (0.877)       Numeric  0.814 ± 0.006   0.896 ± 0.004   0:01.0
                      BOW      0.809 ± 0.005   0.809 ± 0.005   0:23.5
All Data b            Numeric  0.796 ± 0.003   0.874 ± 0.002   0:04.6
                      BOW      0.791 ± 0.002   0.791 ± 0.002   10:00.1
Average c             Numeric  0.795 ± 0.005   0.869 ± 0.005   0:04.9
                      BOW      0.782 ± 0.006   0.782 ± 0.006   2:06.5
Majority d                     0.474           0.500

a Accuracy in [2]
b One classifier trained on all datasets. BOW classifier is 2x10 CV, all other classifiers 10x10 CV
c Average performance of the individual classifiers over all datasets and total time to train the individual classifiers
d The results for the majority classifier are the same for all datasets given the same seed for the split of the data into folds
5.2 Very Large Datasets
This collection of data is a larger set of Amazon.com reviews from which the previous datasets were created. This collection included three domains with over 100,000 reviews – books, DVDs, and music, which allows us to explore how this approach scales to very large datasets. We used these larger datasets to examine how the numeric features scale in terms of both performance and time. In all cases, the results are reported on a single run using a 10,000 review test set (which is larger than most complete datasets used in previous research). These datasets are all highly imbalanced, with the majority class (5 star reviews) containing from 61-71% of the documents, and the minority class (2 star reviews) containing 4-6% of the reviews. As shown in Figure 2, the time required to train with the numeric features scales much more gently than the time required to train with the BOW features. Note that the graph features a logarithmic scale for time. For time reasons, we only trained BOW based classifiers on up to 10-15,000 reviews, while we trained the classifiers using numeric features on over 100,000 reviews for each dataset, with 300,000 reviews for the Books dataset. For the books dataset, it took 8 hours and 23 minutes to train on 15000 documents with BOW features, while with the numeric features we were able to train on 300,000 documents in 1 hour and 15 minutes. In the case of the numeric features, the vast majority of the time is spent scoring words and generating the features, while for BOW most of the time is spent training the machine learning algorithm.
Fig. 2. Time required to train classifiers based on Numeric and Bag of Words features using varying amounts of data, 3 large Ordinal datasets
Figure 3a shows the performance of both approaches with up to 15,000 reviews. In this range, the numeric features generally perform better by MSE, although at 15,000 Book reviews BOW is very slightly better. Figure 3b extends these performance results to the range where we only tested the numeric features; the straight dotted lines represent the performance of the largest BOW classifier, trained on 15,000 or 10,000 reviews depending on the dataset. This shows that if large numbers of documents are available, the numeric method continues to improve. Similar results are obtained when looking at these datasets in terms of binary classification, with one and two star reviews as the negative class and four and five star reviews as the positive class.
5.3 Movie Review Datasets
We use a number of datasets created by Bo Pang & Lillian Lee in the domain of movie reviews. The movie review polarity dataset (version 2) and a dataset for sentence level subjectivity detection are introduced in [9], while a dataset for ordinal movie reviews by four different authors and a dataset for the sentiment of movie review “snippets” (extracts selected by RottenTomatoes.com) are introduced in [10]. Table 2 presents the results on the three binary datasets, as well as the results reported by Pang & Lee on the datasets, where available. While in the case of the Binary Movie reviews the numeric features fall well short of their reported results, on the Subjective Sentences dataset they are very close. Note that this dataset is for the related problem of subjectivity detection and not sentiment analysis. Pang & Lee report results on 10 fold cross validation, while we report results on 10 runs of 10 fold cross validation in order to be less sensitive to the random split of the data.
[Figure 3: two panels plotting Mean Squared Error against the number of reviews for the Books, DVD, and Music datasets, Numeric vs. BOW; (a) up to 15,000 reviews, (b) over 10,000 reviews (number of reviews in thousands).]

Fig. 3. Mean Squared Error, Numeric vs. BOW, 3 large Ordinal datasets

Table 2. Binary Datasets, Average Performance and Time, with 95% confidence intervals

                      Accuracy        AUC             Time (m:ss)
Movie Review Polarity (Pang & Lee Accuracy: 0.872)
  Numeric             0.824 ± 0.005   0.896 ± 0.005   0:03
  BOW                 0.850 ± 0.005   0.850 ± 0.005   1:18
Movie Review Snippets
  Numeric             0.760 ± 0.002   0.841 ± 0.002   0:05
  BOW                 0.739 ± 0.002   0.739 ± 0.002   19:56
Subjective Sentences (Pang & Lee Accuracy: 0.92)
  Numeric             0.910 ± 0.002   0.967 ± 0.001   0:05
  BOW                 0.880 ± 0.002   0.880 ± 0.002   9:15

Table 3. Ordinal Movie Reviews, BOW vs. Numeric, Accuracy, MSE, and Time, with 95% confidence intervals

Author               Type     MSE             Accuracy        Time (s)
Schwartz (0.51)a     Numeric  0.580 ± 0.013   0.518 ± 0.009   0.83
                     BOW      0.691 ± 0.020   0.510 ± 0.009   13.20
Berardinelli (0.63)  Numeric  0.478 ± 0.010   0.557 ± 0.008   1.48
                     BOW      0.443 ± 0.012   0.644 ± 0.007   35.64
Renshaw (0.50)       Numeric  0.634 ± 0.016   0.468 ± 0.011   1.05
                     BOW      0.696 ± 0.020   0.496 ± 0.009   13.39
Rhodes (0.57)        Numeric  0.490 ± 0.010   0.566 ± 0.007   1.65
                     BOW      0.478 ± 0.011   0.609 ± 0.006   43.79

a Accuracy obtained by Pang & Lee
Table 3 reports the results on the ordinal movie reviews, by author. Again, we compare results to Pang & Lee, noting that we are approximating the values reported in a graph. Comparing to Pang & Lee based on accuracy, we find that in one case, the numeric features appear to be slightly better than their best result,
and in one other case, Pang & Lee’s result is within the confidence range of our numeric features. In the two other cases, Pang & Lee’s result is better than our result for the numeric feature set. However, we also note the comparison of their result to our simple BOW; in two cases our simple BOW classifier appears to be better, while in the other two the results are virtually the same. This confirms our assessment that this simple BOW is a good baseline to compare against.
6 Comparison with Feature Selection Methods
Another approach one might take to speeding up BOW is feature selection – selecting the most relevant features. In this section, we briefly compare the numeric features, plain BOW features, and BOW features reduced through two fast feature selection methods, Chi Squared and Information Gain. We use 5 fold cross validation on the subjective sentences, binary electronics (2000 review balanced version), and movie review snippets datasets. These feature selection methods both evaluate individual attributes; methods which evaluate subsets of attributes together exist but are much slower [5]. We show the results of these experiments in Table 4.

Table 4.

Feature Selection Method  Features  Subjective             Electronics          Snippets
                                    Accuracy  Time (m:ss)  Accuracy  Time (s)   Accuracy  Time (m:ss)
BOW                       –         0.874     11:27.0      0.786     19.3       0.732     37:21.3
Numeric                   25        0.911     0:04.2       0.790     1.4        0.755     0:04.4
Chi Squared               100       0.828     0:54.4       0.789     5.7        0.655     0:58.5
Chi Squared               250       0.860     1:12.8       0.807     8.0        0.697     2:15.7
Chi Squared               500       0.877     2:18.9       –         –          –         –
Chi Squared               1000      0.883     3:54.1       0.775     14.1       0.743     3:58.5
Chi Squared               1500      –         –            –         –          0.748     5:26.7
Info Gain                 100       0.830     0:55.8       0.788     5.2        0.654     0:58.4
Info Gain                 250       0.862     1:22.9       0.807     7.8        –         –
Info Gain                 500       0.878     1:56.1       0.780     13.5       –         –
Info Gain                 1000      0.883     3:03.5       –         –          0.743     4:24.9
Info Gain                 1500      –         –            –         –          0.748     8:09.4
For the subjective sentences dataset, feature selection by both methods performed slightly better than plain BOW with 500 selected features and even better with 1000, and with substantial time savings over plain BOW. However, the numeric features are still much faster than any of the feature selection methods, and achieve the highest accuracy by a substantial margin. On the electronics dataset all classifiers complete in seconds but the numeric features are still the fastest. However, in this instance, the feature selection methods which select 250 features both achieve slightly higher accuracy than the numeric features, and, while slower, this difference in time may not be meaningful on a dataset of this
size, as both complete in under 10 seconds. Finally, on the Movie Review Snippets dataset we again see that feature selection can save considerable time over plain BOW and improve performance slightly with enough features; however, the numeric method is again both much faster and more accurate.
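For reference, a hedged sketch of this kind of comparison using scikit-learn's univariate selectors in place of WEKA's attribute evaluators (an analogue only; raw_train, raw_test, y_train, and y_test are placeholder variables):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

for k in (100, 250, 500, 1000):
    pipe = make_pipeline(
        CountVectorizer(min_df=5),   # BOW features from unigrams appearing at least 5 times
        SelectKBest(chi2, k=k),      # keep the k top-ranked features (mutual_info_classif would approximate Info Gain)
        LinearSVC(),
    )
    pipe.fit(raw_train, y_train)
    print(k, pipe.score(raw_test, y_test))
```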
7 Discussion
This work decomposes the problem of learning the sentiment of documents into two simpler parts: scoring the strength of different words based on their distribution in positive and negative documents, and then learning document sentiment based on the distribution of those scores. This produces a compact representation, of around 25 features, compared with thousands for an effective BOW based approach. While this decomposition results in the loss of some information – for instance, when the co-occurrence of two words in a document is significant – it appears as if the BOW representation may be too sparse for such relationships to be learned meaningfully. It seems as though the machine learning algorithms for the BOW representation are mainly learning which words are significant indicators of sentiment, but are much slower at this than simple word scoring methods.

The numeric features performed better than BOW in all respects on the two sentence level datasets. In addition, they also performed well on the online user reviews. While the gap in performance narrowed on some datasets when comparing the largest trained BOW classifiers and the numeric classifiers trained with the same number of documents, the numeric features make training on very large datasets much more feasible, and we saw that performance continued to improve when using numeric features on larger and larger datasets. In addition, the most time consuming part of our approach is generating the features. In a system where new documents are frequently being added, such as an online review website, the words in each document only need to be counted once. This would reduce the time needed to update the system with new information, while the BOW/SVM approach would need to be completely retrained to account for new documents.

On the document level movie review datasets, the results are mixed. With the ordinal datasets, the numeric features perform slightly better on accuracy for one of the authors, and we see the worst relative performance overall on two of the authors. However, we note that the MSE differences are relatively large in favor of the numeric features on two of the datasets, and relatively small in favor of BOW on the other two. In the binary movie reviews, BOW has higher accuracy, and while the numeric features retain their advantage in terms of AUC, it is the smallest gap we see on that measure. These two datasets contain both relatively long and relatively well written material. While not conclusive, it may be that the numeric features are particularly adept at dealing with the shorter, less well written material found in all manner of less formal online discourse, including online user reviews – which is a more interesting domain in many respects than professionally written reviews.
8 Conclusions
We have shown that it is possible to greatly condense the features used for machine learning of sentiment analysis and other related tasks, yielding large speed improvements. We have also shown that these features often improve performance over a simple BOW representation and are competitive with other published results. These speed improvements make it possible to process datasets orders of magnitude larger than previously attempted for sentiment analysis, which in turn generally leads to further performance improvements. This method is effective on both longer and shorter documents, as well as on small and large datasets, and may be more resilient to poorly written documents such as those found in online user reviews.

In addition, we have briefly compared this approach with a feature selection based approach. While feature selection can improve speed over plain BOW considerably, and can also increase performance, the numeric features remain considerably faster, particularly on larger datasets; they exceeded the performance of the best feature selection methods on two of the three datasets we examined and were close on the other.
References
1. Andreevskaia, A., Bergler, S.: When specialists and generalists work together: Overcoming domain dependence in sentiment tagging. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies (2008)
2. Blitzer, J., Dredze, M., Pereira, F.: Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, ACL 2007 (2007), http://acl.ldc.upenn.edu/P/P07/P07-1056.pdf
3. Dave, K., Lawrence, S., Pennock, D.M.: Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In: Proceedings of the 12th International Conference on World Wide Web (2003)
4. Gaudette, L., Japkowicz, N.: Evaluation methods for ordinal classification. In: Proceedings of the Twenty-second Canadian Conference in Artificial Intelligence, AI 2009 (2009)
5. Hall, M.A., Holmes, G.: Benchmarking attribute selection techniques for discrete class data mining. IEEE Transactions on Knowledge and Data Engineering 15(6), 1437–1447 (2003)
6. Kennedy, A., Inkpen, D.: Sentiment classification of movie reviews using contextual valence shifters. Computational Intelligence 32(2), 223–262 (2006), http://www.site.uottawa.ca/~diana/publications.html
7. Martineau, J., Finin, T.: Delta TFIDF: An improved feature space for sentiment analysis. In: Third AAAI International Conference on Weblogs and Social Media (2009)
8. Mullen, T., Collier, N.: Sentiment analysis using support vector machines with diverse information sources. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (2004), http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Mullen.pdf
9. Pang, B., Lee, L.: A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In: ACL 2004: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, p. 271. Association for Computational Linguistics, Morristown (2004)
10. Pang, B., Lee, L.: Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. In: ACL 2005: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 115–124. Association for Computational Linguistics, Morristown (2005)
11. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, vol. 2. Now Publishers (2008)
12. Wiebe, J., Wilson, T., Bell, M.: Identifying collocations for recognizing opinions. In: Proceedings of the ACL 2001 Workshop on Collocation (2001)
13. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
Instance Selection in Semi-supervised Learning

Yuanyuan Guo1, Harry Zhang1, and Xiaobo Liu2

1 Faculty of Computer Science, University of New Brunswick, P.O. Box 4400, Fredericton, NB, Canada E3B 5A3, {yuanyuan.guo,hzhang}@unb.ca
2 School of Computer Science, China University of Geosciences, Wuhan, Hubei, China 430074, [email protected]
Abstract. Semi-supervised learning methods utilize abundant unlabeled data to help to learn a better classifier when the number of labeled instances is very small. A common method is to select and label unlabeled instances on which the current classifier has high classification confidence to enlarge the labeled training set and then to update the classifier, which is widely used in two paradigms of semi-supervised learning: self-training and co-training. However, the original labeled instances are more reliable than the self-labeled instances that are labeled by the classifier. If unlabeled instances are assigned wrong labels and then used to update the classifier, classification accuracy will be jeopardized. In this paper, we present a new instance selection method based on the original labeled data (ISBOLD). ISBOLD considers not only the prediction confidence of the current classifier on unlabeled data but also its performance on the original labeled data only. In each iteration, ISBOLD uses the change of accuracy of the newly learned classifier on the original labeled data as a criterion to decide whether the selected most confident unlabeled instances will be accepted to the next iteration or not. We conducted experiments in self-training and co-training scenarios when using Naive Bayes as the base classifier. Experimental results on 26 UCI datasets show that ISBOLD can significantly improve accuracy and AUC of self-training and co-training. Keywords: self-training, co-training, instance selection.
1 Introduction
In many real-world machine learning applications, it may be expensive or time-consuming to obtain a large amount of labeled data. On the other hand, it is relatively easy to collect lots of unlabeled data. Learning classifiers from a small number of labeled training instances may not produce good performance. Therefore, various algorithms have been proposed to exploit and utilize the unlabeled data to help to learn better classifiers. Semi-supervised learning is one such family of algorithms that uses both labeled data and unlabeled data.
Many semi-supervised learning algorithms have been proposed in the past decades, including self-training, co-training, semi-supervised support vector machines, graph-based methods, and so on [2,13]. The general idea of self-training [12] and co-training [1] is to iteratively pick some unlabeled instances according to a given selection criterion and move them (together with the labels assigned by the classifier) to the training set to build a new classifier. These selected instances are called “self-labeled” instances in [5]. The main difference between self-training and co-training is that, in co-training, the attributes are split into two separate sub-views and every operation is conducted on the two sub-views, respectively. A commonly used instance selection criterion is “confidence selection”, which selects unlabeled instances that are predicted by the current classifier with high confidence [1,2,6,8,12], that is, the instances with the highest class membership probabilities. Other selection methods have also been proposed. Wang et al. presented an adapted Value Difference Metric as the selection metric in self-training, which does not depend on class membership probabilities [10]. In [5], a method named SETRED is presented that utilizes the information of the neighbors of each self-labeled instance to identify and remove the mislabeled examples from the self-labeled data.

Ideally, the selected unlabeled instances (together with the predicted labels) can finally help to learn a better classifier. In [3], however, it is concluded that unlabeled data may degrade classification performance in some extreme conditions and under common assumptions when the model assumptions are incorrect. In our previous work [4], an extensive empirical study was conducted on some common semi-supervised learning algorithms (including self-training and co-training) using different base Bayesian classifiers. Results on 26 UCI datasets show that the performance of using “confidence selection” is not necessarily superior to that of randomly selecting unlabeled instances. If the current classifier has poor performance and wrongly assigns labels to some self-labeled instances, the final performance will be jeopardized due to the accumulation of mislabeled data. This is a general problem for methods based on the classifier's performance on the expanded data, including the original labeled data and the self-labeled data.

Since the originally labeled instances are generally more reliable than self-labeled instances, the performance on the former instances alone is more critical. Thus, we conjecture that a classifier must perform well on the original labeled data if it is to predict well on future data. More precisely, when the accuracy of the classifier evaluated on the original labeled data decreases, the accuracy on the future testing set generally degrades as well. Hence, utilizing the accuracy on the original labeled data to select more reliable unlabeled instances seems crucial to the final performance of semi-supervised learning.

In this paper, we present an effective instance selection method based on the original labeled data (ISBOLD) to improve the performance of self-training and co-training when using Naive Bayes (NB) as the base classifier. ISBOLD considers both the prediction confidence of the current classifier on the self-labeled
data and the accuracy on the original labeled data only. In each iteration, after the selection of the most confident unlabeled instances, the accuracy of the current classifier on the original labeled data is computed and then used to decide whether to add the selected instances to the training set in the next iteration. Experiments on 26 UCI datasets demonstrate that ISBOLD significantly improves the accuracy of self-training and co-training on 6 to 7 datasets and prevents the performance from being degraded on the other datasets, compared to our experimental results in [4]. Besides, ISBOLD significantly improves AUC on 8 to 9 datasets.

The rest of the paper is organized as follows. Section 2 briefly describes self-training and co-training algorithms and reviews related research work. A new instance selection method based on the original labeled data (ISBOLD) is presented in Section 3. Section 4 shows experimental results on 26 UCI datasets, as well as detailed performance analysis. Finally, Section 5 concludes the paper.
2 Related Work
Semi-supervised learning methods utilize unlabeled data to help to learn better classifiers when the amount of labeled training data is small. In the semi-supervised learning scenario, a set L of labeled training instances and a set U of unlabeled instances are given. In [13], a good survey of research work on several well-known semi-supervised learning methods is given. These algorithms and their variants are also analyzed and compared in [2]. Self-training and co-training are two common algorithms among them.
2.1 Self-training and Co-training Algorithms
Self-training works as follows [12]. A classifier is built from L and used to predict the labels for instances in U. Then the m instances in U on which the current classifier has the highest classification confidence are labeled and moved to enlarge L. The whole process iterates until stopped.

Co-training works in a similar way except that it is a two-view learning method [1]. Initially, the attribute set (view) is partitioned into two conditionally independent sub-sets (sub-views). A data pool U′ is created for each sub-view by randomly choosing some instances from U. On each sub-view, a classifier is built from the labeled data and then used to predict labels for the unlabeled data in its data pool. A certain number of unlabeled instances on which one classifier has high classification confidence are labeled and moved to expand the labeled data of the other classifier, and the same number of unlabeled instances is randomly moved from U to replenish U′. Then the two classifiers are rebuilt from their corresponding updated labeled data, respectively. The process iterates until stopped. In other words, co-training iteratively and alternately uses one classifier to help to “train” the other classifier. The stopping criterion in self-training and co-training is that either there is no unlabeled instance left or the maximum number of iterations has been reached.
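A compact sketch of the self-training loop with confidence selection (in Python; GaussianNB stands in for the discretized Naive Bayes used later in the paper, and m, max_iter, and the list-based data handling are our own choices):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def self_train(X_l, y_l, X_u, m=10, max_iter=80, make_clf=GaussianNB):
    """Repeatedly self-label the m most confident unlabeled instances."""
    X_l, y_l, X_u = list(X_l), list(y_l), list(X_u)
    clf = make_clf().fit(X_l, y_l)
    for _ in range(max_iter):
        if not X_u:
            break
        probs = clf.predict_proba(X_u)
        picked = np.argsort(probs.max(axis=1))[-m:]          # most confident instances
        labels = clf.classes_[probs[picked].argmax(axis=1)]  # their self-assigned labels
        X_l += [X_u[i] for i in picked]
        y_l += list(labels)
        X_u = [x for i, x in enumerate(X_u) if i not in set(picked)]
        clf = make_clf().fit(X_l, y_l)                       # retrain on the enlarged set
    return clf
```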
There are two assumptions in co-training to ensure good performance [1]: each sub-view is sufficient to build a good classifier, and the two sub-views are conditionally independent of each other given the class. These assumptions may be violated in real-world applications. In [8], it is stated that co-training still works when the attribute set is randomly divided into two separate subsets, although the performance may not be as good as when the attributes are split sufficiently and independently.
2.2 Variants of Self-training and Co-training Algorithms
Researchers have presented different variants of self-training and co-training algorithms. One kind of method uses all the unlabeled instances in each iteration, so that no selection criterion is needed. A self-training style method, semi-supervised EM, is presented in [9]: during each iteration, all the unlabeled instances are given predicted labels and then used to enlarge the training set and update the classifier. In [8], co-training is combined with EM to generate a new algorithm, co-EM, which in each iteration uses all the unlabeled instances instead of a number of instances picked from the data pool.

Another kind of method uses active learning to select unlabeled instances and then asks human experts to label them; hence, no mislabeled examples will occur, in principle. In [7], an active learning method is used to select unlabeled instances for the multi-view semi-supervised Co-EM algorithm, and labels are assigned to the selected unlabeled instances by experts. However, active learning methods are not applicable if no human experts are available.

Some researchers have also used different selection techniques to decide which unlabeled instances should be used in each iteration. In [10], the authors presented an adapted Value Difference Metric as the selection metric in self-training. In [5], a data editing method is applied to identify and remove the mislabeled examples from the self-labeled data.

In our previous work [4], an empirical study on 26 UCI datasets shows that, in self-training and co-training, using “confidence selection” does not always outperform randomly selecting unlabeled instances. If the classification performance of the current classifier is poor, wrong labels may be assigned to many unlabeled instances, and the final performance of semi-supervised learning will be affected accordingly. Generally speaking, the original labeled instances are more reliable than the instances labeled by the current classifier. Hence, the performance on the original labeled data is an important factor reflecting the final performance of semi-supervised learning.
3 Instance Selection Based on the Original Labeled Data
Motivated by the existing work, in this paper, we present a new method, Instance Selection Based on the Original Labeled Data (ISBOLD), to improve the
performance of self-training and co-training when using NB as the base classifier. The main idea of ISBOLD is to use the accuracy on the original labeled data only to prevent adding unlabeled instances that will possibly degrade the performance. How to use ISBOLD in the self-training and co-training scenarios is described in the following two subsections, respectively.
3.1 ISBOLD for Self-training
In order to describe our method, some notations are used here. In iteration t, we use Lt to denote the new labeled training set, Ct to represent the classifier built on Lt , and Acct as the accuracy of Ct on the original labeled data L0 . The detailed algorithm is shown in Figure 1.
1. Set t, the iteration counter, to 0.
2. Build a classifier Ct on the original labeled data L0.
3. Compute Acct, which is the accuracy of Ct on L0.
4. While the stopping criteria are not satisfied,
   (a) Use Ct to predict a label for each instance in U.
   (b) Generate Lst+1: select m unlabeled instances that Ct has high classification confidence, and assign a predicted label to each selected instance. Delete the selected instances from U.
   (c) Lt+1 = Lt ∪ Lst+1.
   (d) Build a classifier Ct+1 on Lt+1.
   (e) Compute Acct+1, which is the accuracy of Ct+1 on L0.
   (f) If Acct+1 < Acct, then Lt+1 = Lt, and rebuild Ct+1 on Lt+1.
   (g) Increase t by 1.
5. Return the final classifier.
Fig. 1. Algorithm of ISBOLD for self-training
The difference between ISBOLD and the common confidence selection method in self-training is displayed in steps 4(e) and 4(f). In iteration t + 1, after selecting the most confident unlabeled instances and assigning labels to them (for simplicity, the set of those selected instances is denoted as Lst+1), the training set Lt+1 = Lt ∪ Lst+1. Now we build a classifier Ct+1 on Lt+1 and compute Acct+1. If Acct+1 < Acct, Lt+1 is reset to be equal to Lt, and Ct+1 is updated on Lt+1 accordingly. The whole process iterates until there is no unlabeled instance left or the maximum number of iterations is reached. The reason that we remove Lst+1 from Lt+1 once the accuracy on L0 decreases is that if adding Lst+1 to the training set degrades the classifier’s performance on L0, it is very possible that the performance of the current classifier on the test set degrades as well. Hence, we use this method to roughly prevent possible performance degradation. Furthermore, notice that in step 4(b) all the selected instances are removed from U, so a selected instance is either added to the labeled data or discarded entirely.
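A sketch of the same loop with the ISBOLD check of steps 4(e)–4(f) added (it reuses the assumptions of the earlier self-training sketch, including GaussianNB as a stand-in for Naive Bayes):

```python
def isbold_self_train(X_l0, y_l0, X_u, m=10, max_iter=80, make_clf=GaussianNB):
    """Self-training that keeps newly self-labeled instances only if accuracy on L0 does not drop."""
    X_l, y_l, X_u = list(X_l0), list(y_l0), list(X_u)
    clf = make_clf().fit(X_l, y_l)
    acc = clf.score(X_l0, y_l0)                    # accuracy on the original labeled data L0
    for _ in range(max_iter):
        if not X_u:
            break
        probs = clf.predict_proba(X_u)
        picked = np.argsort(probs.max(axis=1))[-m:]
        labels = clf.classes_[probs[picked].argmax(axis=1)]
        X_cand = X_l + [X_u[i] for i in picked]
        y_cand = y_l + list(labels)
        X_u = [x for i, x in enumerate(X_u) if i not in set(picked)]   # selected instances leave U either way
        cand = make_clf().fit(X_cand, y_cand)
        cand_acc = cand.score(X_l0, y_l0)
        if cand_acc >= acc:                        # step 4(f): reject the additions if accuracy on L0 decreases
            X_l, y_l, clf, acc = X_cand, y_cand, cand, cand_acc
    return clf
```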
3.2 ISBOLD for Co-training
A similar selection method is used in co-training. We denote the classifiers on the two sub-views in iteration t as C^a_t and C^b_t. The algorithm is shown in Figure 2.
1. Set t, the iteration counter, to 0.
2. Randomly partition the attribute set Att into two separate sets Att_a and Att_b. Generate L^a_0 and L^b_0 from L. Generate U_a and U_b from U.
3. Generate data pools U'_a and U'_b by randomly choosing u instances from U_a and U_b, respectively.
4. Use L^a_0 to train a classifier C^a_t.
5. Use L^b_0 to train a classifier C^b_t.
6. Compute Acc^a_t, which is the accuracy of C^a_t on L^a_0.
7. Compute Acc^b_t, which is the accuracy of C^b_t on L^b_0.
8. While the stopping criteria are not satisfied,
   (a) Use C^a_t to predict a label for each instance in U'_a. Use C^b_t to predict a label for each instance in U'_b.
   (b) Generate L^as_{t+1}: select m unlabeled instances that C^b_t has high classification confidence, together with predicted labels. Delete the selected instances from U'_b.
   (c) Generate L^bs_{t+1}: select m unlabeled instances that C^a_t has high classification confidence, together with predicted labels. Delete the selected instances from U'_a.
   (d) L^a_{t+1} = L^a_t ∪ L^as_{t+1}. L^b_{t+1} = L^b_t ∪ L^bs_{t+1}.
   (e) Use L^a_{t+1} to train a classifier C^a_{t+1}.
   (f) Compute Acc^a_{t+1}, which is the accuracy of C^a_{t+1} on L^a_0.
   (g) If Acc^a_{t+1} < Acc^a_t, then L^a_{t+1} = L^a_t, and rebuild C^a_{t+1} on L^a_{t+1}.
   (h) Use L^b_{t+1} to train a classifier C^b_{t+1}.
   (i) Compute Acc^b_{t+1}, which is the accuracy of C^b_{t+1} on L^b_0.
   (j) If Acc^b_{t+1} < Acc^b_t, then L^b_{t+1} = L^b_t, and rebuild C^b_{t+1} on L^b_{t+1}.
   (k) Randomly move m instances from U_a to replenish U'_a. Randomly move m instances from U_b to replenish U'_b.
   (l) Increase t by 1.
Fig. 2. Algorithm of ISBOLD for co-training
The difference between ISBOLD and the common confidence selection method in co-training is displayed in steps 8(f), 8(g), 8(i) and 8(j). In iteration t + 1, on sub-view a, after selecting a certain number of unlabeled instances on which C^b_t has high classification confidence, a label is assigned to each selected instance (for simplicity, the set of those selected instances is denoted as L^as_{t+1}). Then L^a_{t+1} = L^a_t ∪ L^as_{t+1}, and C^a_{t+1} is built on L^a_{t+1}. Now we compute Acc^a_{t+1}, which is the accuracy of C^a_{t+1} on L^a_0. If Acc^a_{t+1} < Acc^a_t, then L^a_{t+1} = L^a_t and C^a_{t+1} is updated accordingly. The same steps are repeated on sub-view b to generate L^b_{t+1} and C^b_{t+1}. New unlabeled instances will be replenished from the remaining
unlabeled data part to the data pool of each sub-view. The whole process iterates until there is no unlabeled instance left or the maximum number of iterations is reached.
4 Experimental Results and Analysis
4.1 Experimental Settings
In order to examine the performance of ISBOLD, we conducted experiments on 26 UCI datasets, including 18 binary-class datasets and 8 multi-class datasets. These datasets are downloaded from a package of 37 classification problems, “datasets-UCI.jar”1. Each dataset is then preprocessed in the Weka software [11] by replacing missing values, discretizing numeric attributes, and removing any attribute whose number of values is almost equal to the number of instances in the dataset [4]. We only use 26 datasets out of the package because the other 11 datasets have extremely skewed class distributions. For example, in the hypothyroid dataset, the frequency of each class value is 3481, 194, 95 and 2, respectively. When randomly sampling the labeled data set in semi-supervised learning, classes with very small frequencies may not appear in some generated datasets if we want to keep the same class distributions. Usually researchers merge the minority classes into a majority class or simply delete instances of the minority classes. However, to minimize any possible influence, we ignored those datasets with extremely skewed class distributions. The 26 datasets are the same as those used in our previous work [4].

On each dataset, 10 runs of 4-fold stratified cross-validation are conducted. That is, 25% of the original data is put aside as the testing set to evaluate the performance of the learning algorithms. The remaining 75% of the data is divided into labeled data (L) and unlabeled data (U) according to a pre-defined percentage of labeled data (lp). The data splitting setting follows those in [1,4,5,6]. In our experiments, lp is set to 5%. Therefore, 25% of the data is kept as the testing set, 5% of the remaining 75% is randomly sampled as L, and the other 95% of that 75% is saved as U. When generating L, we made sure that L and the original training data had the same class distributions.

Naive Bayes is used in self-training and co-training. The maximum number of iterations in both is set to 80. The size of the data pool in co-training is set to 50% of the size of U. Accuracy and AUC are used as performance measurements. In our experiments on co-training, the attributes are randomly split into two subsets.
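A sketch of this data-splitting scheme (function and parameter names are ours; scikit-learn utilities are used purely for illustration, not as the paper's tooling):

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

def semi_supervised_splits(X, y, runs=10, folds=4, lp=0.05, seed=0):
    """Yield (labeled, unlabeled, test) splits: 25% test, lp of the remaining 75% labeled."""
    for run in range(runs):
        skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed + run)
        for train_idx, test_idx in skf.split(X, y):
            X_tr = [X[i] for i in train_idx]
            y_tr = [y[i] for i in train_idx]
            X_lab, X_unlab, y_lab, _ = train_test_split(
                X_tr, y_tr, train_size=lp, stratify=y_tr, random_state=seed + run)
            test = ([X[i] for i in test_idx], [y[i] for i in test_idx])
            yield (X_lab, y_lab), X_unlab, test
```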
4.2 Results Analysis
Performance comparison results of using ISBOLD and using the common “confidence selection” method in self-training and co-training are shown in Table 1 and Table 2. For simplicity, the methods are denoted as ISBOLD and CF
1 They are available from http://www.cs.waikato.ac.nz/ml/weka/
Table 1. Accuracy of CF vs ISBOLD in self-training and co-training

(a) self-training

Dataset          CF     ISBOLD
balance-scale    59.52  66.21
breast-cancer    65.09  65.61
breast-w         96.67  96.34
colic            74.54  75.38
colic.ORIG       55.05  60.57
credit-a         80.68  80.78
credit-g         60.62  66.03 v
diabetes         70.55  70.53
heart-c          81.55  81.15
heart-h          83.06  82.41
heart-statlog    81.37  80.74
hepatitis        79.70  78.34
ionosphere       80.97  79.86
iris             90.31  90.05
kr-vs-kp         67.26  80.07 v
labor            88.26  87.92
letter           40.38  57.39 v
mushroom         91.90  92.57 v
segment          63.49  72.88 v
sick             91.54  94.15
sonar            55.72  57.93
splice           82.05  85.48 v
vehicle          41.79  48.35
vote             87.89  88.53
vowel            18.75  21.78
waveform-5000    77.98  78.87
mean             71.80  74.61
w/t/l                   6/20/0

(b) co-training

Dataset          CF     ISBOLD
balance-scale    59.10  67.17
breast-cancer    70.41  71.00
breast-w         96.85  96.47
colic            76.60  75.76
colic.ORIG       55.19  62.04
credit-a         81.36  79.67
credit-g         63.04  67.72 v
diabetes         67.51  69.58
heart-c          82.77  80.13
heart-h          81.46  78.60
heart-statlog    82.03  80.30
hepatitis        81.04  80.21
ionosphere       81.50  83.08
iris             80.79  78.98
kr-vs-kp         59.22  77.36 v
labor            77.21  78.43
letter           36.67  56.05 v
mushroom         91.74  92.38 v
segment          61.49  71.64 v
sick             93.40  93.56
sonar            55.43  58.08
splice           73.91  82.63 v
vehicle          41.57  47.86
vote             88.21  88.60
vowel            18.83  23.36
waveform-5000    71.61  75.91 v
mean             70.34  73.71
w/t/l                   7/19/0
in the tables. In each table, the figures in each row are the average accuracy or AUC over 10 runs of 4-fold cross-validation on the corresponding dataset. Row “w/t/l” indicates that using ISBOLD in the corresponding column wins on w datasets (marked by ‘v’), ties on t datasets, and loses on l datasets (marked by ‘*’) against using “confidence selection” in self-training or co-training, under a two-tailed pair-wise t-test with a significance level of 95%. Values in row “mean” are the average accuracy or AUC over the 26 datasets.

Table 1(a) shows the average accuracy of using ISBOLD and CF in self-training. The “w/t/l” t-test results show that ISBOLD significantly improves classification accuracy on 6 datasets. Values in row “mean” also demonstrate that ISBOLD improves the average performance. Table 1(b) shows the average accuracies in co-training. The “w/t/l” t-test results show that ISBOLD significantly improves the performance of co-training on 7 datasets, and the mean value increases from 70.34 to 73.71.
Table 2. AUC of CF vs ISBOLD in self-training and co-training

(a) self-training

Dataset          CF     ISBOLD
balance-scale    61.37  66.68
breast-cancer    63.98  63.48
breast-w         99.07  99.08
colic            79.24  78.43
colic.ORIG       51.62  58.49
credit-a         86.81  86.79
credit-g         56.56  65.24 v
diabetes         78.03  76.36
heart-c          83.97  83.92
heart-h          83.74  83.74
heart-statlog    88.93  88.64
hepatitis        83.02  80.99
ionosphere       86.86  86.68
iris             98.33  98.29
kr-vs-kp         74.65  89.03 v
labor            96.59  96.72
letter           86.09  93.08 v
mushroom         98.04  98.81 v
segment          90.86  95.24 v
sick             91.51  93.96
sonar            58.64  62.21
splice           94.40  96.23 v
vehicle          59.63  66.95 v
vote             96.31  96.52
vowel            57.65  64.49 v
waveform-5000    88.85  90.96 v
mean             80.57  83.12
w/t/l                   9/17/0

(b) co-training

Dataset          CF     ISBOLD
balance-scale    60.44  65.34
breast-cancer    63.51  64.37
breast-w         99.22  99.19
colic            78.99  79.08
colic.ORIG       49.62  55.82
credit-a         88.05  86.35
credit-g         55.33  61.62
diabetes         72.61  74.95
heart-c          84.02  83.80
heart-h          83.77  83.50
heart-statlog    90.03  88.03
hepatitis        78.38  73.19
ionosphere       87.89  88.92
iris             93.21  92.27
kr-vs-kp         66.86  86.39 v
labor            87.76  85.18
letter           82.98  92.57 v
mushroom         97.89  98.75 v
segment          87.93  94.82 v
sick             87.74  93.83
sonar            59.59  62.93
splice           88.65  94.87 v
vehicle          59.56  67.09 v
vote             96.31  96.46
vowel            57.97  66.44 v
waveform-5000    84.22  89.54 v
mean             78.56  81.74
w/t/l                   8/18/0
Comparison results on AUC in self-training and co-training are displayed in Table 2. It can be observed that, using ISBOLD, the AUC of self-training is significantly improved on 9 datasets, and the mean value increases from 80.57 to 83.12. Similarly, the AUC of co-training is sharply improved on 8 datasets, and the mean value improves from 78.56 to 81.74.
4.3 Learning Curves Analysis
Based on our previous work [4], we conjecture that the classifier should have good prediction performance on the testing set if the accuracy on the original labeled data does not degrade. To verify our conjecture and to further examine the performance of ISBOLD during each iteration, learning curves from a single randomly chosen run of the two self-training methods on the datasets vehicle and kr-vs-kp are displayed in
Figure 3 and Figure 4, respectively. The data splitting setting is the same as that in Subsection 4.1. Curves for co-training are omitted here due to space limitations. On each graph, at each iteration t, the accuracy values of classifier Ct on the original labeled data L0 and on the testing set are displayed for ISBOLD and for CF, respectively. Curves “ISBOLD-L0” and “ISBOLD-test” show accuracy values on the original labeled data L0 and on the testing set, respectively, when using ISBOLD in self-training on the dataset. Curves “CF-L0” and “CF-test” display accuracy values on L0 and on the testing set, respectively, when using “confidence selection” in self-training on the dataset.
Fig. 3. Learning curves on the vehicle dataset
Fig. 4. Learning curves on the kr-vs-kp dataset
According to our conjecture, when the accuracy on the original labeled data L0 decreases, the accuracy on the corresponding testing set generally decreases as well. This is actually observed on the trends of curve “CF-L0” and curve “CF-test” in Figure 3 and Figure 4. Curve “CF-test” generally goes down when curve “CF-L0” goes down.
ISBOLD is based on our conjecture that the classifier will have good prediction performance on the testing set if its accuracy on the original labeled data does not degrade during each iteration. As shown in Figure 3 and Figure 4, comparing the curves for the “confidence selection” method to those for ISBOLD, ISBOLD can sharply improve the accuracy on the testing set while improving it on L0. When the accuracy on L0 does not degrade, the final accuracy on the testing set does not significantly decrease. These observations confirm that using the accuracy on the original labeled data to decide whether or not to accept the selected unlabeled instances into the next iteration is an effective way to improve the performance of semi-supervised learning.
5 Conclusions and Future Work
In this paper, we presented a new instance selection method, ISBOLD, to improve the performance of self-training and co-training when using NB as the base classifier. During each iteration, after selecting a number of unlabeled instances on which the current classifier has high classification confidence, we use the accuracy of the current classifier on the original labeled data to decide whether to accept the selected unlabeled instances into the labeled training set in the next iteration. Experiments on 26 UCI datasets show that ISBOLD can significantly improve the performance of self-training and co-training on many datasets. The learning curve analysis gives a vivid demonstration and experimentally supports the feasibility of our method.

In future work, we will try different base classifiers, such as non-naive Bayesian classifiers and decision trees, and extend the method to more semi-supervised learning methods. Besides, theoretical analysis will also be conducted to help understand the behaviour of the method. Based on this work, we will present new methods to improve the performance of semi-supervised learning.
References
1. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proceedings of the 1998 Conference on Computational Learning Theory (1998)
2. Chapelle, O., Schölkopf, B., Zien, A. (eds.): Semi-supervised learning. MIT Press, Cambridge (2006)
3. Cozman, F.G., Cohen, I.: Unlabeled data can degrade classification performance of generative classifiers. In: Proceedings of the Fifteenth International Florida Artificial Intelligence Research Society Conference (2002)
4. Guo, Y., Niu, X., Zhang, H.: An extensive empirical study on semi-supervised learning. In: The 10th IEEE International Conference on Data Mining (2010)
5. Li, M., Zhou, Z.H.: SETRED: self-training with editing. In: Proceedings of the Advances in Knowledge Discovery and Data Mining (2005)
6. Ling, C.X., Du, J., Zhou, Z.H.: When does co-training work in real data? In: Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (2009)
7. Muslea, I., Minton, S., Knoblock, C.A.: Active + semi-supervised learning = robust multi-view learning. In: Proceedings of the Nineteenth International Conference on Machine Learning (2002)
8. Nigam, K., Ghani, R.: Analyzing the effectiveness and applicability of co-training. In: Proceedings of the 9th International Conference on Information and Knowledge Management (2000)
9. Nigam, K., McCallum, A.K., Thrun, S., Mitchell, T.: Text classification from labeled and unlabeled documents using EM. Machine Learning 39, 103–134 (2000)
10. Wang, B., Spencer, B., Ling, C.X., Zhang, H.: Semi-supervised self-training for sentence subjectivity classification. In: The 21st Canadian Conference on Artificial Intelligence, pp. 344–355 (2008)
11. Witten, I.H., Frank, E. (eds.): Data mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
12. Yarowsky, D.: Unsupervised word sense disambiguation rivaling supervised methods. In: Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pp. 189–196 (1995)
13. Zhu, X.J.: Semi-supervised learning literature survey (2008)
Determining an Optimal Seismic Network Configuration Using Self-Organizing Maps

Machel Higgins1, Christopher Ward1, and Silvio De Angelis2

1 The University of the West Indies
2 University of Washington
Abstract. The Seismic Research Centre, University of the West Indies, operates a seismic network that performs suboptimally in detecting, locating, and correctly determining the magnitude of earthquakes, due to a diverse constitution of seismometers and the utilization of a site selection process that approximates an educated guess. My work seeks to apply Self-Organizing Maps (SOM) to arrive at the optimal network configuration and aid in site selection.
1 Introduction
The University of the West Indies, Seismic Research Centre (SRC), currently employs a network of seismometers, the Eastern Caribbean Seismic Network (ECSN), in all English-speaking countries in the Eastern Caribbean, spanning the length of the island arc from Anguilla to Trinidad. The ECSN has been upgraded in stages since 1956 and currently comprises seismometers of differing capabilities with regard to monitoring earthquakes and volcanoes. In upgrading the seismic network, the process of selecting new sites involved creating a denser and more evenly spaced seismic network. This heterogeneous mix of seismometers, along with the site selection process that was used, has resulted in a seismic network configuration that may not be optimized in its ability to detect, locate, and correctly determine the magnitude of earthquakes.

This work’s intention is to determine a seismic network configuration with the best magnitude detection capability. That is, the seismic network should be able to detect most or all earthquakes of appreciable size that the physical characteristics of the region will allow, and record the largest magnitude earthquakes with the fewest sensors being saturated (clipping).

Several bodies of work and publications exist that examine the problem of optimizing seismic network configurations. Most of these previous works have considered the site selection process by employing ideas from optimal experimental design in addressing the problem proposed by Kijko [3]. Steinberg et al. [1,2] extended this idea to incorporate a statistical approach by minimizing the error of hypocentre locations, employing the D-criterion, which takes into account multiple seismic event sources. Hardt and Scherbaum [4] used Simulated Annealing to determine an optimal seismic network for one event source and for aftershock investigations. Bartal et al. have presented an approach, the use of a Genetic Algorithm,
that produces similar results to Steinberg’s and allows for more flexibility in the space that a station may occupy and in the number of event sources. Unfortunately, the parameters for the genetic algorithm become unwieldy when many more event sources are added.

Most of the methods previously mentioned attempt to optimize the seismic network by minimizing the location error. This is only appropriate for very small networks with few seismic sources. With regard to the ECSN, the seismic array spans a large region with heterogeneous layer-velocity models and several seismic sources, so attempting to model phase arrivals, their errors, and the network’s efficacy becomes intractable. These methods can be altered to minimize the detected magnitude instead of the error in phase arrivals, but they remain inapplicable to a reconfiguration of the ECSN. With the exception of Bartal [5] and Hardt [4], the methods are strictly tied to the creation of a rectangular grid where stations are allowed to move between grid points around one or more epicentres. This is inappropriate for the region that the ECSN monitors: stations are placed on a slivered island archipelago surrounded by seismicity.

To determine the minimum magnitude detection threshold of the ECSN, Brune source modelling [6,7] was carried out. A region of 890 x 890 km centred on the Eastern Caribbean island arc is divided into a grid, where, for each grid cell, the rupture dimensions [6,7,9] and the expected shear wave amplitudes and corner frequencies for earthquake magnitudes from Mw = 2 to Mw = 8 were modelled. Each amplitude is compared to all sites to determine, with attenuation from grid cell to site applied [8], the minimum amplitude that exceeds the ambient noise level in the same frequency band. Shear wave amplitudes with corner frequencies below the sites’ seismometer cut-off frequencies were disregarded. The noise in the band of 0.01 Hz to 50 Hz for all sites was analyzed using waveform data over a period of three months. Not all sites are equipped with seismometers with the frequency response ranges needed to investigate the noise in the band of interest, so cubic spline interpolation was carried out wherever interpolated points were on the order of one surface wave wavelength from real data points. Fortunately, with a nominal surface wave velocity of ~3.5 km/s [10] for the Eastern Caribbean, all sites satisfied this requirement.

An earthquake magnitude probability function was created with the intention of establishing the efficacy of the ECSN’s minimum magnitude detection threshold. A complete earthquake catalogue, from the various agencies monitoring seismicity within the region, was compiled and aggregated to cluster mainshock and aftershock seismic events. From this compilation, probability density functions per magnitude range were derived for each cell of the grid previously identified in the determination of the minimum magnitude threshold.
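A heavily simplified sketch of the threshold computation just described (Python; brune_model, attenuate, noise_amp, cutoff_freq, and the min_sites criterion are all placeholders standing in for the published source, attenuation, and noise models cited above):

```python
def min_detectable_magnitude(cell, sites, magnitudes, brune_model, attenuate,
                             noise_amp, cutoff_freq, min_sites=1):
    """Smallest Mw whose attenuated shear-wave amplitude exceeds ambient noise
    at at least min_sites sites; returns None if no magnitude qualifies."""
    for mw in sorted(magnitudes):
        amp, fc = brune_model(mw)                  # modelled amplitude and corner frequency
        detecting = 0
        for s in sites:
            if fc < cutoff_freq[s]:                # outside the seismometer's response band
                continue
            if attenuate(amp, cell, s) > noise_amp[s]:
                detecting += 1
        if detecting >= min_sites:
            return mw
    return None
```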
2 Optimizing Seismic Network with SOM
This project intends to minimize the earthquake magnitude detection threshold by optimizing the seismic network configuration via the application of a Self-Organizing Map (SOM). A SOM was chosen because of its ability to sort, order
and classify data [11]. To accomplish these tasks the SOM has been used extensively in data mining, signal processing and pattern recognition [11]; its prevalence is due to its simple algorithm and its ability to transform input that is non-linear and arbitrary in dimension into a low-dimensional output [11]. For these reasons, and because the solution relates directly to real-world space, it is advantageous to implement a SOM to solve the problem of optimizing a seismic network configuration. In minimizing the magnitude detection threshold, the following inputs are considered: the earthquake magnitude probability distribution; a pool of seismometers with known dynamic ranges, frequency responses and sensitivities; sites and their ambient noise characteristics; and volcano locations. A two-dimensional weight map is created to be equivalent in scale to, but denser than, the grid created to determine the minimum magnitude detection threshold. In this map each unit consists of a weight vector whose elements represent a seismometer chosen randomly from a defined pool and an associated site with its characteristics. The association of seismometers to units is dynamic, while a unit is fixed to the closest site. The weight vector of a unit is ultimately combined into an overall weight called the sensor-site suitability (SSS), derived from the seismometer's ability to detect earthquake magnitudes: Brune source modelling is performed for a surrounding region whose radius is the distance at which a magnitude's shear-wave amplitude is attenuated by 5 decibels when scaled by the seismometer's benchmarked sensitivity. The resulting SSS is an index between 0 and 1 and serves as the overall weight used to compare units during the SOM operations. SSS values are also determined for inputs by combining the input's vector through a comparison of the seismometer's characteristics to an idealized site located in a region where the probability of an earthquake occurrence of any magnitude is the average of the entire region.
Fig. 1. The SOM Network
A typical SOM generates its feature map by updating units' weights through competitive learning: selecting an input value and determining the unit, called the Best Matching Unit (BMU), whose weight vector most closely matches that of the input value. The BMU's weight and the weights of surrounding units are then scaled to be more similar to the input value. The training in the SOM applied to the problem at hand takes a different tactic, in that the units in the BMU's neighbourhood are trained not to become more similar to the input but instead to move toward the
complement of the BMU. This tactic serves to direct the decision borders so that two or more seismometers of equivalent capabilities are not sited close to each other. Another tactic implemented is that a unit may quickly become resistant to training once it is deemed to have perfect suitability for monitoring a volcano in its vicinity. These tactics have been employed in other works [12], where it has been shown that map formation and a solution are possible.
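A minimal sketch of the modified update rule described above: the BMU is pulled toward the input, while its neighbours are pushed toward the complement of the BMU's weight, discouraging equivalent sensors from settling next to each other. The scalar SSS weight in [0, 1] and the Gaussian neighbourhood function are illustrative choices, not the authors' implementation.

```python
import numpy as np

def train_step(weights, x, lr, sigma):
    """One update of a SOM whose units carry a scalar SSS weight in [0, 1].

    weights : 2-D array of unit weights (the map);  x : SSS of the input.
    """
    # Best Matching Unit: the unit whose weight is closest to the input.
    bmu = np.unravel_index(np.argmin(np.abs(weights - x)), weights.shape)

    rows, cols = np.indices(weights.shape)
    grid_dist2 = (rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2
    h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))   # neighbourhood strength

    # Standard SOM training would move every unit toward x.  Here only the
    # BMU does; its neighbours move toward the complement of the BMU's
    # weight, as described in Sect. 2.
    target = np.full_like(weights, 1.0 - weights[bmu])
    target[bmu] = x
    weights += lr * h * (target - weights)
    return weights
```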
3 Discussion
To validate the results, the conventional measure of self-organization [13], the topographic error of the feature map, will be ascertained by finding the average similarity of each unit. If self-organization has been achieved, the units' seismometers are then assigned to sites. The minimum magnitude detection threshold is then calculated for the resulting reconfigured seismic network. To date, there has been moderate self-organization, and further refinement of the neighbourhood update function is necessary. It is hoped that, with a successful SOM, not only will the best magnitude detection threshold be achieved but also new sites can be selected through this scheme.
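One possible reading of that validation step, scoring self-organization by the average similarity between each unit's weight and those of its grid neighbours; the paper does not give an exact formula, so this is only an interpretation of "average similarity of each unit".

```python
import numpy as np

def average_neighbour_similarity(weights):
    """Mean similarity between each unit and its right/down neighbours.

    Similarity is taken as 1 - |w_i - w_j| for scalar weights in [0, 1];
    higher values suggest a smoother, better-organized map.  Purely an
    illustrative stand-in for the check described in the text.
    """
    sims = []
    n_rows, n_cols = weights.shape
    for r in range(n_rows):
        for c in range(n_cols):
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < n_rows and cc < n_cols:
                    sims.append(1.0 - abs(weights[r, c] - weights[rr, cc]))
    return float(np.mean(sims))
```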
References
1. Rabinowitz, N., Steinberg, D.M.: Optimal configuration of a seismographic network: A statistical approach. Bull. Seism. Soc. Am. 80(1), 187–196 (1990)
2. Steinberg, D.M., et al.: Configuring a seismograph network for optimal monitoring of fault lines and multiple sources. Bull. Seism. Soc. Am. 85(6), 1847–1857 (1995)
3. Kijko, A.: An algorithm for the optimum distribution of a regional seismic network - I. Pageoph. 115, 999–1009 (1977)
4. Hardt, M., Scherbaum, F.: The design of optimum networks for aftershock recordings. Geophys. J. Int. 117, 716–726 (1994)
5. Bartal, Y., et al.: Optimal Seismic Networks in Israel in the Context of the Comprehensive Test Ban Treaty. Bull. Seism. Soc. Am. 90(1), 151–165 (2000)
6. Brune, J.N.: Tectonic stress and the spectra of seismic shear waves from earthquakes. J. Geophys. Res. 75, 4997–5009 (1970)
7. Brune, J.N.: Correction. J. Geophys. Res. 76, 5002 (1971)
8. Brune, J.N.: Attenuation of dispersed wave trains. Bull. Seism. Soc. Am. 53, 109–112 (1962)
9. Kanamori, H.: The energy release in great earthquakes. J. Geophys. Res. 82, 2981–2986 (1977)
10. Beckles, D., Shepherd, J.B.: A program for estimating the hypocentral coordinates of regional earthquakes. Presented at the First Meeting of the Asociación Ibero-Latinoamericana de Geofísica (1977)
11. Kohonen, T.: The Self-Organizing Map. Proceedings of the IEEE 78, 1464–1480 (1990)
12. Neme, A., Hernández, S., Neme, O., Hernández, L.: Self-Organizing Maps with Non-cooperative Strategies. In: Príncipe, J.C., Miikkulainen, R. (eds.) WSOM 2009. LNCS, vol. 5629, pp. 200–208. Springer, Heidelberg (2009)
13. Bauer, H., Herrmann, M., Villmann, T.: Neural Maps and Topographic Vector Quantization. Neural Networks 12(4-5), 659–676 (1999)
Comparison of Learned versus Engineered Features for Classification of Mine Like Objects from Raw Sonar Images Paul Hollesen1 , Warren A. Connors2 , and Thomas Trappenberg1 1
Department of Computer Science, Dalhousie University {hollense,tt}@cs.dal.ca 2 Defence Research and Development Canada [email protected]
Abstract. Advances in high frequency sonar have provided increasing resolution of sea bottom objects, providing higher fidelity sonar data for automated target recognition tools. Here we investigate if advanced techniques in the field of visual object recognition and machine learning can be applied to classify mine-like objects from such sonar data. In particular, we investigate if the recently popular Scale-Invariant Feature Transform (SIFT) can be applied for such high-resolution sonar data. We also follow up our previous approach in applying the unsupervised learning of deep belief networks, and advance our methods by applying a convolutional Restricted Boltzmann Machine (cRBM). Finally, we now use Support Vector Machine (SVM) classifiers on these learned features for final classification. We find that the cRBM-SVM combination slightly outperformed the SIFT features and yielded encouraging performance in comparison to state-of-the-art, highly engineered template matching methods.
1 Introduction
Naval mine detection and classification is a difficult, resource-intensive task. Mine detection and classification is dependent on the training and skill level of the human operator, the resolution and design of the sonar, and the environmental conditions in which the mines are detected. Research has occurred over the last 25 years into both sensor development and the processing of sonar data. Although the sensors and capability of mine countermeasures platforms have improved in this time, the issues of operator overload and fatigue have caused the duty cycles of mine detection and classification to be short, therefore diminishing the effectiveness of Mine Counter Measures (MCM) platforms. Recent research focuses on the development of computer-aided tools for the detection and classification of bottom objects [1,2,3]. This typically takes the form of a detection phase, where mine-like objects are selected from the seabed image, and a classification phase, where the objects are fitted to a multi-class set of potential mines. This detection and classification process has typically been implemented using a set of image processing tools (Z-test, matched filter), feature extraction, and template-based classification [1,2,3]. These techniques are effective at finding mines, but are sensitive to the tuning of the parameters for the processing method and to the sea bottom environment under test [2,3].
Learning algorithms, such as Artificial Neural Networks, have been examined for the mine problem; however, success has been limited, and these methods have required the training sets to closely reflect the sea bottom environment of the area where the system will be tested. Earlier work includes using a deep belief network (DBN) [4], which is a stack of multiple Restricted Boltzmann Machines (RBM) [5], to learn to extract features from side scan sonar data. This technique was successful in detecting mines with comparable performance to the traditional methods [4]. The RBM learning method is effective; however, the Scale-Invariant Feature Transform (SIFT) [6] has been very influential recently in vision and image processing, and has been applied successfully to numerous image processing and feature extraction tasks. At the same time, while the original work on RBM/DBN structures for feature learning and classification has shown the power of the DBN for feature extraction, no consideration was given to recent developments with the DBN model, including sparseness constraints and a convolutional variation [4]. Imposing sparsity on the RBM regularizes the learned model by decreasing the weights of nodes whose activity exceeds a prescribed sparsity, therefore simplifying the model it learns and providing a more compact representation of the input. The convolutional approach allows the model to scale to high-resolution imagery and further regularizes the model by reducing the parameter space. This paper compares feature extraction using SIFT versus a convolutional RBM (cRBM) for the mine classification problem. This also serves to examine how well SIFT generalizes to application domains analogous to visual wavelength imagery. A central argument for using learned rather than carefully selected, contrived features is the ability to apply the model to diverse application domains. This is an interesting domain to explore in this context, as there is a natural 2D, grayscale representation for sonar data, and the mine classification task contains most of the same challenges as generic object recognition: invariance to translation, rotation, luminance, clutter, and noise. Both techniques were applied to a series of sonar images to extract features, with the output fed to a Support Vector Machine (SVM) for training and classification. As the goal of this effort is to develop a classification system for sonar images of sea bottom objects with comparable performance to highly contrived methods, each technique is treated as a feature extraction method, with the features being passed to an SVM for training and classification. The results were compared to state-of-the-art template matching methods, with encouraging results for correct classification of targets.
2 Synthetic Aperture Sonar Imagery
Traditional side scan sonar imagery (e.g. Figure 1a) depicts objects by a strong bright region (highlight) where the object is insonified by sound waves, followed by a dark region (shadow) cast behind it. The size, shape and disposition of such features are important for both automated and manual methods in mine classification. This imagery may be littered with background noise coming from natural and artificial sources. In imaging sonars, range resolution is mostly determined by the bandwidth of the transmit pulse, while azimuthal resolution is determined by the length of the receiver array. While
the bandwidth in modern sonars is sufficient to achieve a high range resolution, azimuthal resolution is difficult to improve due to the engineering limitations of constructing long arrays. Synthetic Aperture Sonar (SAS) is a recent side scan sonar technique being applied to detecting and classifying mine-like objects. This technology is inspired by synthetic aperture radar, which is commonly used on terrestrial and space-based radar sensors. Synthetic Aperture Sonar is a technique whereby a longer array length is synthesized by integrating a number of sonar pings in the direction of travel of the sonar, resulting in improved resolution which is also independent of range (e.g. Figure 1b). This provides a powerful tool for the mine detection/classification problem, as the higher fidelity images allow for a richer set of features for the detection and classification of a sea bottom object.
Fig. 1. Sonar images of mine-like objects, showing (a) a Side Scan Sonar image from and (b) an image from the MUSCLE Synthetic Aperture Sonar (SAS) [7] of the same type of object
The data used in this paper was collected by the NATO Undersea Research Center [7] on the MUSCLE Autonomous Underwater Vehicle (AUV). This vehicle is equipped with a 300 kHz SAS. The SAS gives a 2.5 cm × 2.5 cm resolution at up to 200 meters in range.
2.1 Data Preparation
The SAS dataset was collected in the summer of 2008 off Latvia in the Baltic Sea. The MUSCLE vehicle was used to survey multiple mine-like targets that were deployed as part of the trial, including multiple sonar passes over each target in the field from different angles. The targets were three mine-like shapes: a cylinder (2.0 m × 0.5 m), a truncated cone (1.0 m base, 0.5 m height), and a wedge shape (1.0 m × 0.6 m × 0.3 m). Clutter included numerous rocks and boulders, geographic features of the sea bottom, and a specific rock which was chosen due to its similarity in shape to the truncated cone. The dataset was composed of 65 cylinders, 69 truncated cones, 37 wedges, and 2218 non-mine clutter objects, including 47 rock images that are highly correlated to a target shape.
The raw SAS data can contain as much as ten times the data of a side scan sonar, and the maximum and minimum values for these samples describe a very large dynamic range for the sensor. The data is organized in complex values which describe the amplitude and phase data from the sonar. Although it is appealing to examine the phase component of the SAS data, it is beyond the scope of this work, and is considered in the Outlook section as future work. The data was prepared by removing the phase component, then re-mapping the amplitude component to a decibel (dB) scale.
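A minimal sketch of that preparation step, assuming the raw SAS samples arrive as a complex-valued array; the reference level and the dynamic-range floor are illustrative choices not specified in the paper.

```python
import numpy as np

def sas_to_db(raw, floor_db=-60.0):
    """Drop the phase and re-map SAS amplitudes to a decibel scale."""
    amplitude = np.abs(raw)                        # discard the phase component
    ref = amplitude.max() if amplitude.max() > 0 else 1.0   # assumed: per-image max as reference
    db = 20.0 * np.log10(np.maximum(amplitude, 1e-12) / ref)
    return np.clip(db, floor_db, 0.0)              # limit the dynamic range (assumption)
```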
3 Feature Extraction
Feature extraction is a difficult and error-prone task that is typically performed manually. This process is done through an analysis of the sonar data and careful selection of characteristics which help to describe the class of the object. Modern methods have looked to reduce this complexity by automatically selecting features from the data through decomposition (e.g. PCA, Wavelet) [1] in order to have a set of features that uniquely describe the object and can be used directly for training. Methods such as SIFT have been effective for automated feature extraction, and have been applied to both visual and acoustic images to select a set of features for the object [8]. Learning methods are appealing for feature extraction, specifically generative models, as they learn the dominant features of the images they are trained on, and build an internal representation of the elements the object should be composed of. This allows for an unsupervised approach where many images are shown to the learning method, and the learning method determines the features to be selected and modelled. Furthermore, the generative models have the added advantage that it is possible to see the feature filters which have been learned, which gives the researcher a measure of the progress of the learning. With either method, the goal is the same: we wish to select the most descriptive features for the class to provide the least ambiguous training set to the SVM, allowing it to find easily separable classes and perform effectively as a target classifier compared to existing manual methods.
3.1 Scale Invariant Feature Transform (SIFT)
SIFT [6] is a method for feature extraction that is invariant to scale, orientation and distortions. Briefly, SIFT convolves a Difference of Gaussians (DoG) filter with the input at multiple scales to detect image gradients (edges). In the case of dense SIFT feature extraction as employed here, a 128-dimensional feature vector is generated for each overlapping window of the input image by computing orientation histograms with 8 bins for each of 16 subregions (4x4) of the window. For a more detailed description, the interested reader is referred to [6]. We employ dense SIFT feature extraction and explore window sizes ranging from 12 to 24 pixels wide (i.e. spatial bins from 3 to 6 pixels), spaced from 4 to 12 pixels apart. Similar to our cRBM experiments described later, best results were obtained with the maximum possible window size (24x24) for this data (93x24), spanning the full width of the image. Using a window spacing of 6 pixels results in 13 windows spaced along the length of the image, with 128 features per window, for 1664 features per image (roughly equal in size to the cRBM representation).
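A sketch of dense SIFT extraction over windows spanning the image width and spaced 6 pixels along its length, using OpenCV's SIFT implementation as a stand-in (the paper does not name the SIFT library it used).

```python
import cv2
import numpy as np

def dense_sift(image, window=24, step=6):
    """SIFT descriptors computed on a fixed grid of keypoints.

    image : 2-D uint8 array (e.g. the 93x24 downsampled sonar image).
    Each keypoint spans the full image width, mirroring the 24x24
    windows spaced 6 pixels apart described in the text.
    """
    sift = cv2.SIFT_create()
    h, w = image.shape
    centres = range(window // 2, h - window // 2 + 1, step)
    keypoints = [cv2.KeyPoint(float(w / 2.0), float(y), float(window)) for y in centres]
    _, descriptors = sift.compute(image, keypoints)
    return np.asarray(descriptors).reshape(-1)   # 128 values per grid point
```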
3.2 Restricted Boltzmann Machines (RBMs)
The RBM [9] is an energy-based, generative model that can learn to represent the distribution of implicit features of the training data and generate examples thereof. An RBM consists of two layers of nodes, forming the visible and hidden layers. Each layer is fully connected to the other, but the model is restricted in that there are no connections between nodes within a layer. The energy of the joint configuration of visible and hidden units given the connections between them (ignoring biases for simplicity) is given by

E(v, h) = -\sum_{i=1}^{V} \sum_{j=1}^{H} v_i h_j w_{ij}    (1)
where v and h are the states of the visible (input) and hidden units, respectively, and w is the connection strengths between each visible and each hidden unit. Stacks of RBMs can be learned in a greedy, layer-wise fashion, with the output of the previous layer providing the input to the next, forming a Deep Belief Network (DBN) [9]. This enables higher-layer nodes to learn progressively more abstract regularities in the input. RBM training is accomplished with the Contrastive Divergence (CD) algorithm [10] which lowers the energy (i.e., raises the probability) of the data observed on the visible units and raises the energy of reconstructions of the data produced by the model:
\Delta w_{ij} \propto \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model}    (2)
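As a concrete illustration, a minimal NumPy sketch of a one-step contrastive divergence (CD-1) estimate of this gradient for a binary RBM; biases are omitted to match Eq. (1)-(2), and the learning rate and sampling details are simplified rather than taken from the authors' training code.

```python
import numpy as np

def cd1_update(W, v0, lr=0.01, rng=None):
    """One contrastive-divergence (CD-1) weight update for a binary RBM.

    W : (V, H) weight matrix;  v0 : (V,) data vector.
    """
    rng = rng or np.random.default_rng(0)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    h0 = sigmoid(v0 @ W)                             # P(h=1 | data)
    h0_sample = (rng.random(h0.shape) < h0).astype(float)
    v1 = sigmoid(h0_sample @ W.T)                    # reconstruction of the data
    h1 = sigmoid(v1 @ W)                             # P(h=1 | reconstruction)

    grad = np.outer(v0, h0) - np.outer(v1, h1)       # <v h>_data - <v h>_model
    return W + lr * grad
```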
Using CD, the RBM learns a generative model of the input in a purely unsupervised fashion by measuring the discrepancy between the data and the model's reconstructions, then "correcting" the system by slightly altering the weights to minimize reconstruction errors. We can also regularize the learned model with a sparse representation by decreasing the weights of nodes whose activity exceeds a prescribed sparsity level, s [11]:

\Delta w_j \propto s - \langle h_j \rangle    (3)
where h j is the expected probability of activation which is computed as a decaying average of the activity of that unit over training examples. This has the added benefit of increasing the weights of nodes whose activity is below the target threshold, thus reintegrating nodes whose random initial conditions lead to them being suppressed by the network (“dead nodes"). While this regularization may lead to greater reconstruction error by forcing the network to represent the input with a smaller proportion of nodes, the resulting hidden representation is likely to be more interpretable by subsequent layers or classifiers. 3.3 Convolutional Restricted Boltzmann Machines (cRBMs) In the cRBM model [12] each hidden node, rather than being fully connected to every input element as in a standard RBM, is connected to only a small, localized region of the image which is defined by the researcher. Furthermore, these connections are shared
by a group of hidden nodes which are collectively connected to every input region. This architecture enables the computationally efficient convolution operation to be used to generate each group's activation. If the region of the input image that each node of the cRBM is connected to is significantly smaller than the total input image, as we expect when the input is high-resolution imagery, then the cRBM requires orders of magnitude fewer parameters for a similar representation size, since weights are shared by all nodes in a group. This is especially useful when patterns recur in different regions of the input, since any knowledge learned about this pattern is automatically transferred to all input regions. By pooling adjacent hidden activations within groups, either with the commonly used maximum pooling or the probabilistic maximum pooling method [12], we can attain a degree of translational invariance while also keeping the size of the hidden representation within reasonable bounds. If maximum pooling is used, then we calculate the probability of activation of each node in a pooling window by applying the logistic function to the feedforward activation. In the probabilistic maximum pooling method, each pooling window is sampled multinomially, so that only one hidden node in a window can be on, and the pooling node is off only if all hidden nodes in its window are off, according to Eqs. (4) and (5):

P(h^k_{i,j} = 1 \mid v) = \frac{\exp(I(h^k_{i,j}))}{1 + \sum_{(i',j') \in B_\alpha} \exp(I(h^k_{i',j'}))}    (4)

P(p^k_\alpha = 0 \mid v) = \frac{1}{1 + \sum_{(i',j') \in B_\alpha} \exp(I(h^k_{i',j'}))}    (5)
where hki, j is a hidden node in pooling window Bα receiving feedforward input I(hki, j ) resulting from the convolution of the kth filter with the input, and pkα is the pooling node for that window. The representation of each group of hidden nodes is then convolved with its filter to get that group’s reconstruction of the input. Summing over all groups’ reconstructions yields the networks reconstruction of the input used for CD learning. For the present experiments we restricted ourselves to a single-layer cRBM. The parameters having the largest impact on classification performance are the filter size and number of filters. Through experimentation, best representations were obtained using 50 filters with width one less than the image width (i.e. 23 × 23). After probabilistic maximum pooling with a 2×2 window size, this filter width results in a 50 filter by 36 height representation, with the width collapsed to 1 (1800 dimensional). That is, the width of the image is collapsed in the cRBM’s representation by the convolution operation and pooling. This can be seen as a compromise between the conventional and convolutional approach, with minimal transfer of knowledge horizontally. The dataset was amenable to this severe reduction in representation width because targets were centered in the image, with the pooling layer providing sufficient invariance to the small differences in position. Based on research by Nair and Hinton using the NORB dataset[9], real-valued images can be used at the visible layer of the RBM if training speed is decreased. Therefore a low learning rate of 0.01 for weights and biases was found to be stable for the real
valued images and was sufficiently large that learning peaked after 50 epochs through the training set of 228 images. The learning rate for sparsity regularization was initialized at the learning rate for weights, and then increased to 10 times this rate linearly over epochs. This enables the network to explore representations early in learning since many nodes are active (and thus learning), and then gradually get driven to the desired sparsity level. The target sparsity giving best results was dependent on the representation size and thus the number and size of filters employed. For the 50, 23 × 23 filter network for which results are reported, a target sparsity of 0.01 in the hidden layer (0.04 in the final pooled representation) yielded best results through experimentation.
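A sketch of the probabilistic max-pooling step of Eqs. (4)-(5) for one filter's feedforward activations, written directly from those equations; the 2 × 2 pooling window follows the text, and everything else (shapes, lack of sampling) is illustrative.

```python
import numpy as np

def prob_max_pool(I, pool=2):
    """Probabilistic max pooling for one filter's feedforward inputs I(h).

    I : 2-D array whose height and width are divisible by `pool`.
    Returns the hidden activation probabilities (Eq. 4) and the
    probability that each pooling unit is ON (one minus Eq. 5).
    """
    H, W = I.shape
    hidden = np.zeros_like(I, dtype=float)
    pool_on = np.zeros((H // pool, W // pool))
    for bi in range(0, H, pool):
        for bj in range(0, W, pool):
            block = np.exp(I[bi:bi + pool, bj:bj + pool])
            denom = 1.0 + block.sum()
            hidden[bi:bi + pool, bj:bj + pool] = block / denom      # Eq. (4)
            pool_on[bi // pool, bj // pool] = 1.0 - 1.0 / denom     # 1 - Eq. (5)
    return hidden, pool_on
```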
4 Results The original images were 466×119 pixels, though some images had missing rows toward the bottom of the image which were detected and filled with the image mean intensity value. Each image was downsampled by a factor of five (to 93x24 pixels) to remove some noise, provide a more computationally tractable representation size, and decrease in-class variation. Image intensities were then normalized to have zero mean and unit standard deviation. Normalization was done per image because the dynamic range varied substantially from one image to the next. Normalizing per pixel across the training set, as is more common, rendered a significant proportion of images undistinguishable from the background due to their low dynamic range. Classification performance was determined via ten-fold cross validation. This method partitions the training set of data into 10 subsets, where one is retained for the validation of the classification model, and nine subsets are used for training. This process is repeated ten times, where each of the subsets is used once as the validation set. 2121 clutter, 10 mine-like rock and 10 of each type of mine was reserved for testing and the model was trained on the remaining data (50 clutter, 37 mine-like rock, 55 cylinders, 59 truncated cones, 27 wedges). The small proportion of available clutter examples used for training was chosen so the total clutter in the training set approximated the mean total of the mine-like objects. Classification was performed with an SVM using the libSVM [13] software library. Grid searches were performed for optimal parameters for both a linear and radial basis function kernels. For both the SIFT and cRBM feature vectors, the linear kernel gave superior results and was robust over a wide range of the SVM kernel cost parameter. 4.1 Convolutional RBM We examined the representation learned by RBMs by the reconstruction of the input and its learned filters. Figure 2 provides a sample of sonar images and their reconstructions by one of the cRBMs trained in the course of cross-validation. The reconstructions are significantly smoothed, but with the smoothing generally respecting object boundaries as in nonlinear diffusion. After learning filters from the training set, the activation probabilities of the convolutional RBM’s hidden units were generated for both the training and test sets and passed to the SVM for training and classification. To show not only the correct classification
Fig. 2. Sample sonar images [7] (top) and reconstructions produced by the convolution RBM model which resulted in the best classification performance (bottom). The reconstructions are significantly smoothed and the highlight of the object somewhat filled in.
performance but also the missed classifications, a confusion matrix (Table 1) was computed for comparison with SIFT and template matching [14]. Out of 300 target views (3 types of targets, 10 of each type of target, 10 crossvalidations), there were 5 false negatives, and out of 21310 views of non-targets, 1035 false positives. This yields a sensitivity to mines of .983 ± .024 showing a high rate of correct target classification, and a specificity of .954 ± .012. While most categories had this high level of classification accuracy, it is interesting to note that a large proportion of wedges (23%) were mis-labelled as truncated cone, due to both the similarity of their appearance in some of the sonar data and the poor representation of wedges in the dataset (37 wedges vs 69 truncated cones). This led to a poor sensitivity for wedges specifically but did not impact the sensitivity to mines in general. 4.2 SIFT As the SIFT method does not require training, the algorithm was applied to each training image, generating a 1664-dimensional feature vector (128 features for each of 13 24 × 24 windows spaced 6 pixels apart along the length of the image). This served as the
Table 1. Confusion matrix for SVM trained on cRBM features

CONFUSION        clutter        cylinder       trunc. cone    wedge          mine-like rock
clutter          0.949±0.013    0.031±0.011    0.006±0.003    0.011±0.003    0.003±0.003
cylinder         0.030±0.048    0.970±0.048    0±0            0±0            0±0
trunc. cone      0.01±0.032     0±0            0.980±0.042    0.010±0.032    0±0
wedge            0±0            0.020±0.042    0.230±0.125    0.740±0.127    0.010±0.032
mine-like rock   0±0            0±0            0.080±0.140    0±0            0.920±0.140

Table 2. Confusion matrix for SVM trained on dense SIFT features

CONFUSION        clutter        cylinder       trunc. cone    wedge          mine-like rock
clutter          0.932±0.010    0.011±0.004    0.019±0.006    0.031±0.009    0.008±0.002
cylinder         0.010±0.032    0.980±0.063    0±0            0.010±0.032    0±0
trunc. cone      0±0            0±0            0.950±0.071    0.030±0.068    0.020±0.042
wedge            0.020±0.042    0.030±0.048    0.460±0.158    0.450±0.135    0.040±0.052
mine-like rock   0±0            0.020±0.063    0.040±0.070    0.010±0.032    0.930±0.082
input for training and testing the SVM. The confusion matrix in Table 2 illustrates the correct and incorrect classifications using the SIFT features. The SIFT features resulted in similar performance to those of the cRBM, but with slightly more false positives. The sensitivity to mines was .970 ± .025, which shows strong classification performance, and the specificity was .944 ± .008. The biggest difference in performance with respect to the cRBM was that there was significantly more confusion between truncated cones and wedges. This is an interesting result, as it shows that the learned features outperform in particular on the cases that are difficult to classify.
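A sketch of how mine-level sensitivity and specificity such as those quoted above can be derived from raw confusion counts by collapsing the classes into a mine / non-mine split; the grouping (cylinder, truncated cone and wedge as mines; clutter and mine-like rock as non-mines) and the bookkeeping are assumptions, since the authors do not spell them out.

```python
import numpy as np

def mine_sensitivity_specificity(counts, labels, mine_classes):
    """counts[i, j] = number of views of true class i predicted as class j."""
    counts = np.asarray(counts, dtype=float)
    mine = np.array([lab in mine_classes for lab in labels])
    tp = counts[np.ix_(mine, mine)].sum()     # mines called some mine class
    fn = counts[np.ix_(mine, ~mine)].sum()    # mines called clutter or rock
    tn = counts[np.ix_(~mine, ~mine)].sum()
    fp = counts[np.ix_(~mine, mine)].sum()
    return tp / (tp + fn), tn / (tn + fp)
```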
5 Discussion Overall the results of this work are encouraging and merit further research into the application of learning methods to sonar imagery and mine classification in particular. Both the SIFT method and the cRBM methods were comparable in performance, with the cRBM performing slightly better than the SIFT feature extraction method. As a basis of comparison, we include below in Table 3 the results from a normalized shadow and echo template-based cross-correlation method [14] which has proven highly effective at classifying targets. These templates are designed for a specific sensor and specific templates are generated for different ranges and therefore are an excellent baseline for learning methods to be compared against. As shown in earlier work [4], the RBM/DBN model can effectively extract features of mine-like targets and classify them using traditional side scan sonar data, however this method showed poor performance using the higher resolution data from the SAS sensor (results not shown). We believe that this is caused by the very large dynamic range in the sensor leading the DBN to learn features of the background (noise) distribution at the expense of modelling the object highlight. Although the increase in
Table 3. Confusion matrix for template-matching method [14]

NSEM          non-mine   cylinder   trunc. cone   wedge
non-mine      0.94       0.01       0.02          0.03
cylinder      0.03       0.97       0             0
trunc. cone   0.04       0          0.96          0
wedge         0.08       0          0             0.92
resolution in the sensor provides a richer set of detailed features of the object being learned, it also has the downside that the learning machines have a tendency to try to model and classify this noise rather than just the object. The cRBM model with enforced sparsity was beneficial in this regard, as the smaller parameter space and sparsity regularized the model and thereby limited the modelling of the background features. To illustrate this, the reconstructions in Figure 2 show a form of smoothing in areas of the image where no target features were present. The filter sizes providing best results spanned the full width of the image (minus 1 in the case of the cRBM due to 2 × 2 pooling). This results in an architecture more similar to a conventional network in the horizontal direction but convolutional in the vertical direction. While there is significant error in the reconstructions (Figure 2), the hidden representation from which they are produced has a relatively small number of filters (50) given the large filter size, as well as sparse activation, which proves more interpretable for the classifier. Using smaller filters has the benefit of being able to model finer features and transfer this learning horizontally, however it results in a significantly larger representation which the classifier had greater difficulty interpreting. In general, cRBM parameterizations which allow more accurate models of the data in the sense of reconstruction error, either by having smaller filters, more filters, or less sparsity, decrease classification performance since they naturally resulted in more complex representations which are more difficult to model with the SVM. In comparison to the template matching method, the two methods examined in this paper showed comparable performance, with the exception of the wedge shapes, where both methods suffered in comparison to the template based method. Note that the template method utilized a significantly larger set of templates for wedges to compensate for the complexity of its shape and ambiguity with the truncated cone. Since the highlight of many wedges and truncated cones was little more than a strip of light in many instances, these two classes were confused, with the SVM opting to classify most as truncated cones due to their greater prevalence in the dataset. However, the cRBM distinguished them significantly better than SIFT. Examining the raw data, it was observed that a subset of wedges had some of the brightest highlights in the dataset. This feature, which may have been an artifact of this particular dataset, was likely captured by the the RBM representation but removed by SIFT in its attempt to create an illumination invariant representation. This would explain the cRBM’s better performance in distinguishing the wedges and truncated cones whose shape representation was very similar in the sonar image. This effect highlights the benefit of using learned filters rather than engineered features from neighbouring application domains, since features which may be uninformative in one domain (in this case, the illumination of a particular feature) may be informative in the other.
6 Outlook
The results from both the cRBM and SIFT models are encouraging, but also highlight the need for further research. Distinguishing wedges from truncated cones, in particular, proved challenging for our models and demands further attention. In general, the detection and classification of objects from sonar imagery could potentially benefit from additional pre-processing or extensions to the two models, as described below. As noted in the data preparation section, the raw SAS data is organized as a set of complex numbers that describe both the amplitude and phase of the reflected sound intensities. For the purposes of this paper, the phase element was removed, and just the raw amplitudes considered. Although this phase element is stripped, it is likely that there are some coherent features in the phase information, which could help distinguish non-image-related features such as material. If such features could be extracted, they could be supplied as additional features for classification, or as a method to limit the false alarms during the detection and classification phases. In the Spatial Pyramid Matching (SPM) method [15], dense SIFT features are extracted as in the method we employed. SIFT feature vectors are subsequently vector quantized, and histograms are then computed at multiple levels of resolution (whole image, quarter image, ...). A histogram intersection (χ2) kernel is then employed to classify the histogram representation. This has been very successful in object recognition tasks such as Caltech 101, used in [15]. Preliminary experiments with this method offered poor performance on this dataset, but more work needs to be done in exploring the many parameters of this model to determine if it can be successfully applied to classification of bottom objects in SAS imagery. Our experiments with stacked layers of cRBMs yielded poor performance on the classification task. However, experimentation was hindered by the large computational burden imposed by convolving many filters with many layers of hidden representation. As we and other groups develop software to transfer the computation of these expensive operations to graphics processing units (GPUs), this architecture will become much easier to explore, and we expect that higher layers will achieve greater invariance to noise and small intra-class variations, as well as uncover more complex regularities in the training data.
Acknowledgements The authors would like to acknowledge the NATO Undersea Research Center (NURC) for the use of the SAS data for this paper.
References
1. Chapple, P.: Automated detection and classification in high-resolution sonar imagery for autonomous underwater vehicle operations. Technical report, Defence Science and Technology Organization (2008)
2. Fawcett, J., Crawford, A., Hopkin, D., Myers, V., Zerr, B.: Computer-aided detection of targets from the CITADEL trial Klein sonar data. Defence Research and Development Canada Atlantic TM 2006-115 (November 2006), pubs.drdc.gc.ca
3. Fawcett, J., Crawford, A., Hopkin, D., Couillard, M., Myers, V., Zerr, B.: Computer-aided classification of the Citadel Trial sidescan sonar images. Defence Research and Development Canada Atlantic TM 2007-162 (2007), pubs.drdc.gc.ca
4. Connors, W., Connor, P., Trappenberg, T.: Detection of mine like objects using restricted boltzmann machines. In: Proceedings of the 23rd Canadian Conference on AI (2007)
5. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (2006)
6. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60, 91–110 (2004)
7. Bellettini, A., Pinto, M.: Design and experimental results of a 300 kHz synthetic aperture sonar optimized for shallow-water operations. IEEE Journal of Oceanic Engineering 34, 285–293 (2008)
8. Myers, V., Fortin, A., Simard, P.: An automated method for change detection in areas of high clutter density using sonar imagery. In: Proceedings of the UAM 2009 Conference, Nafplio, Greece (2009)
9. Nair, V., Hinton, G.E.: Implicit mixtures of restricted boltzmann machines. In: NIPS, pp. 1145–1152 (2008)
10. Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural Comput. 14(8), 1771–1800 (2002)
11. Hinton, G.E.: A practical guide to training restricted boltzmann machines. Technical Report UTML TR 2010-003, University of Toronto (2010)
12. Lee, H., Grosse, R., Ranganath, R., Ng, A.: Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In: Proceedings of the 26th International Conference on Machine Learning (ICML) (2009)
13. Chang, C.C., Lin, C.J.: LIBSVM: A Library for Support Vector Machines (2001), Software available at http://www.csie.ntu.edu.tw/~cjlin.libsvm
14. Myers, V., Fawcett, J.: A template matching procedure for automatic target recognition in synthetic aperture sonar imagery. IEEE Signal Processing Letters 17(7), 683–686 (2010)
15. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2169–2178 (2006)
Learning Probability Distributions over Permutations by Means of Fourier Coefficients Ekhine Irurozki, Borja Calvo, and Jose A. Lozano Intelligent Systems Group, University of the Basque Country, Spain {ekhine.irurozqui,borja.calvo,ja.lozano}@ehu.es
Abstract. An increasing number of data mining domains consider data that can be represented as permutations. Therefore, it is important to devise new methods to learn predictive models over datasets of permutations. However, maintaining probability distributions over the space of permutations is a hard task since there are n! permutations of n elements. The Fourier transform has been successfully generalized to functions over permutations. One of its main advantages in the context of probability distributions is that it compactly summarizes approximations to functions by discarding high order marginals information. In this paper, we present a method to learn a probability distribution that approximates the generating distribution of a given sample of permutations. In particular, this method learns the Fourier domain information representing this probability distribution. Keywords: Probabilistic modeling, learning, permutation, ranking.
1 Introduction
Permutations and orders appear in a wide variety of real-world combinatorial problems such as multi-object tracking, structure learning of Bayesian networks, ranking, etc. Exact probability representation over the space of permutations of n elements is intractable except for very small n, since this space has size n!. However, different simplified models for representing or approximating probability distributions over a set of permutations can be found in the literature [2], [3], [5]. One way to represent probability distributions over permutations is the Fourier-based approach. This is based on a generalization for permutations of the well-known Fourier transform on the real line. Permutations form an algebraic group under the composition operation, also known as the symmetric group, so we will use both expressions, permutations and symmetric group, interchangeably throughout this paper. Although the use of the Fourier transform for representing functions over permutations is not new, this topic has once again come to the attention of researchers, partly due to a framework recently provided by [5] and [7] which allows inference tasks to be carried out entirely in the Fourier domain. Moreover, new concepts such as probability independence [4] have been introduced.
In this paper, we focus on the problem of learning the generating distribution of a given sample of permutations; in particular, we present a method for learning a limited number of Fourier coefficients that best approximate it (i.e., that maximize the likelihood of the sample). The first attempt to learn a probability distribution by means of the Fourier coefficients was presented in [6]. The authors concentrated on obtaining a consensus ranking and a probability distribution under constrained sensing, when the available information is limited to the first order marginals. However, to the best of our knowledge, this work is the first attempt to do it in a general way. The rest of the paper is organized as follows. The next section introduces the basics of the Fourier transform over permutations. In Section 3 we detail how we formulate the maximum likelihood method. Section 4 presents the experimental results of several tests. In Section 5, we conclude the paper.
2 The Fourier Transform on the Symmetric Group
Since it is beyond the scope of this paper to be a proper tutorial on both the Fourier transform (FT) on the symmetric group and on representation theory, we just give some ideas for intuition and refer the interested reader to [1] and [8] for further discussion. Formally, a permutation is defined as a bijection of the set {1,...,n} into itself, which can be written as σ = [σ(1), ..., σ(n)]. The FT on the symmetric group, which is a generalization of the FT over the real line, decomposes a function over permutations into n! real numbers, which are known as Fourier coefficients. These coefficients are grouped in matrices which are in turn ordered by frequency (following the analogy with the FT over the real line), {f̂_{ρ_1}, ..., f̂_{ρ_l}}. The original function can be recovered from the Fourier coefficients by using the inversion theorem, which is stated as follows:

f(\sigma) = \frac{1}{|S_n|} \sum_{\lambda} d_{\rho_\lambda} \mathrm{Tr}\big[\hat{f}_{\rho_\lambda}^{T} \cdot \rho_\lambda(\sigma)\big]    (1)
where ρλ (σ) denote the real valued irreducible representation matrices and dρλ their dimension [8]. In the context of probability distributions, the FT has a very interesting property: Each matrix of Fourier coefficients stores the information corresponding to a particular marginal probability1 . Moreover, the (k − 1)-th marginal probabilities can be obtained by multiplying a matrix consisting of the direct sum of the k lowest frequency matrices of coefficients, M , by some matrices Ck which are precomputed. The size of M for large values of k (i.e., for high order statistics) makes the storage of this matrix prohibitive. However, maintaining such matrix and computing the multiplication is computationally cheap for small values 1
While the first order marginal expresses the probability of item i being at position j, higher order marginals capture information such as the probability of items (i1 , i2 , ..., ik ) being at positions (j1 , j2 , ..., jk ).
of k. Therefore, it is possible to approximate functions by discarding the coefficient matrices at high frequencies. Such approximations smooth the original probability distribution, bringing it closer to the uniform distribution.
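As a concrete example of the lowest-order information involved (see footnote 1), a small sketch that estimates the first-order marginal matrix, the probability of item i being at position j, directly from a sample of permutations; the low-frequency Fourier coefficients essentially summarize this kind of marginal.

```python
import numpy as np

def first_order_marginals(sample):
    """Estimate M with M[i, j] = P(item i+1 is sent to position j+1)
    from a list of permutations, each written as sigma = [sigma(1), ..., sigma(n)]."""
    n = len(sample[0])
    M = np.zeros((n, n))
    for sigma in sample:
        for i, pos in enumerate(sigma):
            M[i, pos - 1] += 1.0
    return M / len(sample)

# e.g. a sample containing every permutation of {1, 2, 3} exactly once
# yields the uniform first-order marginal with all entries equal to 1/3.
```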
3 Learning Probability Distributions over the Fourier Domain
In this section we describe our proposed formulation for learning the Fourier coefficients from a given sample of permutations. Our proposal consists of finding the Fourier coefficients that maximize the likelihood given a sample of permutations. Actually, we are interested in obtaining an approximation which considers only the (k − 1)-th lowest marginal probabilities. In order to learn such an approximation, the Fourier coefficients in the formulation are restricted to those in the k lowest frequency matrices of the FT, {f̂_{ρ_1}, ..., f̂_{ρ_k}}. Maximizing the likelihood of a sample {σ_1, ..., σ_t} given the model in Eq. (1) means solving the following nonlinear optimization problem:

(\hat{f}_{\rho_1}^{mle}, ..., \hat{f}_{\rho_k}^{mle}) = \arg\max_{\hat{f}_{\rho_1}, ..., \hat{f}_{\rho_k}} L(\sigma_1, ..., \sigma_t \mid \hat{f}_{\rho_1}, ..., \hat{f}_{\rho_k}) = \arg\max_{\hat{f}_{\rho_1}, ..., \hat{f}_{\rho_k}} \prod_{i=1}^{t} \frac{1}{|S_n|} \sum_{\lambda=1}^{k} d_{\rho_\lambda} \mathrm{Tr}\big[\hat{f}_{\rho_\lambda}^{T} \cdot \rho_\lambda(\sigma_i)\big]
Unfortunately, not every set of Fourier coefficients leads to a valid probability distribution. Compactly describing the coefficients of a valid distribution is still an open problem [5]. We restrict the search space by adding some constraints that forbid searching in regions of the space where no coefficients representing a valid distribution can be found. We have considered two kinds of constraints. The first kind of constraint ensures a positive probability for each permutation in the sample. The second kind of constraint ensures that the Fourier coefficients take values between the maximum and the minimum values of the irreducible representations that multiply them, that is: min_σ([ρ_λ(σ)]_{ij}) ≤ [f̂_{ρ_λ}]_{ij} ≤ max_σ([ρ_λ(σ)]_{ij}). The Fourier coefficients obtained by maximizing the likelihood subject to these constraints correspond to a distribution whose sum is guaranteed to be 1. However, this does not ensure a valid probability distribution, as it is possible to have negative 'probabilities'. If so, we perform a normalization process. Let m be the minimum probability value associated with a permutation. This process consists of adding the absolute value of m to every value of the probability distribution and normalizing it. Note that if we added a constraint of the first kind for each σ ∈ S_n, the estimated distribution would be valid.
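A minimal sketch of the normalization step just described, applied to the vector of n! values obtained by inverting the learned coefficients; only the post-processing is shown, since computing the values themselves requires the irreducible representation matrices.

```python
import numpy as np

def normalize_estimate(f):
    """Shift and rescale an estimated function over S_n so that all
    values are non-negative and sum to one, as described above."""
    f = np.asarray(f, dtype=float)
    m = f.min()
    if m < 0.0:
        f = f + abs(m)     # add |m|, the magnitude of the most negative value
    return f / f.sum()
```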
4 Experiments
In this section we will show the performance of the proposed formulation. Our aim is to demonstrate that the accuracy of the estimated distributions increases as the sample size grows and higher order marginals are learned. 4.1
Experimental Setup
In order to evaluate our approach on the above described statements, we have designed the following experimental framework. First of all, a probability distribution is randomly generated and this is used to draw several permutation samples. From these samples, the proposed algorithm learns the Fourier coefficients, and the distributions corresponding to these coefficients are calculated. Finally, the Kullback-Leibler divergences between the reference and the resulting estimated distributions are calculated. We also propose a comparison test based on Monte Carlo techniques. The test consists of sampling a large number of random distributions and measuring the Kullback-Leibler divergence between the reference and each of the random distributions. The reference and the random distributions are generated by sampling a Dirichlet distribution. In this way, the generation of each distribution requires n! hyper-parameters α1 , ..., αn! . We have set these hyper-parameters, for every distribution, as α1 = α2 = ... = αn! = α, where α is uniformly drawn from the interval [0.05, 0.25]. It seems reasonable to think that, by learning higher order marginals, it will be possible to more accurately approximate the reference distribution. In order to prove this intuition, the Fourier coefficients corresponding to three different marginals have been learned for each test, that is, the coefficients at the lowest 2, 3 and 4 frequency matrices. The tests have been made over the set of permutations of 6 and 7 elements, S6 and S7 respectively. For each different Sn , three sample sizes have been defined which are 5%, 10% and 25% of n! for S6 and 1%, 5% and 10% of n! for S7 . Also, for each n and sample size, ten different samples are randomly generated and the average results are computed. The number of random distributions for the comparison test is 100,000 and their divergences with the reference distributions are used to draw a histogram. The resulting constrained nonlinear optimization problems have been solved using the fmincon function of MATLAB. 4.2
Results
Figures 1a, 1b and 1c show the results of estimating a sample of 5%, 10% and 25% of n! respectively. Particularly, these figures show the Kullback-Leibler divergence between the reference and the estimated distributions, and the reference and random distributions. The first point to consider in each figure is that the divergence between the reference and the random distributions span a
[Figure 1 consists of six histogram panels, each plotting the number of random distributions against their Kullback-Leibler divergence from the reference distribution, with vertical lines (and a zoomed view) marking the average divergence of the distributions learned with k = 2, 3 and 4: (a) sample size 5%, S6; (b) 10%, S6; (c) 25%, S6; (d) 1%, S7; (e) 5%, S7; (f) 10%, S7.]
Fig. 1. Kullback-Leibler divergence between the reference and estimated distributions, and the reference and the random distributions for S6 and S7
wide interval, the higher concentration being in the first half of the range. However, none of the random distributions is closer to the reference distribution than any of those obtained by learning the Fourier coefficients, which are plotted with a vertical line. Note that each line represents the average divergence of ten distributions obtained from ten different samples of the same size. The estimated distributions are significantly better than any random distribution. The three lines correspond to the distributions obtained by estimating the Fourier coefficients at the lowest 2, 3 and 4 frequency matrices. Since the differences cannot be clearly appreciated in the plots, a zoom over them is done at the top of each figure. In every zoomed figure the line on the right corresponds to the estimation of the lowest order marginals (k = 2) and the line on the left to the estimation of highest order marginals considered (k = 4). This means that as the number of learned Fourier coefficients grows, the resulting distribution gets closer to the reference distribution. Figures 1d, 1e and 1f show a similar performance on the group S7 . Moreover, one can see that, as the number of elements in the set of permutations grows, the divergences between the reference and the random distributions quickly increase, while the divergence of the learned distributions are quite stable.
5 Conclusions and Future Work
In this paper we propose a novel method for learning probability distributions from a set of permutations. The model for representing such distributions is a
Fourier-based approach. We have described a formulation that, by maximizing the likelihood function, learns the Fourier coefficients that best represent the probability distribution of a given sample, considering only the first k marginals. Although our approach can only be used with low values of n, it can be useful when combined with other learning approaches based on independence [4]. With this in mind, in order to learn the generating probability distribution of a given sample it is possible to, first, find the items in each of the independent (or nearly independent) factors, and then learn the distribution of each of the subsets of items by using our proposed formulation. In this way, we will deal with smaller sets of items, making it possible to work with distributions that are otherwise unaffordable.
Acknowledgments This work has been partially supported by the Saiotek and Research Groups 2007-2012 (IT-242-07) programs (Basque Government), TIN2008-06815-C02-01 and TIN2010-14931 MICINN projects and COMBIOMED network in computational biomedicine (Carlos III Health Institute). Ekhine Irurozki holds the grant BES-2009-029143 from the MICINN.
References
1. Diaconis, P.: Group representations in probability and statistics. Institute of Mathematical Statistics (1988)
2. Fligner, M.A., Verducci, J.S.: Distance based ranking models. Journal of the Royal Statistical Society 48(3), 359–369 (1986)
3. Helmbold, D.P., Warmuth, M.K.: Learning permutations with exponential weights. Journal of Machine Learning Research (JMLR) 10, 1705–1736 (2009)
4. Huang, J., Guestrin, C.: Learning hierarchical riffle independent groupings from rankings. In: International Conference on Machine Learning (ICML 2010), Haifa, Israel (June 2010)
5. Huang, J., Guestrin, C., Guibas, L.: Fourier theoretic probabilistic inference over permutations. Journal of Machine Learning Research (JMLR) 10, 997–1070 (2009)
6. Jagabathula, S., Shah, D.: Inferring rankings under constrained sensing. In: Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, pp. 753–760 (2008)
7. Kondor, R., Howard, A., Jebara, T.: Multi-object tracking with representations of the symmetric group. In: Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, San Juan, Puerto Rico (March 2007)
8. Serre, J.P.: Linear Representations of Finite Groups. Springer, Heidelberg (1977)
Correcting Different Types of Errors in Texts Aminul Islam and Diana Inkpen University of Ottawa, Ottawa, Canada {mdislam,diana}@site.uottawa.ca
Abstract. This paper proposes an unsupervised approach that automatically detects and corrects a text containing multiple errors of both syntactic and semantic nature. The number of errors that can be corrected is equal to the number of correct words in the text. Error types include, but are not limited to: spelling errors, real-word spelling errors, typographical errors, unwanted words, missing words, prepositional errors, punctuation errors, and many of the grammatical errors (e.g., errors in agreement and verb formation). Keywords: Text Error Correction, Detection, Unsupervised, Google Web 1T 5-grams.
1 Introduction
Most approaches to text correction are for only one or at best for a few types of errors. To the best of our knowledge, there is no fully-unsupervised approach that corrects a text having multiple errors of both syntactic and semantic nature. Syntactic errors refer to all kinds of grammatical errors. For example, in the sentence, “Our method correct real-word spelling errors.”, there is an error of syntactic nature in subject-verb agreement, whereas, in the sentence, “She had a cup of powerful tea.”, the word ‘strong’ is more appropriate than the word ‘powerful’ in order to convey the proper intended meaning of the sentence, based on the context. The latter is an example of a semantic error. In this paper, a more general unsupervised statistical method for automatic text error detection and correction, done in the same time, using the Google Web 1T 5-gram data set [1] is presented. The proposed approach uses the three basic text correction operations: insert, delete, and replace. We use the following three strict assumptions for the input text that needs to be corrected: (1) The first token is a word1 . (2) There should be at least three words in an input text. (3) There might be at most one error in between two words. We also assume that there might be at most one error after the last word. We also use the following weak assumption: (4) We try to preserve the intended semantic meaning of the input text as much as possible. 1
Whenever we use only the term ‘word’ without an adjective (e.g., correct or incorrect), we imply a correct word.
2 Related Work
Some approaches consider spelling correction as text correction. An initial approach to automatic acquisition for context-based spelling correction was a statistical language-modeling approach using word and part-of-speech (POS) n-grams [2–5]. Some approaches in this paradigm use Bayesian classifiers and decision lists [6–8]. Other approaches simply focus on detecting sentences that contain errors, or computing a score that reflects the quality of the text [9–14]. In other text correction approaches, the prediction is typically framed as a classification task for a specific linguistic class, e.g., prepositions, near-synonym choices, or a set of predefined classes [15, 16]. In some approaches, a full syntactic analysis of the sentence is done to detect errors and propose corrections. We categorize this paradigm into two groups: those that constrain the rules of the grammar [17, 18], and those that use error-production rules [19–22]. [23] presents the use of a phrasal Statistical Machine Translation (SMT) techniques to identify and correct writing errors made by ESL (English as a Second Language) learners. The work that is closely related to ours is that of Lee’s [24], a supervised method built on the basic approach of template-matching on parse trees. To improve recall, the author uses the observed tree patterns for a set of verb form usages, and to improve precision, he utilizes n-grams as filters. [25] trains a maximum entropy model using lexical and POS features to recognize a variety of errors. Their evaluation data partially overlaps with that of [24] and our paper.
3 Proposed Method
Our proposed method determines some probable candidates and then sorts those candidates. We consider three similarity functions and one frequency value function in our method. One of the similarity functions, namely the string similarity function, is used to determine the candidate texts. The frequency value function and all the other similarity functions are used to sort the candidate texts.
3.1 Similarity and Frequency Value Functions
Similarity between Two Strings. We use the same string similarity measure used in [26], with the following different normalization from [27]:
v1 = 2 × len(LCS(s1, s2)) / (len(s1) + len(s2)),   v2 = 2 × len(MCLCS1(s1, s2)) / (len(s1) + len(s2)),
v3 = 2 × len(MCLCSn(s1, s2)) / (len(s1) + len(s2)),   v4 = 2 × len(MCLCSz(s1, s2)) / (len(s1) + len(s2)).
The similarity of the two strings, S1 ∈ [0, 1], is:
S1(s1, s2) = α1 v1 + α2 v2 + α3 v3 + α4 v4   (1)
Here, len calculates the length of a string; LCS, MCLCS1, MCLCSn, and MCLCSz calculate the Longest Common Subsequence and the Maximal Consecutive LCS starting
at character 1, starting at character n, and ending at the last character between two strings, respectively. α1 , α2 , α3 , α4 are weights and α1+α2+α3+α4 = 1. We heuristically set equal weights for most of our experiments2 . Common Word Similarity between Texts. If two texts have some words in common, we can measure their similarity based on the common words. We count the number of words in common between the text to correct and a candidate corrected text, normalizing the count by the size of both texts. Let us consider a pair of texts, T1 and T2 that have m and n tokens, with δ tokens in common. Thus, the common word similarity, S2 ∈[0, 1] is: S2 (T1 , T2 ) = 2δ/(m + n)
(2)
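For concreteness, the string similarity of Eq. (1) and the common-word similarity of Eq. (2) can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation; the equal α weights follow the paper, while the helper names and the multiset reading of "tokens in common" are our assumptions.

    # Sketch of the string similarity S1 (Eq. 1) and common-word similarity S2 (Eq. 2).
    from collections import Counter

    def lcs_len(s1: str, s2: str) -> int:
        """Length of the Longest Common Subsequence (LCS) of two strings."""
        dp = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
        for i, a in enumerate(s1, 1):
            for j, b in enumerate(s2, 1):
                dp[i][j] = dp[i - 1][j - 1] + 1 if a == b else max(dp[i - 1][j], dp[i][j - 1])
        return dp[-1][-1]

    def mclcs1(s1: str, s2: str) -> int:
        """Maximal consecutive LCS starting at character 1 (longest common prefix)."""
        n = 0
        for a, b in zip(s1, s2):
            if a != b:
                break
            n += 1
        return n

    def mclcsz(s1: str, s2: str) -> int:
        """Maximal consecutive LCS ending at the last character (longest common suffix)."""
        return mclcs1(s1[::-1], s2[::-1])

    def mclcsn(s1: str, s2: str) -> int:
        """Maximal consecutive LCS starting at any character n (longest common substring)."""
        best = 0
        dp = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
        for i, a in enumerate(s1, 1):
            for j, b in enumerate(s2, 1):
                if a == b:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                    best = max(best, dp[i][j])
        return best

    def string_similarity(s1: str, s2: str, alphas=(0.25, 0.25, 0.25, 0.25)) -> float:
        """S1 of Eq. (1): weighted sum of the four normalized LCS measures."""
        norm = len(s1) + len(s2)
        if norm == 0:
            return 1.0
        vs = [2 * f(s1, s2) / norm for f in (lcs_len, mclcs1, mclcsn, mclcsz)]
        return sum(a * v for a, v in zip(alphas, vs))

    def common_word_similarity(t1: list, t2: list) -> float:
        """S2 of Eq. (2): 2 * delta / (m + n), delta = tokens shared (multiset overlap)."""
        delta = sum((Counter(t1) & Counter(t2)).values())
        return 2 * delta / (len(t1) + len(t2))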
Non-Common Word Similarity. If the two texts have some non-common words, we can measure how similar the two texts are based on their non-common words. If there are δ tokens in T1 that exactly match with T2 , then there are m−δ and n−δ non-common words in texts T1 and T2 , respectively, assuming that T1 and T2 have m and n tokens, respectively, and n ≥ m. We remove all the δ common tokens from both T1 and T2 . We construct a (m − δ) × (n − δ) string similarity matrix using Equation 1 and find out the maximum-valued matrix element. We add this matrix element to a list (say, ρ). We remove all the matrix elements which are in the row and column of the maximum-valued matrix element, from the original matrix. We remove the row and column, in order to remove the pair with maximum similarity. This makes the computation manageable: in the next steps, fewer words are left for matching. We repeat these steps until either the current maximum-valued matrix element is 0, or m−δ−|ρ| = 0, or both. We sum up all the elements in ρ and divide by n − δ to get the non-common word similarity, S3 ∈[0, 1): |ρ| S3 (T1 , T2 ) = i=1 ρi /(n − δ) (3) Normalized Frequency Value. We determine the normalized frequency value of a candidate text (how we determine candidate texts is discussed in detail in Section 3.2) with respect to all other candidate texts. A candidate text having higher normalized frequency value is more likely a strong candidate for the correction, though not always. Let us consider, we have n ˜ candidate texts for the input text T : {T1 , T2 , · · · Ti · · · , Tn˜ } T1 = {w11 , w12 , · · · w1j · · · w(1)(m1 ) } T2 = {w21 , w22 , · · · w2j · · · w(2)(m2 ) } ························ Ti = {wi1 , wi2 , · · · wij · · · w(i)(mi ) } ························ Tn˜ = {wn˜ 1 , wn˜ 2 , · · · wn˜ j · · · w(˜n)(mn˜ ) } Here, wij is the jth token of the candidate text, Ti , and mi means that the candidate text Ti has mi tokens. It is important to note that the number of tokens 2
We use equal weights in several places in this paper in order to keep the system unsupervised. If development data would be available, we could adjust the weights.
each candidate text has may be different from the rest. The number of 5-grams in any candidate text, Ti, is mi − 4. Again, let us consider that Fi is the set of frequencies of all the 5-grams that Ti has; fij is the frequency of the jth 5-gram of the candidate text Ti. That is:
F1 = {f11, f12, · · · , f1j, · · · , f(1)(m1−4)}
F2 = {f21, f22, · · · , f2j, · · · , f(2)(m2−4)}
· · · · · · · · · · · ·
Fi = {fi1, fi2, · · · , fij, · · · , f(i)(mi−4)}
· · · · · · · · · · · ·
Fñ = {fñ1, fñ2, · · · , fñj, · · · , f(ñ)(mñ−4)}
Here, {f11, f21, · · · , fi1, · · · , fñ1}, {f12, f22, · · · , fi2, · · · , fñ2}, {f1j, f2j, · · · , fij, · · · , fñj} and {f(1)(mi−4), f(2)(mi−4), · · · , f(i)(mi−4), · · · , f(ñ)(mi−4)} are the sets of 5-gram frequencies for all ñ candidate texts that are processed in the first step3, the second step, the jth step, and the (mi − 4)th step, respectively. We calculate the normalized frequency value of a candidate text as the summation of all the 5-gram frequencies of the candidate text over the summation of the maximum frequencies in each step that the candidate text may have. Thus the normalized frequency value of Ti, represented as S4 ∈ [0, 1], is:
S4(Ti) = Σ_{j=1}^{mi−4} fij / Σ_{l=1}^{mi−4} max_{k∈N} fkl   (4)
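The greedy matching behind Eq. (3) and the normalized frequency value of Eq. (4) can be sketched similarly. Here difflib's SequenceMatcher ratio stands in for the string similarity of Eq. (1), and all function and variable names are ours; this is an illustrative sketch, not the authors' code.

    # Sketch of the non-common-word similarity S3 (Eq. 3) and the normalized
    # frequency value S4 (Eq. 4).
    from difflib import SequenceMatcher

    def s3_non_common(t1, t2, sim=lambda a, b: SequenceMatcher(None, a, b).ratio()):
        """Greedy matching of non-common words by decreasing string similarity."""
        if len(t1) > len(t2):                      # ensure n >= m as in the paper
            t1, t2 = t2, t1
        shared = set(t1) & set(t2)
        r1 = [w for w in t1 if w not in shared]    # m - delta words
        r2 = [w for w in t2 if w not in shared]    # n - delta words
        if not r2:
            return 0.0
        matrix = [[sim(a, b) for b in r2] for a in r1]
        rho = []
        while matrix and any(any(row) for row in matrix):
            # pick the maximum-valued element, then drop its row and column
            i, j = max(((i, j) for i in range(len(matrix)) for j in range(len(matrix[0]))),
                       key=lambda ij: matrix[ij[0]][ij[1]])
            if matrix[i][j] == 0:
                break
            rho.append(matrix[i][j])
            matrix.pop(i)
            matrix = [row[:j] + row[j + 1:] for row in matrix]
        return sum(rho) / len(r2)

    def s4_normalized_frequency(freqs_per_candidate):
        """freqs_per_candidate[i][j] = frequency of the j-th 5-gram of candidate i.
        Returns S4 for every candidate: its own frequency mass over the per-step maxima."""
        n_steps = max(len(f) for f in freqs_per_candidate)
        step_max = [max((f[j] for f in freqs_per_candidate if j < len(f)), default=0)
                    for j in range(n_steps)]
        return [sum(f) / sum(step_max[:len(f)]) if sum(step_max[:len(f)]) else 0.0
                for f in freqs_per_candidate]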
3.2 Determining Candidate Texts
Let us consider an input text, that after tokenization has m tokens, i.e., T = {w1 , w2 · · · , wm }. Our approach consists in going from left to right according to a set of rules that are listed in Table 1 and Table 2. We use three basic operations, Insert, Replace and Delete to list these 5-gram rules. We also use No Operation to mean that we do not use any operation, rather we directly use the next token from T to list a 5-gram rule. 5-gram Rules Used in Step 1. Table 1 lists all possible 5-gram rules generated from the said operations and assumptions. We use each of these 5-gram rules to generate a set of 5-grams and their frequencies by trying to match the 5-gram rule with the Web 1T 5-grams. We take the decision of how many candidate 5-grams generated from each 5-gram rule we keep for further processing (say, n ¯ ). The 5-gram Rule #1 in Table 1 says that we take the first five tokens from T to generate a 5-gram and try to match with the Web 1T 5-grams to generate the only candidate 5-gram and its frequency, if there is any matching. In 5-gram Rule #2, we take the first four tokens from T and try to insert each word from a list of words (our goal here is to determine this list of words; it might be empty) 3
By the first step, we mean the step when we process the first possible 5-grams in the input text. Similarly, by the second step, we mean the step when we process the next possible 5-grams (by removing the first token from the 5-grams used in first step and adding an extra word from the input text or other way, which is discussed in detail in Section 3.2) in the input text, and so on.
in between w1 and w2 to generate a list of 5-grams and try to match with the Web 1T 5-grams to generate a set of 5-grams and their frequencies. We sort these 5-grams in descending order by their frequencies and only keep at most the top n ¯ 5-grams and their frequencies. All I’s and R’s in Table 1 and Table 2 function similar to variables and all wi ∈ T function similar to constants. The 5-gram Rule #9 can generate a list of 5-grams and their frequencies, based on all the possible values of R2 , a set of all replaceable words of w2 . We determine the string similarity between w2 and each member of R2 using (1) and sort the list in descending order by string similarity values and only keep at most n ¯ 5-grams. Table 1. List of all possible 5-gram rules in step 1 Rule# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
5-gram Rule w1 w2 w3 w4 w5 w1 I1 w2 w3 w4 w1 w2 I1 w3 w4 w1 w2 w3 I1 w4 w1 w2 w3 w4 I1 w1 I1 w2 I2 w3 w1 w2 I1 w3 I2 w1 I1 w2 w3 I2 w1 R2 w3 w4 w5 w1 w2 R3 w4 w5 w1 w2 w3 R4 w5 w1 w2 w3 w4 R5 w1 R2 w3 R4 w5 w1 w2 R3 w4 R5 w1 R2 w3 w4 R5 w1 w3 w4 w5 w6 w1 w2 w4 w5 w6 w1 w2 w3 w5 w6 w1 w2 w3 w4 w6 w1 w3 w5 w6 w7 w1 w2 w4 w6 w7 w1 w3 w4 w6 w7 w1 w2 w4 w5 w7 w1 w2 w3 w5 w7 w1 w3 w4 w5 w7
Generated from Rule# 5-gram Rule No Operation 26 w1 w3 I1 w4 w5 27 w1 w3 w4 I1 w5 28 w1 w2 w4 I1 w5 Single Insert 29 w1 w2 w4 w5 I1 30 w1 w3 w4 w5 I1 31 w1 w2 w3 w5 I1 Double Insert 32 w1 w3 w4 w5 R6 33 w1 w2 w3 w5 R6 34 w1 w3 R4 w5 w6 35 w1 w2 w4 R5 w6 Single Replace 36 w1 w3 w4 R5 w6 37 w1 w2 w4 w5 R6 38 w1 R2 w3 w5 w6 Double Replace 39 w1 w2 R3 w4 w6 40 w1 R2 w3 w4 w6 41 w1 I1 w2 R3 w4 42 w1 w2 I1 w3 R4 Single Delete 43 w1 I1 w2 w3 R4 44 w1 R2 w3 I1 w4 45 w1 w2 R3 w4 I1 46 w1 R2 w3 w4 I1 47 w1 I1 w2 w4 w5 Double Delete 48 w1 w2 I1 w3 w5 49 w1 I1 w2 w3 w5
Generated from Single Delete + Single Insert
Single Delete + Single Replace Single Replace + Single Delete Single Single Single Single
Insert + Replace Replace + Insert
Single Insert + Single Delete
Limits for the Number of Steps, ntp . We figure out what maximum and minimum number of steps we need for an input text. Taking the second assumption into consideration, it is obvious that if the value of m is 3 (the number of words would also be 3) then only rules # 6, 7 and 8 can be used to generate 5-grams. 5-grams generated from rule # 7 and 8 can not be used in the next step as after the last word (w3 ); we might have at most one error and all the 5-grams, if any, generated using these rules have this error (i.e., I2 ). 5-grams generated from rules # 6 can be used in the next step (by rule # 2 in Table 2) to test whether we can insert a word in the next step, provided that the previous step generates at least one 5-gram. Thus, if m = 3 we might need at most 2 steps. Now, if m = 4, then for the added word (i.e., w4 ) we need two extra steps to test rules # 5 and 2, in order, on top of the previous two steps (for the first three words), provided that each previous step generates at least one 5-gram.
Correcting Different Types of Errors in Texts
197
That is, each extra token in T needs at most two extra steps. We generalize the maximum number of steps needed for an input text having m tokens as: Max ntp = 2+(m−3)×2 = 2m−4
(5)
Again, the minimum number of steps is ensured if rules # 6 to 8 in step 1 do not generate any 5-gram. This means that, if m = 3, we might need at least 0 steps4 . Now, if m = 4 then for the added word (i.e., w4 ) we need only an extra step to test rule # 5 on top of the previous single step (for the first three tokens). That is, each extra token in T needs at least one extra step, provided that each previous step for each extra token generates at least one 5-gram.5 We generalize the minimum number of steps needed for an input text having m tokens as: Min ntp = m−3 (6) In (5), the maximum number of steps, 2m − 4, also means that the maximum number of tokens possible in a candidate text is 2m. Thus, an input text having 2m tokens can have at most m errors to be handled and m correct words, assuming m ≥ 3 (the second assumption on page 192). Table 2. List of All Possible 5-gram Rules in Step 2 to Step 2m − 4 Rule# 5-gram Rule Generated from Case Number 1 − − − wi wi+1 No Operation 2 − − − wi Ij Single Insert 1: if the last word 3 − − − wi wi+2 Single Delete in step 1 is in T 4 − − − wi Ri+1 Single Replace 5 − − wi Ij wi+1 No Operation 2: if the second last word in step 1 is in 6 − − wi Ri+1 wi+2 No operation T and the last word is either an inserted or a replaced word
5-gram Rules used in Step 2 to 2m −4. Table 2 lists all possible 5-gram rules generated from the said operations and assumptions for step 2 to step 2m−4. We use step 2 (i.e., the next step) only if step 1 (i.e., the previous step) generates at least one 5-gram from 5-gram rules listed in Table 1. Similarly, we use step 3 (i.e., the next step) only if step 2 (i.e., the previous step) generates at least one 5-gram from the 5-gram rules listed in Table 2, and so on. In Table 2, ‘−’ means that it might be any word that is in T , or an inserted word (an instance of I’s), or a replaced word (an instance of R’s) in the previous step. To give a specific example of how we list the 5-gram rules in Table 2, consider that rule #2 (w1 I1 w2 w3 w4 ) in Table 1 generates at least one 5-gram in step 1. We take the last four words of this 5-gram (i.e., I1 w2 w3 w4 ) and add the next word from T (in this case w5 ), in order to form a new rule in step 2 (which is I1 w2 w3 w4 w5 ). The general form of this rule (− − − wi wi+1 ) is listed as rule #1 in Table 2. In step 1, I1 in rule #2 acts like a variable, but in step 2 4
5
We call a step successful if it generates at least one 5-gram. Thus, if we try to generate some 5-grams in step 1 and if we fail to generate any, then the number of step, ntp is 0, though we do some processing for step 1. If we omit the assumption that each previous step for each extra token generates at least one 5-gram, then to determine the Min ntp is very straight forward, it is 0.
we use only a single instance of I1 , which acts like a constant. We categorize all the 5-grams generated in step 1 (i.e., the previous step) into two different cases. Case 1 groups each 5-gram in step 1 having its last word in T . Case 2 groups each 5-gram in step 1 having its second last word in T , and the last word not in T . We stop when we fail to generate any 5-gram in the next step from all the 5-gram rules of the previous step. Determining the Limit of Candidate Texts. There might be a case when no 5-gram is generated in step 1; this means that the minimum n ˜ possible is 0. Table 1 shows that there are 11 5-gram rules (rules without any I’s or R’s) in step 1 that generate at most one 5-gram per 5-gram rule. It turns out that the remaining 5-gram rules can generate at most n ¯ 5-grams per 5-gram rule. Thus, the maximum number of candidate texts, n ˜ , that can be generated having only a single step (i.e., ntp = 1) is: Max n ˜ =(no. of 5-gram rules in step 1−no. of 5-gram rules in step 1 without any I’s or R’s) × n ¯ + no. of 5-gram rules in step 1 without any I’s or R’s (7) =(49 − 11) × n ¯ + 11 = 38¯ n + 11
(8)
At most 2¯ n + 2 5-grams (rules #1 to 4 in Table 2) can be generated in step 2 from a single 5-gram generated in step 1 having the last word in T . There may be at most 33 such 5-grams in step 1. At most 1 5-gram (rules #5 and 6 in Table 2) can be generated in step 2 from a single 5-gram generated in step 1 having the second last word in T and the last word being either an inserted or a replaced word. There may be at most 16 such 5-grams in step 1. The maximum number of candidate texts, n ˜ , that can be generated having two steps (i.e., ntp = 2) is: Max n ˜ = 33(2¯ n + 2) + 16 × 1 We generalize Max n ˜ for different values of ntp as: ⎧ 38¯ n+11 ⎪ ⎪ ⎪ ⎪ ⎨ 33×20(2¯ n+2)+16 Max n ˜≈ 1 33×2 (2¯ n+2)+66×20n ¯ ⎪ ⎪ ⎪ · · · · · · · · · ··························· ⎪ ⎩ ntp −2 (2¯ n+2)+66×2ntp−3 n ¯ 33×2
(9) if step = 1 if step = 2 if step = 3
(10)
if step = ntp
Simplifying (10):
Max ñ ≈ 38n̄ + 11 if step = 1;   66n̄ + 82 if step = 2;   2^(ntp−3) (198n̄ + 132) if step ≥ 3
(11)
Theoretically, Max n ˜ seems to be a large number, but practically n ˜ is much smaller than Max n ˜ . This is because not all theoretically possible 5-grams are in the Web 1T 5-grams data set, and because fewer 5-grams generated in any step have an effect in all the subsequent steps. Forming Candidate Texts. Algorithm 1 describes how a list of candidate texts can be formed from the list of 5-grams in each step. That is, the output of
Algorithm 1 is {T1, T2, · · · , Ti, · · · , Tñ}. The algorithm works as follows: taking the last four words of each 5-gram in step 1, it tries to match them with the first four words of each 5-gram in step 2. If it matches, then concatenating the last word of the matched 5-gram in step 2 with the matched 5-gram in step 1 generates a temporary candidate text for further processing. If a 5-gram in step 1 does not match with at least a single 5-gram in step 2, then the 5-gram in step 1 is a candidate text. One 5-gram in step 1 can match with several 5-grams in step 2, thus generating several temporary candidate texts. We continue this process until we cover all the steps.

Algorithm 1. Forming candidate texts
  input : ntp, list of 5-grams in each step
  output: candidate list
  candidate list ← NULL
  for each 5-gram of step 1 do
      k ← 1
      candidate text[k] ← 5-gram of step 1
      for i ← 2 to ntp do
          j ← 1
          for each k do
              for each 5-gram of step i do
                  next 5-gram ← 5-gram of step i
                  temp candidate text[j] ← candidate text[k]
                  str1 ← last four words of temp candidate text[j]
                  str2 ← first four words of next 5-gram
                  if str1 = str2 then
                      temp candidate text[j] ← temp candidate text[j] . last word of next 5-gram   /* '.' is to concatenate */
                  end
                  increment j
              end
          end
          decrement j
          for each j do
              candidate text[j] ← temp candidate text[j]
          end
          k ← j
      end
      for each k do
          candidate list ← candidate list + candidate text[k]
      end
  end
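A compact Python rendering of Algorithm 1 might look as follows. It follows the prose description above (a partial candidate that finds no match in a later step is kept as it is); the function name and the representation of the per-step 5-gram lists are our assumptions.

    # Grow candidate texts step by step by overlapping the last four words of a
    # partial candidate with the first four words of a 5-gram from the next step.
    def form_candidate_texts(steps):
        """steps[i] is the list of 5-grams (tuples of 5 tokens) kept at step i+1."""
        candidates = []
        if not steps:
            return candidates
        for first in steps[0]:
            partials = [list(first)]
            for step in steps[1:]:
                grown = []
                for cand in partials:
                    for gram in step:
                        if tuple(cand[-4:]) == tuple(gram[:4]):
                            grown.append(cand + [gram[4]])
                if grown:              # if nothing matched, the partial stays a candidate
                    partials = grown
            candidates.extend(partials)
        return candidates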
3.3 Sorting Candidate Texts
It turns out from 3.2 that, if the input text is T , then the total n ˜ candidate texts are {T1 , T2 , · · · Ti · · · , Tn˜ }. We determine the correctness value, S for each candidate text using (12), a weighted sum of (2), (3) and (4), and then we sort in descending order by the correctness values. In (12), it is obvious that β1 + β2 + β3 = 1 to have S ∈ (0, 1]. S(Ti )=β1 S2 (Ti , T )+β2S3 (Ti , T )+β3S4 (Ti )
(12)
By trying to preserve the semantic meaning of the input text as much as possible, we intentionally keep the candidate texts and the input text as close (both semantically and syntactically) as possible. Thus, we set more weight on S2 and S3 . Though we set low weight on S4 , it is one of the most crucial parts of the method, that helps to identify and correct the error. If we only rely on the normalized frequency value of each candidate text, then we have to deal with an increasing number of false positives: the method detects an input text as incorrect, while, in reality, it is not. On the contrary, if we only rely on the similarity of common words, non-common words, and so on, between input text and each candidate text, then we have to deal with an increasing number of false negatives: the method detects an input text as correct, while, in reality, it is not.
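A sketch of the resulting ranking step is given below. The β values shown are purely illustrative; the paper only requires β1 + β2 + β3 = 1, with more weight placed on S2 and S3 than on S4.

    # Sketch of the candidate ranking of Eq. (12); beta weights are illustrative.
    def rank_candidates(input_text, candidates, s2, s3, s4_values, betas=(0.4, 0.4, 0.2)):
        """candidates: list of token lists; s2, s3: similarity functions of (T_i, T);
        s4_values[i]: normalized frequency value of candidate i."""
        b1, b2, b3 = betas
        scored = [(b1 * s2(c, input_text) + b2 * s3(c, input_text) + b3 * s4_values[i], c)
                  for i, c in enumerate(candidates)]
        return sorted(scored, key=lambda x: x[0], reverse=True)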
4 Evaluation and Experimental Results

4.1 Evaluation on WSJ Corpus
Because of the lack of a publicly available data set having multiple errors in short texts, we generate a new evaluation data set, utilizing the 1987-89 Wall Street Journal corpus. It is assumed that this data contains no errors. We select 34 short texts from this corpus and artificially introduce some errors, so that it requires to perform some combinations of insert, and/or delete, and/or replace operation to get back to the correct texts. To generate the incorrect texts, we artificially insert prepositions and articles, delete articles, prepositions, auxiliary verbs, and replace prepositions with other prepositions, singular nouns with plural nouns (e.g., spokesman with spokesmen), articles with other articles, real words with real-word spelling errors (e.g., year with tear ), real words with spelling errors (e.g., there with ther ). To generate real-word spelling errors (which are in fact semantic errors) and spelling errors, we use the same procedure as [28]. The average number of tokens in a correct text and an incorrect text are 7.44 and 6.32, respectively. The average number of corrections required per text is 1.76. We keep some texts without inserting any error, to test the robustness of the system (we got only a single false positive). This decreases the number of errors per text. The performance is measured using Recall (R), Precision (P), F1 and Accuracy (Acc). We asked two human judges, both native speakers of English and graduate students in Natural Language Processing, to correct those 34 texts. The agreement between the two judges is low (the detection agreement is 53.85% and the correction agreement is 50.77%), which means the task is difficult even for human experts. Table 3 shows two examples of test texts. The results in Table 4 show that our method gives comparable recall value for both detection and correction, whereas human judges give better precision value for both detection and correction. Since a majority of the words in the evaluation data set are correct, the baseline is to propose no correction, achieving 76.28% accuracy. Taking this baseline accuracy as a lower limit and the accuracy achieved by the human judges as an upper limit, we conclude that the automatic method realizes about half of the possible improvement between the baseline and the human expert upper bound (76%-84%-92%, respectively).
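The detection and correction scores reported below are standard precision/recall/F1 computations over such counts; the following sketch shows the arithmetic with placeholder counts, not the paper's.

    # Standard P/R/F1 from true positives, false positives and false negatives.
    def prf(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    # e.g. detection: tp = errors flagged at the right position;
    #      correction: tp = errors replaced by the right word.
    print(prf(tp=54, fp=18, fn=6))   # placeholder counts only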
Table 3. Some examples

             Example 1                               Example 2
Incorrect    All funding decisions is made the       What his believed to the next
Correct      All funding decisions are made by the   What is believed to be the next
Judge 1      All funding decisions is made by the    What is believed to be next
Judge 2      All funding decisions are made the      What he believed to be the next
Our Method   All funding decisions are made by the   What is believed to be the next
Table 4. Results on the WSJ corpus

             Detection               Correction
             R      P      F1        R      P      F1        Acc.
Our Method   90.0   75.00  81.82     78.33  65.28  71.21     84.98
Judge 1      65.0   88.64  75.00     58.33  79.54  67.31     86.56
Judge 2      90.0   93.10  91.53     83.33  86.21  84.75     92.89

4.2 Evaluation on JLE Corpus
We also evaluate the proposed method using the NICT JLE corpus [22], to directly compare with [24]. The JLE corpus has 15,637 sentences with annotated grammatical errors and their corrections. We generated a test set of 477 sentences for subject-verb (S-V) agreement errors, and another test set of 238 sentences for auxiliary agreement and complementation (AAC) errors by retaining the verb form errors, but correcting all other error types. [24] generated the same number of sentences of each category. [24] used the majority baseline, which is to propose no correction, since the vast majority of verbs were in their correct forms. Thus, [24] achieved a majority baseline of 96.95% for S-V agreement and 98.47% for AAC. Based on these numbers, it can be determined that [24] had only 14 or 15 errors in the S-V agreement data set and 3 or 4 errors in the AAC data set. Our data set has a majority baseline of 80.5% for S-V agreement and 79.8% for AAC. It means that we have 93 errors in the S-V agreement data set and 48 errors in the AAC data set. The small number of errors in their data set is the reason why they get high accuracy even when they have moderate precision and recall. For example, if their method fails to correct 2 errors out of the 3 errors in the S-V agreement data set (i.e., if true positive is 1 and false positives are 2), then their recall would be 33.3%, even then their accuracy would be 99.16%. Table 5 shows that our method generates consistent precision, recall, and accuracy.

Table 5. Results on the JLE corpus. '—' means that the result is not mentioned in [24].

             Detection               Correction
             R      P      F1        R      P      F1        Acc.
Lee (S-V)    —      83.93  —         80.92  81.61  —         98.93
Lee (AAC)    —      80.67  —         42.86  68.0   —         98.94
Our (S-V)    98.92  96.84  97.87     97.85  95.79  96.81     98.74
Our (AAC)    97.92  94.0   95.92     95.83  92.0   93.88     97.48
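The error counts implied by these majority baselines, and the accuracy-versus-recall point made above, can be checked with a few lines; the sentence counts and baselines are the ones quoted above, and the "two missed errors" scenario mirrors the example.

    # Back-of-the-envelope check of the majority-baseline discussion.
    for name, n, baseline in [("Lee S-V", 477, 0.9695), ("Lee AAC", 238, 0.9847),
                              ("Our S-V", 477, 0.805), ("Our AAC", 238, 0.798)]:
        print(name, round(n * (1 - baseline)), "errors")
    # Leaving 2 errors uncorrected in a 238-sentence set still gives (238 - 2) / 238 ≈ 99.16% accuracy.
    print((238 - 2) / 238)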
5 Conclusion
The proposed unsupervised text correction approach can correct one error, which might be syntactic or semantic, for every word in a text. This large magnitude of error coverage, in terms of number, can be applied to correct Optical Character Recognition (OCR) errors, to automatically-mark (based on grammar and semantics) subjective examination papers, etc. A major drawback of our proposed approach is the dependence on the availability of enough 5-grams. The future challenge is how to tackle this problem, while keeping the approach unsupervised.
References 1. Brants, T., Franz, A.: Web 1T 5-gram corpus version 1.1. Technical report, Google Research (2006) 2. Atwell, E., Elliot, S.: Dealing with ill-formed english text. In Garside, R., Sampson, G., Leech, G., eds.: The computational analysis of English: a corpus-based approach, London, Longman (1987) 3. Gale, W.A., Church, K.W.: Estimation procedures for language context: Poor estimates are worse than none. In: Proceedings Computational Statistics, PhysicaVerlag, Heidelberg (1990) 69–74 4. Mays, E., Damerau, F.J., Mercer, R.L.: Context based spelling correction. Information Processing and Management 27(5) (1991) 517–522 5. Church, K.W., Gale, W.A.: Probability scoring for spelling correction. Statistics and Computing 1(2) (December 1991) 93–103 6. Golding, A.R., Roth, D.: A winnow-based approach to context-sensitive spelling correction. Machine Learning 34(1-3) (1999) 107–130 7. Golding, A.R., Schabes, Y.: Combining trigram-based and feature-based methods for context-sensitive spelling correction. In: Proceedings of the 34th annual meeting on Association for Computational Linguistics, Morristown, NJ, USA, Association for Computational Linguistics (1996) 71–78 8. Yarowsky, D.: Decision lists for lexical ambiguity resolution: application to accent restoration in spanish and french. In: Proceedings of the 32nd annual meeting on Association for Computational Linguistics, Morristown, NJ, USA, Association for Computational Linguistics (1994) 88–95 9. Gamon, M., Aue, A., Smets, M.: Sentence-level mt evaluation without reference translations: Beyond language modeling. In: European Association for Machine Translation (EAMT). (2005) 103–111 10. Sj¨ obergh, J.: Chunking: an unsupervised method to find errors in text. In Werner, S., ed.: Proceedings of the 15th NoDaLiDa conference. (2005) 180–185 11. Wang, C., Seneff, S.: High-quality speech translation for language learning. In: Proc. of InSTIL, Venice, Italy (2004) 12. Eeg-olofsson, J., Knutsson, O.: Automatic grammar checking for second language learners - the use of prepositions. In: NoDaLiDa, Reykjavik, Iceland (2003) 13. Chodorow, M., Leacock, C.: An unsupervised method for detecting grammatical errors. In: Proceedings of NAACL’00. (2000) 140–147 14. Atwell, E.S.: How to detect grammatical errors in a text without parsing it. In: Proceedings of the third conference on European chapter of the Association for Computational Linguistics, Morristown, NJ, USA, Association for Computational Linguistics (1987) 38–45
15. Islam, A., Inkpen, D.: An unsupervised approach to preposition error correction. In: Proceedings of the IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE’10), Beijing (August 2010) 1–4 16. Felice, R.D., Pulman, S.G.: A classifier-based approach to preposition and determiner error correction in L2 English. In: Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), Manchester, UK, Coling 2008 Organizing Committee (August 2008) 169–176 17. Fouvry, F.: Constraint relaxation with weighted feature structures. In: Proceedings of the 8th International Workshop on Parsing Technologies, Nancy, France (2003) 23–25 18. Vogel, C., Cooper, R.: Robust chart parsing with mildly inconsistent feature structures. In Schter, A., Vogel, C., eds.: Nonclassical Feature Systems. Volume 10. Centre for Cognitive Science, University of Edinburgh (1995) Working Papers in Cognitive Science. 19. Wagner, J., Foster, J., van Genabith, J.: A comparative evaluation of deep and shallow approaches to the automatic detection of common grammatical errors. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLPCoNLL). (2007) 112–121 20. Andersen, O.E.: Grammatical error detection using corpora and supervised learning. In Nurmi, V., Sustretov, D., eds.: Proceedings of the 12th Student Session of the European Summer School for Logic, Language and Information. (2007) 21. Foster, J., Vogel, C.: Parsing ill-formed text using an error grammar. Artif. Intell. Rev. 21(3-4) (2004) 269–291 22. Izumia, E., Uchimotoa, K., Isaharaa, H.: SST speech corpus of Japanese learners’ English and automatic detection of learners’ errors. ICAME Journal 28 (2004) 31–48 23. Brockett, C., Dolan, W.B., Gamon, M.: Correcting ESL errors using phrasal SMT techniques. In: ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, Morristown, NJ, USA, Association for Computational Linguistics (2006) 249–256 24. Lee, J.S.Y.: Automatic Correction of Grammatical Errors in Non-native English Text. PhD thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science (June 2009) 25. Izumi, E., Supnithi, T., Uchimoto, K., Isahara, H., Saiga, T.: Automatic error detection in the japanese learners english spoken data. In: In Companion Volume to Proc. ACL03. (2003) 145–148 26. Islam, A., Inkpen, D.: Real-word spelling correction using Google Web 1T ngram data set. In Cheung, D.W.L., Song, I.Y., Chu, W.W., Hu, X., Lin, J.J., eds.: Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009, Hong Kong, ACM (November 2009) 1689–1692 27. Islam, A., Inkpen, D.: Semantic text similarity using corpus-based word similarity and string similarity. ACM Trans. Knowl. Discov. Data 2 (July 2008) 10:1–10:25 28. Hirst, G., Budanitsky, A.: Correcting real-word spelling errors by restoring lexical cohesion. Natural Language Engineering 11(1) (March 2005) 87–111
Simulating the Effect of Emotional Stress on Task Performance Using OCC Dreama Jain and Ziad Kobti School of Computer Science, University of Windsor Windsor, ON, Canada, N9B-3P4 {jainh,kobti}@uwindsor.ca
Abstract. In this study we design and implement an artificial emotional response algorithm using the Ortony, Clore and Collins theory in an effort to understand and better simulate the response of intelligent agents in the presence of emotional stress. We first develop a general model to outline a generic emotional agent behaviour. Agents are then socially connected and surrounded by objects, or other actors, that trigger various emotions. A case study is built using a basic hospital model where nurse servicing patients interact in various static and dynamic emotional scenarios. The simulated results show that increase in emotional stress leads to higher error rates in nurse task performance. Keywords: Multi-agent system, emotions, behavior, affective computing.
1 Introduction
Theories of emotion are rooted in psychology [1, 2, 4]. The Artificial Intelligence community has been showing growing interest in emotional psychology, particularly in artificial emotional response, in order to formulate a more realistic social adaptation [5-7]. In previous work [11] we describe three emotional theories and design a computer algorithm for each. The Ortony, Clore and Collins theory (OCC) [1] is selected for this study since comparative work between OCC, Frijda's theory [2] and Scherer's theory [3] produces relatively similar results in simulated benchmarks. OCC captures the cognitive structure of emotions and is commonly used in computational models of emotions. According to this theory, there are cognitions and interpretations that lead to the generation of an emotion. These cognitions are determined by the interaction of events, agents, and objects. There is neurological evidence, among others, showing a correlation between emotional stress and task performance. This is readily observed in human behaviour: one underperforms or has an increased likelihood of error when performing a given task under stress. We aim to replicate this behaviour in a simulated session by first creating a generic algorithm outlining such artificial behaviour, and second by testing agent task performance in the presence of emotional stress. Consequently, we build a multi-agent simulation which generates emotions in agents and observes the influence of emotions on agent behavior. In the next section, we highlight related work which has been done so far in this area and discuss the psychological theories that we have used in creating algorithms
for our simulation. Next, we detail the model, the process of emotion generation, and its influence on behaviour. We then perform experiments using a case study of a hospital system, generating emotions in patients and nurses using the OCC theory under different settings, followed by concluding remarks.
2 Related Work Gratch and Marsella [8] and [9] introduce a domain independent framework of emotion known as Emotion and Adaptation (EMA) which not only implements appraisal of events but also generates a coping process for the event and the emotion generated. The authors used a doctor’s example to generate emotions for a child patient and then generate coping strategies in order to cope from that situation. In [9] the authors have generated a method to evaluate their EMA model by comparing it with human behavior, using stress and coping questionnaire. The authors conclude that their work is very close to the results of the questionnaire, with some limitations. Adam, Herzig and Longin [10] describe the recent work done in building emotional agents with the help of BDI logic and the formalization of emotional theories as described by OCC [1]. Their work introduces a logical framework based on BDI (Belief, Desire and Intention) which consists of agents, actions and atomic formulae. With the help of full belief, probability, choice, and like/dislike, various emotions are formalized and a task is performed after decision-making. The authors conducted a case study related to Ambient Intelligence. They developed a logical formalization of twenty two emotions as described by OCC. The authors state that modeling of triggering of emotions from mental states has been done in their research.
3 Generic Model In a generic model we build a multi-agent system where agents are placed randomly on a two dimensional grid. Here an agent represents a human with emotional response in accordance to the underlying emotional theory used, in this case the OCC. There are several randomly distributed objects on the grid which emit emotion as in liking or disliking of the object. The positions of the objects are fixed while the agents move according to a given move speed parameter. Agents can interact with their neighbours on the grid according to a set communication distance. Every time step, in order for an agent to interact with its neighbours it moves randomly on the grid. The current position of the agent is checked relative to a nearby object and a corresponding emotion is reflected in the agent. Next an event is triggered and the agent checks for neighbours in the surrounding area depending upon the communication distance parameter. Now according to the emotional theory a corresponding emotion is generated. These steps are performed for all the agents. According to the OCC theory, if an object is found then according to its value it is recognized as either liked or disliked. If there are no neighbours around the agent then one of the wellbeing, prospect or attribution emotion is generated depending upon the event triggered. If there are neighbours, then for every neighbour agent some emotion is generated out of fortune of others or attribution emotion, again depending upon the
event triggered. A generated emotion history is kept updated. The wellbeing emotion either generates a pleased or a displeased emotion. Furthermore, in prospect emotions we use a set probability for an event to happen according to which hope, fear, satisfaction, fears-confirmed, disappointment and relief emotions can be generated. Attribution emotions generate approval and gratification or disapproval and remorse for the agent’s own actions. For another agent’s action it can generate approval and gratitude or disapproval and anger. Fortune of others emotions can again lead to a pleased or displeased emotion but this time due to some other agent’s action. In order to see the influence of the emotion on the agent, every time step an agent performs some task. If the agent’s emotional state is happy or on the positive side, then the agent can perform the task logically, that is in the way the task was expected to be performed. On the other hand, if the agent’s emotional state is unstable and is on the negative side, like emotional states such as despair, disappointment, sadness, or disgust, then there are chances that the agent may not be able to perform the task efficiently as it is expected to be performed. We define a task to be a process which is composed of a sequence of steps. If each step is performed the way it is expected then the task is said to be completed correctly. We represent the task using a weight directed graph. Every step of the task (node of the graph) has some weight associated with it, which is summed in order to check the completion of the graph. A task for instance can have around 4 to10 steps. The weight of the task can be defined as the sum of the individual weights of the steps of the task. These weights are actually the attention required by the agent to perform that step. In other words, the attention factor of a step to be performed is represented as the weight associated with the node of the graph. A step can have an attention factor varying from 0 to 100. A task is performed by traversing through the graph and summing the weights/attention factor attached to the node/step of the graph/task being traversed. Logically a task is said to be completed if all the steps of the task are performed, that is all the nodes of the graph are visited. So at the completion of the task logically we get a final sum of the measure of attention factors of each step of the task. A task is performed logically if the emotional state of the agent is either on the positive side or neutral. When the emotional state of the agent is on the negative side, then the agent is assumed to commit mistakes or perform the task in a different manner than performing it logically. Under the influence of negative emotion humans tend to make mistakes and sometimes skip a step while performing a task or make decisions emotionally which are not logical. With this motivation in mind, in our simulation if the agent is under the influence of negative emotion [12] then it tends to miss a step or more while performing the task according to its current emotional state. When the agent misses one or more steps, the total sum of the weight of the task is different from that expected or would have occurred when performed logically. We plot this difference of the task completion logically and emotionally on a graph to observe the behavior of the agents. The average of the task attention for all nurses achieved logically and emotionally is plotted on the graph. 
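A minimal sketch of this task model is given below, assuming a fixed skip probability for steps performed under a negative emotional state and a learning threshold of 20 repetitions; the probability value and the function names are our assumptions, not parameters reported here.

    # A task is a sequence of steps with attention weights (0-100); a negative
    # emotional state makes the agent skip steps, so the attained attention
    # falls short of the logical total.
    import random

    def perform_task(step_weights, emotional_state, skip_prob=0.2, experience=0):
        """emotional_state: 'positive', 'neutral' or 'negative';
        experience past ~20 repetitions of the same task suppresses mistakes."""
        logical_total = sum(step_weights)
        if emotional_state != "negative" or experience >= 20:
            return logical_total, logical_total
        attained = sum(w for w in step_weights if random.random() > skip_prob)
        return logical_total, attained

    # e.g. a 6-step task performed under emotional stress
    print(perform_task([60, 40, 80, 50, 70, 30], "negative"))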
Once an agent performs the same task over and over again for a number of times, the agent learns and its ability of making mistakes even under the influence of negative emotion is reduced. For every agent we check how many times it has performed the same task and update it every time the agent performs the same task. Once a threshold is reached, like when the agent has performed the same task for over
20 times, then the agent learns from its experience and gets used to performing the task without making further mistakes even under the influence of negative emotion.
4 Case Study: Nurse-Patient Hospital System We use a case study inspired by the hospital system described in [13]. In this model there are two kinds of agents: patients and nurses. Initially all the patients and nurses are allocated to patient rooms and nurse offices respectively. There are 38 patient agents and 5 nurse agents. The patient agents have fixed locations, while the nurse agents move from one room to another with their main task of servicing the patient. Patients buzz when they need a nurse. Nurses follow a path from there room to the patient’s room, who has buzzed, to service that patient. The path in the simulation has been represented by weighted directed graphs, in which nodes represent particular area and edges represent ability to travel between two adjacent areas. Following this path nurses serve the patients, but in between this path a nurse agent may see another nurse agent and they may interact with each other. The hospital system runs for different time steps, while one time step is equivalent to 12.5 seconds simulated in real time. Initially every patient agent has some emotion which is dependent upon the severity of that patient. In other words if a patient is more severe then he may have more negative emotions. When a nurse serves a patient, the affect of patient’s emotion is reflected on the nurse’s emotional state. For example, if a nurse sees a patient in pain and disappointed by his condition, the nurse may feel sad and displeased. While if a nurse sees a patient recovering and satisfied, the nurse may also feel satisfied and pleased. In this scenario we have used OCC theory for generating emotions, both in patients as well as nurses. Moreover when the nurse interacts with other nurses, their emotional state is also affected. This update in the emotion then causes change in the behavior of the nurses while they perform other tasks, such as preparing medications, documentation, etc. In our simulation we see the affect of emotion on the nurse’s behaviour while they perform their tasks.
5 Experiments and Results We have used three different settings to perform experiments with the simulation. We ran the simulation for 10 times with each setting. In the first setting the patient agent’s emotion is fixed and the nurse does not interact with other nurses. The simulation shows that whenever there are more nurses with a negative state of emotion like displeased, then the nurses tend to skip some step in their tasks. But this does not happen frequently as the patient’s emotion state is constant and more nurses have positive emotional state. In the second setting nurses interact with other nurses while the patient’s emotion is fixed. In this setting, the nurses communicate with each other if they are in the same room. When they communicate their emotional state may also changes. Now as more nurses interact, their emotional states change more often. If they have negative emotional state then they tend to skip a step or more while performing their tasks. The comparison between pure logical way of performing a
task and the emotionally performed task shows, in this case, a large difference. The more the nurses interact, the more often their emotional states change, and the more often they tend to make mistakes. Since the patients' emotions are constant, once a negative emotion is generated it tends to multiply among the nurses as they influence each other, and a large pattern of mistakes made by nurses is seen. In the third setting, the patients' emotional states change with time. A patient's emotion also changes when the nurses visit them. We see a pattern: when the number of unhappy patients increases, the number of unhappy nurses also increases and task performance is affected. When the number of unhappy patients decreases, there are fewer unhappy nurses and consequently more tasks are performed logically. Figure 1 shows the task performance of the nurses, logically and emotionally.
Fig. 1. Graph showing task attention vs. time for logical and emotional performance of the nurse for Setting 3
6 Conclusion and Future Work We define a generic emotional model and apply it to the nurses and patients in the hospital simulation system. It has been observed that the model is able to generate emotions for the nurses depending on the emotions of the patients. Moreover the performance of the task is very much dependent upon the emotion of the nurses. We can conclude that our model is able to generate emotions according to the situations agents encounter and the general model can be implemented in other simulations to
generate emotions and observe the behavior under the influence of these emotions. This model can still be improved by adding learning and adapting capabilities in the agents. When an agent comes across the same situation, like a nurse sees a patient in pain, again and again then they adapt to the situation and then they will not generate the same emotion but will get used to the situation. Moreover personality of the agent can also affect the generation of the emotion, so defining the personality of the agent can be a future work.
References 1. Ortony, A., Clore, G., Collins, A.: The cognitive structure of emotions. Cambridge University Press, Cambridge (1988) 2. Frijda, N.H.: The Emotions. Cambridge University Press, Cambridge (1986) 3. Scherer, K.: Emotion as a multicomponent process: a model and some cross-cultural data. Review of Personality and Social Psychology 5, 37–63 (1984a) 4. Lazarus, R.S.: Emotion and Adaptation. Oxford University Press, Oxford (1991) 5. Sloman, A.: Motives, Mechanisms, and Emotions. In: Boden, M. (ed.) The Philosophy of Artificial Intelligence. Oxford University Press, Oxford (1990) 6. Russell, S.J., Norvig, P.: Artificial Intelligence: a Modern Approach. Prentice-Hall, Inc., Englewood Cliffs (1995) 7. Reilly, N.: Believable social and emotional agents. Doctoral thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA (1996) 8. Gratch, J., Marsella, S.: A domain independent framework for modeling emotion. Jour. of Cog. Sys. Res. 5(4), 269–306 (2004) 9. Gratch, J., Marsella, S.: Evaluating a Computational Model of Emotion. Autonomous Agents and Multi-Agent Systems 11(1), 23–43 (2005) 10. Adam, C., Herzig, A., Longin, D.: A logical formalization of the OCC theory of emotions. Synthese 168, 201–248 (2009) 11. Jain, D., Kobti, Z.: Emotionally Responsive General Artificial Agent Simulation. In: FLAIRS 2011, AAAI Proceedings (to appear, 2011) 12. Dijksterhuis, A.: Think Different: The Merits of Unconscious Thought in Preference Development and Decision Making. J. of Pers. and Soc. Psych. 87(5), 586–598 (2004) 13. Bhandari, G., Kobti, Z., Snowdon, A.W., Nakhwal, A., Rahman, S., Kolga, C.A.: AgentBased Modeling and Simulation as a Tool for Decision Support for Managing Patient Falls in a Dynamic Hospital Setting. In: Schuff, D., Paradice, D., Burstein, F., Power, D.J., Sharda, R. (eds.) Deci. Supp., Annals of Information Systems, vol. 14, pp. 149–162. Springer, Heidelberg (2011)
Base Station Controlled Intelligent Clustering Routing in Wireless Sensor Networks Yifei Jiang and Haiyi Zhang Jodrey School of Computer Science, Acadia University Wolfville, Nova Scotia, Canada, B4P 2R6
Abstract. The main constraints for a Wireless Sensor Network (WSN) are its limited energy and bandwidth. In industrial settings, a WSN deployed with massive node density produces a large amount of redundant sensory traffic, which in turn decreases the network lifetime. In our proposed approach, we investigate the problem of energy-efficient routing for a WSN in a radio-harsh environment. We propose a novel approach to create optimal routing paths by using a Genetic Algorithm (GA) and Dijkstra's algorithm performed at the Base Station (BS). To demonstrate the feasibility of our approach, formal analysis and simulation results are presented. Keywords: Genetic Algorithm; Dijkstra's algorithm; WSN.
1 Introduction
Most Wireless Sensor Networks (WSNs) use battery-powered sensor nodes. Across the whole field of WSNs, the main problem is network lifetime, which is determined by the limited energy supply of every sensor node. The primary function of a WSN is to perform data communication between the BS and all sensor nodes, and WSNs are expected to work for a long duration. Due to the limited energy resources, this expectancy puts constraints on the energy usage. Moreover, because of the limited bandwidth, a large amount of redundant sensory traffic is generated under massive node density. This increases the load on the sensors, which, in turn, drains their power quickly. For this reason, many attempts have been made to prolong node lifetime and to eliminate redundant data [1], [2], [3]. Our proposed technique is designed by absorbing the essence of those aforementioned technologies. To be specific, we propose a Base Station Controlled Intelligent Clustering Routing (BSCICR) protocol that uses an aggregation method to handle redundant data and creates an energy-efficient multi-hop routing based on cluster-heads (CHs) by using Dijkstra's algorithm at the BS. Moreover, we improve the fitness function of the GA to produce better clusters and CHs. Therefore, the main contribution of this paper is to create an energy-efficient routing protocol for a large-scale WSN used in a radio-harsh environment, and to balance the energy load among all participating nodes. The BSCICR protocol is based on a clustering scheme; however, it has the following specific features: 1. All source nodes are randomly deployed in a WSN with the same initial energy. 2. The BS is located far away from the sensor field
and has a powerful processor and a persistent, stable energy supply. 3. The source nodes and the BS in the WSN are all stationary. 4. The non-CH nodes in the WSN perform sensing and transmitting tasks at a fixed rate, whereas the CHs only perform data aggregation and transmitting operations. To calculate the energy consumption in BSCICR, we utilize the first-order radio model described in [1].
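For reference, a sketch of the first-order radio model of [1] that these energy computations rely on is shown below; the constants are the values commonly quoted with that model and are our assumption here, since the paper defers to [1] for them.

    # First-order radio model: electronics cost per bit plus a distance-dependent
    # amplifier cost per bit (the constants below are commonly quoted, not taken
    # from this paper).
    E_ELEC = 50e-9       # J/bit for transmitter/receiver electronics
    EPS_AMP = 100e-12    # J/bit/m^2 for the transmit amplifier

    def e_tx(k_bits: int, d_m: float) -> float:
        """Energy to transmit k bits over distance d."""
        return E_ELEC * k_bits + EPS_AMP * k_bits * d_m ** 2

    def e_rx(k_bits: int) -> float:
        """Energy to receive k bits."""
        return E_ELEC * k_bits

    # e.g. a 2000-bit packet sent 100 m costs about 0.1 mJ + 2 mJ amplifier energy
    print(e_tx(2000, 100), e_rx(2000))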
2 System Design
For a dynamic environment, the topology for WSN is constantly changed due to the increasing number of dead nodes. In BSCICR, a GA is used to create a globally optimal clusters and CHs based on the network topology. Therefore, the changes for the topology of WSN will greatly influence the accuracy of the result of GA. For this reason, scheduler used in BSCICR is to maintain relatively high accurate results of GA. The online scheduler generates a data aggregation tree, which is the energy-efficient routing paths produced by Dijkstra’s algorithm executed at BS. The generated schedule is defined as the data aggregation tree along with its frequency (intervals). To explain, if an aggregation tree Ti is used for n number of rounds, then at the nth round, all living nodes send their current energy status along with their data packets to BS. Once the scheduler at BS receives those information, BS updates the network topology accordingly. Then, Dijkstra’s algorithm at BS will create a new energy-efficient data aggregation tree Ti+1 based on the up-to-date topology. After a schedule was generated, BS broadcasted it as the multiple packets to the WSN. Meanwhile, in order to decrease the overall network latency, BS sends out synchronization pulses to all sensor nodes. This ensures that all clusters start their data transmission phase at the same time. The schedule consists two single packets, named as IntervalSchedulePacket and ClusterNodeSchedulePakcet shown in Figure 1. As IntervalSchedulePacket on the left, it is composed of five elements, which are Frequency, NodeID, CHID, ClusterID and ChildList. The last four elements represent the addressing scheme used in BSCICR. This addressing scheme is based on the source nodes’ attributes and geographical deployment positions, which is shown as the form of < N odetypeID, LocationID >. For instance, the NodeID and CHID are the source nodes’ identification numbers, which indicate the type of the source nodes as non-CH node and CH node, respectively. As for the ClusterID and ChildList, it shows the composition of each cluster. As the (n − 2) bytes in ChildList, it is the maximum size that child list can have, if there are n source nodes in the corresponding cluster. BSCICR utilizes a “Round-Robin” scheduling scheme to minimize the collisions happened when non-CH nodes transmitted the data to CHs. Therefore, Frequency in IntervalSchedulePacket defines as the interval used in “Round-Robin” scheduling scheme, which denotes as milliseconds. For example, if one of the non-CH nodes starts transmitting its sensory data, the other nodes only transmit their sensory data after certain intervals. After non-CH nodes finished sensing process, they only turn their radio on when it is their turn to transmit the data; otherwise, they are set to be in the sleep state (keep their radio off). This can also
Fig. 1. Schedule packets
provide the benefit to delay the first node death for the whole WSN. Another single schedule packet in Figure 1 is the core schedule packet for our proposed approach, which is only assigned and executed by CHs. This ClusterNodeSchedulePakcet can be divided into two parts: Control Segment and Process Segment. The Control Segment consists of the same < N odetypeID, LocationID > addressing scheme. For the other three elements, CurrentRound, LastRoundFlag and Routing-SequenceList : the first one represents the number of current data gathering round, which initially is set to be 1. This field will be increased by 1 after every data gathering round. LastRoundFlag is used to judge whether the current round is the last data gathering round or not. When it is set to be “True”, it will trigger the Scheduler to generate and disseminate the new schedule, where all elements in Control Segment will be changed. As for the third one, it describes a routing sequence constructed by Dijkstra’s algorithm that guides CHs to transmit the gathering data to BS in an energy-efficient way. With regard to the Process Segment, it contains the processing codes used for eliminating the redundant data. These process codes are dispatched by BS. They are also encapsulated at BS to do further processing on the data gathered from all CHs after each data gathering round. Genetic Algorithm. With the constantly changed number of sensor devices for a WSN used in a radio harsh environment, we utilize a GA to generate an approximate optimal solution. To the best of our knowledge, researchers still use the same methods listed in [4] for handling all these operations of selection approach, crossover and mutation. We will not explain them here. We choose the permutation method to encode a chromosome. In permutation encoding, every chromosome is a string of numbers that indicates the routing sequence. In BSCICR, A chromosome is a collection of genes and represents a single cluster for a given WSN. Each chromosome has the size of fixed length that is determined by the number of source nodes in the WSN. The other key components and operators of GA, such as, fitness and fitness function are described as following: A. Fitness Parameters. The fitness parameters below are designed to define the fitness function of GA. Most important, they are the critical guidelines to minimize the energy consumption during the data transmission phase and prolong the network lifetime. 1. Node Distance (CD): it denotes as the sum of the spatial transmission distances from all non-CH nodes to their corresponding CHs. As described below, for one of the clusters in a given WSN, with j source nodes (indexed from i to j ), and has the coordinates (xch , ych ) and (xnon−ch , ynon−ch ) for the CH and one of the non-CH nodes, respectively, CD is j−1 calculated as CD = i=1 (xch − xnon−ch )2 + (ych − ynon−ch )2 . CD is a very
Intelligent Energy-Efficient Routing in WSN
213
important parameter utilized to control the size of cluster in the cluster setup phase. For a large scale of WSN, the value of CD will be very big, and thus the energy consumption will be higher considered the aforementioned radio model. Hence, in order to achieve the energy-efficient routing, we need to focus on reducing the value of CD to a small number. 2. Routing Distance (D): it represents as the spatial transmission distance between any CH and BS. If the coordinates of CH and BS are (xch , ych ) and (xbs , ybs ), respectively; then, D can be represented
as D = (xbs − xch )2 + (ybs − ych )2 . Parameter D is the core part of BSCICR protocol. According to the radio model, the energy consumed on transmit amplifier of a sensor node is proportional to D4 . Therefore, for energy-efficient purpose, the value of D should be small. 3. Average CHs Distance (CH ): it denotes as the average distances among all CHs. In BSCICR, through executing the Dijkstra’s algorithm at BS, a shortest routing path is generated. The routing path here is a routing sequence list, which contains the multi-hop transmission paths among CHs. After a certain number of data gathering rounds, a few new CHs will be generated by GA. It turns out that the distances between CHs are changed. Therefore, take into account the radio model, CH should also be set as a small value to reduce the energy cost √ on routing the gathering data between n
(x −x )2 +(y −y )2
n i n i CHs. CH is calculated as CH = i=1 n(n−1)/2 , where (xn , yn ) and (xi , yi ) represent as the coordinates of any two different CHs. As for n (n ≥ 1), it indicates the number of CH. In our knowledge, for a complete graph with n vertices, there are at most n(n − 1)/2 edges. 4. Energy Consumption (Ec ): it represents the energy consumed on any cluster of a given WSN. For example, for a given cluster C with j source nodes (indexed from i to j ), Ec can be defined j as Ec = i=1 ET(i,CH) + j × ER + (j − 1) × EDA + ET(CH,BS) . In this equation, the sum of the energy consumed on transmitting the sensory data from every non-CH node to its CH node is denoted as the first term. As for the second term, it shows the energy dissipation on the CH for receiving the gathering data from all non-CH nodes. Regarding to the third and fourth terms, they represent the energy expenditure on executing the operation of data aggregation and transmitting the aggregation data from CH to BS, respectively. In order to design the energy-efficient routing protocol in WSN, obviously, on the precondition of ensuring a good data communication, the smaller value of Ec the better. 5. Number of Data Gathering Rounds (R): it is a predefined number dispatched by BS. According to R, GA decides when to start reproducing the next generation (the populations). Moreover, the value of R can be adjusted by the current energy status of all source nodes in the WSN. Furthermore, if R is assigned to be a larger value for the current population of GA, it indicates that this population has a better fitness value than others, which means this population will be used for a longer period. A reasonable larger value of R will be good for the fitness function of GA to generate the small variations in the best fitness value of the chromosomes. 6. Percentage of CHs (P ): it is the ratio of the total number of active CHs NCH over the total number of participating source nodes NP S (include CH non-CH and CHs) in the WSN, which is defined as P = N NP S × 100%. Here, only
214
Y. Jiang and H. Zhang
alive nodes are considered as participating nodes. In BSCICR, P is computed after the cluster initial phase, since we can get an optimal value of P due to the aforementioned optimal number of clusters k, which is equal to NC H at that moment. Although after certain number of data gathering rounds R, the value of NP S may decrease due to running out of the power on some participating nodes, we should still keep the same value of P for every data gathering round to distribute the participating nodes evenly in each cluster. In this way, it can maintain the lowest energy load on each CH for every data gathering round. Accordingly, extend the network lifetime by postponing the first node death. B. Fitness Function. The fitness function is defined over the above fitness parameters and used to measure the quality of the represented chromosomes. In BSCICR, GA is performed at BS. This provides the BS with the ability to determine the optimal cluster formation that will give the minimum energy consumption during run time. The fitness function in f (x) is represented as f (x) = i (αi × f (xi )), ∀f (xi ) ∈ {CD, D, CH , Ec , R, P }. In this expression, αi is a set of arbitrarily assigned weights for the above fitness parameters. After every generation, the best-fit chromosome is evaluated and all six fitness parameters are updated as Δfi = f (xi+1 ) − f (xi ). The Δfi in this equation represents the change in the value of fitness parameters, where index i(i ≥ 1) represents the number of generations. Therefore, Δfi can be described as the subtraction of the fitness value for the current population and previous population. After every generation, the above six fitness parameters are evaluated to see the improvements. As for the initial weight αi , it can be calculated αi = αi−1 +ci ·Δfi , where ci = 1+e1−fi improves the value of weights based on the previous experience [4]. A suitable range of αi is assigned in the Section of Simulation.
3
Simulation
The simulator is implemented using Java language under Eclipse development environment. The communication channel in WSN is assumed as ideal. Figure 2 shows the graphical user interface of our simulator. From Figure 2, we can clearly see the layout of the simulator. It consists of three parts. First part is on the top of the simulator, which contains the control and input panels.
Fig. 2. Wireless Sensor Network Simulator
Intelligent Energy-Efficient Routing in WSN
215
All simulation parameters can be manually adjusted by using the input text fields on the simulator. For example, the size of WSN can be scaled by manually configuring the network and cluster size in step 2. Through scaling, although the general layout of WSN remains the same due to the limited size of the simulator (800×800 pixels), the spacing among all nodes is adjusted according to the given disk size. Moreover, user can define the network initial state, as well as choose desired data structure and algorithm implementation. The graph panel is in the middle of the simulator, where the graphical results are displayed there. Those results represent the generated energy-efficient routing paths and optimal CHs. For graph visualization, the expected energy-efficient routing paths are shown as a red color, which are distinguished with the connected graph shown on the left with a black color. For example, when all simulation parameters are set in step 1 and 2, after pressing Create and then Connect button in step 3, we can get the corresponding completed graph with all CHs connected, which is shown as the black graph. Next, after pressing Simulate button, the shortest routing paths will be shown as the colored graph. In addition, BS is also displayed as blue and red in both graphs respectively, which is different with the black color represented as all CHs. As for the output panel, it is at the bottom of the simulator, in which shows the data results for all simulations of different protocols. In this panel, user can check the status of energy (minimum, average, maximum and standard), the maximum data gathering rounds achieved, the basic network configuration as well as the selected algorithms and simulation scenarios.
4
Conclusions and Future Work
We proposed a GA-based solution for a large-scale WSN used in a radio harsh environment. We described this WSN as a set of chromosomes (population), which is represented by a GA to compute the approximate optimal routing paths. By utilizing Dijkstra’s algorithm, we were able to transform a dynamic topology of the entire network to a complete graph. More study of GA with improved fitness function is our next step to improve our approach.
References 1. Heinzelman, W.R., Chandrakasan, A., Balakrishnan, H.: Energy-efficient communication protocol for wireless microsensor networks. In: Hawaii International Conference on System Sciences (2000) 2. Muruganathan, S., Ma, R.B.D., Fapojuwo, A.: A centralized energy-efficient routing protocol for wireless sensor networks. IEEE Communications Magazine 43, S8–S13 (2005) 3. Hussain, S., Matin, A.W., Islam, O.: Genetic Algorithm for Hierarchical Wireless Sensor Networks. Journal of Networks 2(5) (2007) 4. Goldberg, D.E.: Genetic algorithms in search, optimization, and machine learning (1989)
Comparison of Semantic Similarity for Different Languages Using the Google n-gram Corpus and SecondOrder Co-occurrence Measures Colette Joubarne and Diana Inkpen School of Information Technology and Engineering University of Ottawa, ON, Canada, K1N 6N5 [email protected], [email protected]
Abstract. Despite the growth in digitization of data, there are still many languages without sufficient corpora to achieve valid measures of semantic similarity. If it could be shown that manually-assigned similarity scores from one language can be transferred to another language, then semantic similarity values could be used for languages with fewer resources. We test an automatic word similarity measure based on second-order co-occurrences in the Google ngram corpus, for English, German, and French. We show that the scores manually-assigned in the experiments of Rubenstein and Goodenough’s for 65 English word pairs can be transferred directly into German and French. We do this by conducting human evaluation experiments for French word pairs (and by using similarly produced scores for German). We show that the correlation between the automatically-assigned semantic similarity scores and the scores assigned by human evaluators is not very different when using the Rubenstein and Goodenough’s scores across language, compared to the language-specific scores.
1 Introduction Semantic similarity refers to the degree to which two words are related. Measures of semantic similarity are useful for techniques such as information retrieval, datamining, question answering, and text summarization. As indicated by Irene Cramer [2] many studies such as question answering, topic detection, and text summarization,, rely on semantic relatedness measures based on word nets and/or corpus statistics as a resource. However, these approaches require large and various amounts of corpora, which are often not available for languages other than English. If it could be shown that measures of semantic similarity have a high correlation across languages, then values for semantic similarity could be assigned to translated ngrams; thus enabling one set of values to be applied to many languages. Determining semantic similarity is routinely performed by humans, but it is a complex task for computers. Gabrilovich and Markovitch [3] point out that humans do not judge text relatedness only based on words. Identification of similarity involves reasoning at a much deeper level that manipulates concepts. Measures of similarity for humans are based on the larger context of their background and experience. Language C. Butz and P. Lingras (Eds.): Canadian AI 2011, LNAI 6657, pp. 216–221, 2011. © Springer-Verlag Berlin Heidelberg 2011
Comparison of Semantic Similarity for Different Languages
217
is not merely a different collection of characters, but is founded on a culture that impacts the variety and subtlety of semantically similar words. For example, in French, often described as the “language of love”, the verbs “to like” and “to love” both translate to “aimer”. The word pair “cock, rooster” from Rubenstein and Goodenough [10] translate to “coq, coq” in French, and “Hahn, Hahn” in German. Rubenstein and Goodenough [10] defined the baseline for the comparison of semantic similarity measures. However, the fact that translation is not a 1:1 relation introduces difficulty in the use of a baseline. Understanding whether it is possible to use translated words to measure semantic similarity using corpora from another language is the goal of this experiment.
2 Related Work Automatically assigning a value to the degree of semantic similarity between two words has been shown to be quite difficult [5]. Rubenstein and Goodenough [10] presented human subjects with 65 noun pairs and asked them how similar they were on a scale from 0 to 4. Miller and Charles [8] took a subset of this data (30 pairs) and repeated this experiment. Their results were highly correlated (97%) to those of the previous study. Semantic similarity is a fundamental task in Natural Language Processing, therefore many different approaches to automate measures of semantic similarity of words have been studied. Jarmasz and Szpakowicz, [7] used a computerized version of Roget’s Thesaurus to calculate the semantic distance between the word pairs. They achieved correlation of 0.82 with Rubenstein and Goodenough’s [10] results. Budanitsky and Hirst [1] compared 5 different measures of semantic similarity based on WordNet. They found that when comparing the correlation of each measure with Rubenstein and Goodenough’s [10] human evaluator scores, the difference between the automatic measures was small (within 0.05). Islam and Inkpen [6] introduced Second Order Co-occurrence PMI as a measure of semantic similarity, and achieved results with a 0.71 correlation to Rubenstein and Goodenough [10] when measured using the British National Corpus (BNC)1. Hassan and Mihalcea [4] use the interlanguage links found in Wikipedia to produce a measure of relatedness using explicit semantic analysis. They achieved a correlation with Miller and Charles [8] word pairs between 0.32 and 0.50 for Spanish, Arabic and Romanian. Not surprisingly, they found that better results were achieved for languages with a larger Wikipedia. Mohammad et al [9] proposed a new method to determine semantic distance combining text from a language, such as German, which has fewer corpora available, with a knowledge source in a language with large corpora available, such as English. They combined German text with an English thesaurus to create cross-lingual distributional profiles of concepts to achieve a correlation of 0.81 with Rubenstein and Goodenough’s word pairs [10]. Typically, two approaches have been used to solve multilingual problems, rulebased systems and statistical learning from parallel corpora. Rule-based systems usually have low accuracy, and parallel corpora can be difficult to find. Our approach 1
http://www.natcorp.ox.ac.uk/
218
C. Joubarne and D. Inkpen
will be to use manual translation and language-specific corpora, in order to measure and compare semantic similarity for English, French and German, using second-order co-occurrence.
3 Data The data used was the Google n-gram corpus, which included n-grams (n=1-5) generated from roughly 100 billion word tokens from the web for each language. Only the unigrams, bigrams and 5-grams were used for this project. Since the purpose is to compare the semantic similarity of nouns only, and to compare results achieved on the same data, it was decided that removal of non-alphabetic characters and stemming of plurals was sufficient for our purposes.2 The word pairs were taken from Rubenstein and Goodenough [10] and translated into French using a combination of Larousse French-English dictionary, Le Grand dictionnaire terminologique, maintained by the Office quebecois de la langue francaise, a couple of native speakers and a human translator. In some cases where the semantic similarity of the word pair was high, the direct translation of each word in the word pair resulted in the same word. In these cases the pair was left out completely. The semantic similarity of the translated words was then evaluated by human judges. The 18 evaluators, who had French as their first language, were asked to judge the similarity of the French word pairs. They were instructed to indicate, for each pair, their opinion of how similar in meaning the two words are on a scale of 0-4, with 4 for words that mean the same thing, and 0 for words that mean completely different things. The results were averaged over the 18 responses (with the exception of three word pairs, where the respondents left their scores blank, so these were only averaged over 17). For 71% of the word pairs there was good agreement amongst the evaluators, with over half of the respondents agreeing on their scores; however in 23% of the cases, there was high disagreement with scores ranging from 0-4. The results can be seen in Appendix A3, which presents the words pairs for the three languages used in our study together with the similarity scores according to human judges. The German translation of the word pairs, including human evaluation of similarity, was borrowed from Mohammad et al [9]. Some of the word pairs do not match exact translations. Since the focus of their study was on the comparison between scores from human evaluators and automated results, they addressed the issue of semantically similar words resulting in identical words during translation, by choosing another related word. A comparison of the frequencies for similarity values amongst all evaluators for each language, presented in Table 1, shows that the English and German scores are similarly distributed, whereas the French scores are more heavily weighted around a score of 0 and 1. 2
3
Stopword removal and stemming was performed during further research, but it was found that results were significantly worse for stopword removal and stemming, and relatively unchanged for stopword removal alone. Stopword lists were taken from Multilingual Resources at University of Neuchatel. The Lingua stemming algorithms was used. Available at http://www.site.uottawa.ca/~mjoub063/wordsims.htm
Comparison of Semantic Similarity for Different Languages
219
Table 1. Frequency of similarity scores Similarity Score 0 1 2 3 4
English 0 25 12 8 20
Frequency German 4 19 16 4 22
French 15 23 5 10 12
4 Methodology Unigram and bigram counts were taken directly from the 1-gram and 2-gram files, taking into account characters and accents in the French and German alphabets. Second order counts were generated from the 5-gram data. Two measures of semantic similarity were used, point-wise mutual information and second order co-occurrence point-wise mutual information. These measures were calculated for each set of word pairs, and compared to the baseline measures from the original data set, as well as the new values generated by human evaluators. Point-wise mutual information (PMI) measure is a corpus-based measure, as opposed to a dictionary-based measure of semantic similarity. PMI measures the more general sense of semantic relatedness where two words are related by their proximity of use without necessarily being similar. The PMI score between 2 words w1 and w2 is defined as the probability of the 2 words appearing together divided by the probability of each word occurring separately. PMI was chosen because it scales well to larger corpora, and it has been shown to have the highest correlation amongst corpus-based measures [6]. Second order co-occurrence PMI (SOC-PMI) is also a corpus-based measure that determines a measure of semantic relatedness, based on how many words appear in the neighbourhood of both words. The SOC-PMI score between 2 words w1 and w2 is defined as the probability of word y appearing with w1 and of y appearing with w2, within a given window in separate contexts. SOC-PMI was chosen because it fits well with the Google n-gram corpora. The frequencies for a window of size 5 are easily obtained from the 5-gram counts. The formula can be found in Islam and Inkpen [6].
5 Results The PMI and SOC-PMI scores were calculated for each set of word pairs and compared to both the scores collected by Rubenstein and Goodenough [10] and the language specific scores collected from human evaluators (see Table 2). Table 2. Pearson correlation of calculated PMI and SOC-PMI scores with R&G scores and new human evaluator scores Language English French German
PMI 0.41 0.34 0.40
vs. R&G SOC-PMI 0.61 0.19 0.27
PMI n/a 0.29 0.47
vs. Evaluators SOC-PMI n/a 0.17 0.31
220
C. Joubarne and D. Inkpen
6 Discussion Our best correlation of 0.61 for the English SOC-PMI is not as good as that achieved by Islam and Inkpen [6]. However, their correlation of 0.73 was achieved using the BNC. The higher results could possibly be explained by the lack of noise in the BNC (discussion of noise issues found in Google n-gram corpus appears in Section 7), as well as the ability to use a larger window than supported by the Google 5-grams. The correlation of the SOC-PMI scores and the original scores was slightly lower than for the human scores for the German word pairs, and slightly higher for the French word pairs. Almost 2/3 of the French and German word pairs had a SOC-PMI of 0. This is reflected in the poor correlation values and is likely due to the fact that the French and German corpora were approximately 1/10 the size of the English corpus.
7 Conclusion and Future Work Given the lack of data for over 2/3 of the French and German pairs, it is not possible to make any claims with any certainty; however, since the results were not significantly improved by using language specific human evaluation, the results do suggest that it might be possible to transfer semantic similarity across languages. While further work needs to be done to confirm our hypothesis, we have produced a set of human evaluator scores for French which can be used for future work. Although results were improved from earlier work, given the larger corpora for English, it appears that larger French and German corpora are still required to draw any significant conclusions. The Google n-gram corpora, for both French and German, contain approximately 13 billion tokens each; however, many of these tokens are not words. There are strings of repeating combinations of letters and many instances of multiple words in one token. For example, there are roughly 500-1000 tokens containing “abab” or “cdcd” and every other combination. There are 2000 occurrences of “voyageurdumonde” and 5000 of “filleougarcon”. Future work of this type with the Google n-grams should consider using a dictionary to filter out these kinds of tokens. Another approach would be to select words that are common in all of the languages of interest, and that result in unique word pairs after translation. A new baseline would have to be created. This would require some study of word frequencies, and effort being spent in having the semantic similarity of the word pairs evaluated by human evaluators. Budanitsky and Hirst [1] suggest a different approach. In their comparison of 5 different measures of semantic similarity, they suggest that comparing only to human evaluator scores is not a sufficient comparison, and that what we are really interested in is the relationship between the concepts for which the words are merely surrogates; the human judgments that we need are of the relatedness of word senses, not for words. They attempt to define such an experiment, and find that the effectiveness of the 5 measures varies considerably when compared this way. The idea of using the
Comparison of Semantic Similarity for Different Languages
221
relatedness of word senses, not of words, could possibly overcome some of the issues4 encountered when translating the word pairs.
Acknowledgements We address our thanks to the Social Science Research Council (SSHRC) and to the Natural Sciences and Engineering Research Council (NSERC) of Canada for supporting this research work. We thank Aminul Islam for sharing his code for SOCPMI. We thank Saif Mohammad for sharing the German word pair similarity scores. We also thank Stan Szpakowicz for his comments on the draft of this paper.
References 1. Budanitsky, A., Hirst, G.: Semantic distance in WordNet: An experimental, applicationoriented evaluation of five measures. In: Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources, Pittsburgh (2001) 2. Cramer, I.: How Well Do Semantic Relatedness Measures Perform? A Meta-Study. In: Proceedings of STEP 2008 Conference, vol. 1, pp. 59–70 (2008) 3. Gabrilovich, E., Markovitch, S.: Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. In: Proceedings of The 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India (January 2007) 4. Hassan, S., Mihalcea, R.: Cross-lingual Relatedness using Encyclopedic Knowledge. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2009), Singapore, pp. 1192–1201 (August 2009) (to appear) 5. Inkpen, D., Desliets, A.: Semantic Similarity for Detecting Recognition Errors in Automatic Speech Transcripts. In: EMNLP 2005, Vancouver, Canada (2005) 6. Islam, A., Inkpen, D.: Second Order Co-occurrence PMI for Determining the Semantic Similarity of Words. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, pp. 1033–1038 (May 2006) 7. Jarmasz, M., Szpakowicz, S.: Roget’s Thesaurus and semantic similarity. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2003), Borovets, Bulgaria, pp. 212–219 (2003) 8. Miller, G.A., Charles, W.G.: Contextual correlates of semantic similarity. Language and Cognitive Processes 6(1), 1–28 (1991) 9. Mohammad, S., Gurevych, I., Hirst, G., Zesch, T.: Cross-lingual distributional profiles of concepts for measuring semantic distance. In: Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP/CoNLL 2007), Prague, Czech Republic (2007) 10. Rubenstein, H., Goodenough, J.B.: Contextual correlates of synonymy. Communications of the ACM 8(10), 627–633 (1965)
4
In some cases the translation of word pairs resulted in the same word, and in other cases the result produced a phrase, or a more obscure word. For example – “midday, noon” = “midi, midi”, “woodland” = “region boisée”, and “mound” = “monticle”.
A Supervised Method of Feature Weighting for Measuring Semantic Relatedness Alistair Kennedy1 and Stan Szpakowicz1,2 1
SITE, University of Ottawa, Ottawa, Ontario, Canada {akennedy,szpak}@site.uottawa.ca 2 Institute of Computer Science Polish Academy of Sciences, Warsaw, Poland
Abstract. The clustering of related words is crucial for a variety of Natural Language Processing applications. Many known techniques of word clustering use the context of a word to determine its meaning. Words which frequently appear in similar contexts are assumed to have similar meanings. Word clustering usually applies the weighting of contexts, based on some measure of their importance. One of the most popular measures is Pointwise Mutual Information. It increases the weight of contexts where a word appears regularly but other words do not, and decreases the weight of contexts where many words may appear. Essentially, it is unsupervised feature weighting. We present a method of supervised feature weighting. It identifies contexts shared by pairs of words known to be semantically related or unrelated, and then uses Pointwise Mutual Information to weight these contexts on how well they indicate closely related words. We use Roget’s Thesaurus as a source of training and evaluation data. This work is as a step towards adding new terms to Roget’s Thesaurus automatically, and doing so with high confidence.
1
Introduction
Pointwise Mutual Information (PMI) is a measure of association between two values of two random variables. PMI has been applied to a variety of Natural Language Processing (NLP) tasks, and shown to work well when identifying contexts indicative of a given word. In effect, PMI can be used to give higher weights to contexts in which a word occurs frequently, but other words appear rarely, while giving lower weight to contexts with distributions closer to random. Finding these weights requires no actual training data, so it is essentially an unsupervised method of context weighting, an observation also made in [1]. In our paper we show how to incorporate supervision into the process of context weighting. We learn appropriate weights for the contexts from known sets of related and unrelated words extracted from a thesaurus. PMI is then calculated for each context: we measure the association between pairs of words which appear in that context and pairs of words which are known to be semantically related. The PMI scores can then be used to apply a weight to the contexts in which a word is found. This is done by building a word-context matrix which records the counts C. Butz and P. Lingras (Eds.): Canadian AI 2011, LNAI 6657, pp. 222–233, 2011. c Springer-Verlag Berlin Heidelberg 2011
A Supervised Method of Feature Weighting for MSRs
223
of how many times each word appears in each context. By applying our weighting technique to this matrix, we are effectively training a measure of semantic relatedness (MSR). In our experiments, unsupervised PMI measures association between a context and a word, while supervised PMI measures association between a context and synonymy. We also perform experiments combining these supervised and unsupervised methods of learning semantic relatedness between word pairs. Our system uses data from two versions of Roget’s Thesaurus, from 1911 and from 1987, for our supervised context weighting method. We also compare the two versions of Roget’s and determine how its age and size affect it as a source of training data. We use SuperMatrix [2], a system which implements a variety of MSRs. Specifically, we use its Cosine similarity and PMI MSRs. The corpus we use for building our word-context matrix is Wikipedia. Motivation This work is designed to be a step towards automatically updating Roget’s Thesaurus through identifying semantically related words and clustering them. Our goal is not to create a new thesaurus from scratch but rather to update an existing one. We can therefore try to use the existing thesaurus as a tool for learning how words are related, which in turn can help update Roget’s. Rather than relying on unsupervised word similarity metrics, we can use Roget’s Thesaurus to train potentially superior word similarity metrics. This has been partially inspired by [3], where machine learning is used to learn from a corpus words related by hypernymy. Training on known hypernym and non-hypernym pairs in WordNet [4] allows the system to learn to identify hypernyms for adding to WordNet. Roget’s is structured quite differently from WordNet, so the technique of [3] is not appropriate here, but we adopt the “bootstrapping” idea of using a lexical resource to aid in its own expansion.
2
Related Work
A variety of corpus-based Measures of Semantic Relatedness (MSRs) have been developed by NLP researchers – see [5] for an in-depth review. Corpus-based MSRs generally work by representing word w as a vector of contexts in which w appears. The context can be as broad as the document where w appears [6], or as specific as one word, for example in a verb-object relationship with w [7]. Contexts are most often determined using a dependency parser to extract from the text triples w, r, w , where word w is related to another word w by relationship r. The context of w is then the pair r, w . This technique has been widely applied [8–11]. There have been attempts to incorporate some supervision into the process of learning semantic distance. In [12], a function consisting of weighted combinations of precision and recall of contexts is proposed for measuring semantic relatedness. In this function there are two thresholds which the authors optimize
224
A. Kennedy and S. Szpakowicz
using a set of training instances. Many variations on their measure were evaluated on the task of predicting how closely word clusters match that of a thesaurus (as we do), and on pseudo-word-sense-disambiguation. This involves minimal supervision: only two thresholds are learned. There also is related work on learning weights for short document similarity. In [13, 14] a method of learning weights in a word-document matrix was proposed. The authors weighted terms to learn document similarity rather than weighting contexts to learn word similarity. The method was to minimize a loss function rather than to apply PMI. They compared their system against TF.IDF weighting of documents. The documents they used were actually queries and the task was to identify advertisements relevant to a given query. [15] presents another related project. A combination of supervised and unsupervised learning determines whether one verb can be a paraphrase of another. Unsupervised learning is used to bootstrap instances where one verb can be replaced by another. These bootstrapped examples are then used to train a classifier which can tell in what contexts one word can replace another. A supervised method of learning synonyms in [1] is probably the work most closely related to ours. A variety of methods, both distributional and patternbased, for identifying synonymy is followed by machine learning to combine these methods. Such combination was found to give improvement over individual methods. We do not use supervision to combine methods of identifying synonyms but rather to determine the weights for a measure of semantic relatedness. PMI itself has been widely used in NLP. In [16], PMI is used to learn word sentiment by measuring the association between a phrase and other words known to be positive or negative. PMI has also been applied to named entity extraction from text [17] and query classification into types [18]. In [19], PMI is used in an unsupervised manner to assign weights to a word-context matrix. This process is further described in Section 3.
3
Unsupervised Use of PMI for Measuring Semantic Relatedness
We use PMI for both supervised and unsupervised learning of context weights. In this section we describe how PMI is used in an unsupervised way. PMI is actually a measure of association between two events, x and y: P (x, y) P M I(x, y) = log (1) P (x) ∗ P (y) When those two events are a particular word and a particular context, we can measure association between them and use this as a weighting scheme for measuring semantic distance [19]. This is what is calculated when using PMI for unsupervised term-context matrix weighting. To create the term-context matrix we used a tool called SuperMatrix.
A Supervised Method of Feature Weighting for MSRs
3.1
225
SuperMatrix
SuperMatrix [2] is a tool which has implemented a large variety of MSRs on a word-context matrix. These include other variations on PMI [20] and Lin’s measure [8], and measures proposed in [12]. A number of variations on these measures and many others, all referred to as RankWeight Function (RWF) [21, 22] have been implemented and are shown to enhance many of those measures. RWF is interesting as it applies one context weighting function on top of another. Likewise, we will apply different weighting methods on top of each other when we combine supervised and unsupervised context weighting. To use SuperMatrix, we give it a single query word q and ask for it to return the set of 100 words w1 ..w100 most closely related to q.1 To construct a word-context matrix to run the SuperMatrix MSRs, we applied the same methods as [8]. We parsed with Minipar [23] a corpus comprised of about 70% of Wikipedia.2 The parsing results supply dependency triples w, r, w . We split these triples into two parts: a word w and a pair r, w – the context in which w is found. Examples of triples are time, mod, unlimited and time, conj, motion, where the word “time” appears in the context with the modifier “unlimited” and in a conjunction with “motion”. The word-context matrix is constructed from these dependency triples. Each row corresponds to a word w, each column – to one of the contexts, C. That cell of the matrix records count(w, C): how many times w is found in C. As we learn either supervised or unsupervised weights, we change the values in this matrix from straight counts to more appropriate weights. Each row in this matrix is essentially a vector representing a word. The distance between two words is the distance between their vectors. To reduce noise, only words appearing 50 or more times and contexts appearing 5 or more times are included. This gives us a total of 32743 words and 321152 contexts. The average word appears in approximately 480 unique contexts, while each context appears as a feature in around 50 words. We only used nouns in our experiments. 3.2
Applying Unsupervised PMI
A PMI score determines to what extent a word and a context appear together beyond random chance. In this case we have the probabilities P (x) of seeing the word, P (y) of seeing the context and P (x, y) of seeing both together. This is calculated for all contexts in all word vectors. The actual distance between two words a and b is the distance between the vectors of contexts for those words, A and B respectively. One of the most common means of measuring distance between vectors – and indeed the measure we apply – is cosine similarity: cos(A, B) = 1 2
A•B AB
(2)
Scores for each word, in the range 0..1, are provided, but we only need rank. That was a dump of August 2010. 70% was the most data we could process on a computer with 4GB of RAM.
226
A. Kennedy and S. Szpakowicz
Vectors which appear closer together are assumed to have much more similar meaning while vectors that appear farther apart are assumed to have less related meanings. Our two unsupervised MSRs will be plain cosine similarity and PMI weighting with cosine similarity.
4
Supervised Learning of Context Weights
In this section we describe how a weight for each context is learned. For this we need training data, we turn to Roget’s Thesaurus to provide us with lists of known related and unrelated words. 4.1
Roget’s Thesaurus
Roget’s Thesaurus is a nine-level hierarchical thesaurus. The levels, from top to bottom, are Class → Section → Sub-Section → Head Group → Head → Part of Speech → Paragraph → Semicolon Group → Words/Phrases. Earliest published versions of Roget’s come from the 1850s, but it has been constantly under revision: new editions are released every few years. We will use two version of Roget’s. Open Roget’s [24] is a publicly available Java implementation intended for use in NLP research, built on Roget’s data from 1911.3 The second version is proprietary, based on data from the 1987 edition [25]. Generally we prefer to work with public-domain resources. Still, the 1987 Roget’s Thesaurus gives us an opportunity to see how a newer and larger resource compares to an older and smaller one. Roget’s contains a variety of words and phrases divided into four main parts of speech: Nouns, Verbs, Adjectives and Adverbs. In our experiments we will only work with Nouns. The main concepts in Roget’s are often considered to be represented by the Heads, of which there are usually about 1000. The division into parts of speech occurs between the Head and the Paragraph, so that each main concept (Head) contains words in different parts of speech. The smallest grouping in Roget’s is the Semicolon Group (SG), while the next smallest is the Paragraph. SGs group together near-synonyms, while Paragraphs tend to contain a little more loosely related words. An example of some of the Noun SGs and Paragraphs from the Head for “Language” can be seen in Figure 1. Each SG is delimited by a semicolon while Paragraphs start with an italicized word/phrase and end in a period. Our evaluation requires information from the SG and Paragraphs in Roget’s. Table 1 shows the statistics of those groupings: the counts of Noun Paragraphs, SGs, their average sizes in words, and the total count of all Nouns. The latter includes duplicates when a noun appears in two or more SGs. A phrase counts as a single word, although the individual words inside it could be used as well. The 1911 Roget’s has more paragraphs, but the 1987 version has more SGs, more words and a higher average number of words in each grouping. The 1987 Thesaurus should be better for evaluation: it simply has more labeled data. 3
rogets.site.uottawa.ca
A Supervised Method of Feature Weighting for MSRs
227
language; phraseology; speech; tongue, lingo, vernacular; mother tongue, vulgar tongue, native tongue; household words; King’s English, Queen’s English; dialect. confusion of tongues, Babel, pasigraphie; pantomime; onomatopoeia; betacism, mimmation, myatism, nunnation; pasigraphy. lexicology, philology, glossology, glottology; linguistics, chrestomathy; paleology, paleography; comparative grammar. Fig. 1. Excerpt from the Head for “Language” in the 1911 Roget’s Thesaurus Table 1. Counts of Semicolon Groups and Paragraphs, their average sizes, and all Nouns in Roget’s Thesaurus Year Para Count Words per Para SG Count Words per SG Noun Count 1911 4495 10.3 19215 2.4 46308 1987 2884 39.7 31174 3.7 114473
4.2
Supervised Weighting
We want to measure the association between pairs of words appearing in a context and a pair of words appearing in the same SG. For each context C, all the words w1 ..wn which appear in C are collected and all pairs of these words are recorded. C is a pair r, w , while each word wi in w1 ..wn appears in the triple wi , r, w in the parsed Wikipedia. We then find in Roget’s all words in the same SG as wi ∈ w1 ..wn , and record these pairs. Only the words also found in our word-context matrix are included in these counts. These groups of word pairs can be treated as events for which we measure the Pointwise Mutual Information, effectively giving the context C a score. Words which appear in our set of 500 test cases are not included when learning the weights of the contexts. To calculate the PMI, we count the following pairs of words wi , wj (C is a context): – wi and wj are in the same SG and share C [True Positives (tp)]; – wi and wj are in different SGs and share C [False Positives (fp)]; – wi and wj are in the same SG and only one of them appears in C [False Negatives (fn)]; – wi and wj are in different SGs and only one of them appears in C [True Negatives (tn)]. We define the probability of event x as P (x) = x/(tp + tn + f n + f p). Essentially we build a confusion matrix and from it calculate the probabilities. Next, we calculate the PMI for context C, effectively giving a score to this context. P (tp) score(C) = log (3) P (tp + f p) ∗ P (tp + f n) This is repeated for every context in our word-context matrix. Once all the scores have been generated, we can use them to re-weight our word-context matrix. For
228
A. Kennedy and S. Szpakowicz
every word wi which appears in a given context C, its count count(wi , C) is multiplied by score(C). Calculating this number for all contexts is not trouble-free. For one, not all contexts will appear in the training data. To avoid this, we normalize every score(C) calculated in Equation 3 so that the average score(C) is 1; next, we assume that any unseen contexts also have a weight of 1; finally, we multiply the count of context C by score(C) for every word in which C appears. Another problem is that PMI may give a negative score when the two events are less likely to occur together than by chance. In such situations we set score(C) to zero. Another problem is that often the supervised PMI is calculated with a fairly small number of true positives and false negatives, so it may be difficult to get a very reliable score. The unsupervised PMI matrix weighting, on the other hand, will use the distributions of a word and context across the whole matrix, so often will have more data to work with. It may, then, be optimistic to think that supervised PMI will on its own outperform unsupervised PMI. The more interesting experiments will be to see the effects of combining supervised and unsupervised PMI MSRs. 4.3
Experiment Setup
The problem on which we evaluate our technique is that of ranking closely related words. We select a random set of 500 words found in our SuperMatrix matrix and both in the 1911 and 1987 Roget’s Thesaurus, from a possible set of 11725. These 500 words were not used for matrix weighting, described in Section 4.2. For each of these words we use our MSRs to generate a ranked list of the 100 most closely related words in our matrix. These lists are evaluated for accuracy at various levels of recall using Roget’s Thesaurus as a gold standard. Specifically we measure the accuracy at the top 1, 5, 10, 20, 40 and 80 words. We take words from a list of the top 100 but not all of these 100 words will appear in Roget’s. That is why there will be cases in which we cannot find all 40 or 80 words to perform our evaluation. In such cases we simply perform our evaluation on all the words we can use from that list of 100. As shown in Table 1, the newer and larger 1987 version contains more words known to be semantically related than the 1911 version, so we will only use it for evaluation. We measure accuracy at identifying words in the same SG and the same Paragraph. This is done because, when adding new words to Roget’s, one may want to take advantage of both the closely related words (SG) and more loosely related words (Paragraph). In our evaluation we run six different MSRs. We use unsupervised cosine similarity and an unsupervised PMI MSRs as low and high baselines. We also test cosine similarity when context weights are learned using both the 1987 and 1911 Roget’s Thesaurus. These MSRs are denoted 1987-Cosine and 1911-Cosine. They can be compared to the unsupervised PMI MSR. Finally we attempt to combine the supervised and unsupervised matrix weighting. This is done by first applying the weighting learned through supervision to the word-context matrix and then using the unsupervised PMI MSR on that matrix, once again for both
A Supervised Method of Feature Weighting for MSRs
229
versions of Roget’s. These MSRs are denoted 1987-PMI and 1911-PMI. Although this may not seem intuitive, it is not so different from the RWF measures, in that two ranking methods are combined. Sample lists generated with two of these measures, 1911-Cosine and PMI, appear in Figure 2. 1911-Cosine – backbencher (0.715), spending (0.657), bureaucracy (0.645), funding (0.619), agency (0.616) PMI – incentive (0.200), funding (0.192), tax (0.187), tariff (0.180), payment (0.176) Fig. 2. The top 5 words related to “Subsidy”, with their similarity score using the supervised 1911-Cosine MSR and unsupervised PMI
5
Experiment Results
We evaluate our new supervised MSRs as well as the unsupervised MSRs on two kinds of problems. In one, we evaluate the ranked list by calculating its accuracy in finding words in the same SG. The second evaluation is done by determining accuracy at finding words in the same Paragraph. 5.1
Ranking Words by Semicolon Group
We count the number of words found to be in the same SG and those known to be found in different SGs in Roget’s Thesaurus. From this we calculate the accuracy of each MSR for the top 1, 5, 10, 20, 40 and 80 related words – see Table 2. In evaluating our results, we broke the data into 25 sets of 20 lists and performed Student’s t-test to measure statistical significance at p < 0.05. The numbers are in bold when a supervised MSR shows a statistically significant improvement over its unsupervised counterpart. Table 2. Evaluation results for identifying related words in the same Semicolon Groups Measure Cosine PMI 1987-Cosine 1987-PMI 1911-Cosine 1911-PMI
Top 1 .110 .368 .146 .378 .146 .372
Top 5 .070 .243 .092 .240 .097 .242
Top 10 .052 .188 .071 .187 .073 .189
Top 20 .039 .136 .055 .136 .055 .138
Top 40 Top 80 .031 .024 .100 .072 .042 .034 .101 .073 .042 .034 .100 .073
Our lower baseline MSR – cosine similarity – does quite poorly. In comparison, 1987-Cosine and 1911-Cosine gives a relative improvement of 30-40%. Supervised learning of context weights using PMI improved the Cosine similarity MSR by a statistically significant margin in all cases. Surprisingly, in a number of cases 1911-Cosine performs slightly better than 1987-Cosine. Figure 2 may suggest
230
A. Kennedy and S. Szpakowicz
why supervised PMI did worse than unsupervised PMI. The latter tended to retrieve closer synonyms, while the former selected many other related words. Supervised matrix weighting with PMI (1911-Cosine and 1987-Cosine) did not work as well as unsupervised matrix weighting with PMI. As noted in Section 4 this is not entirely unexpected. Combining the supervised and unsupervised PMI weighted methods does in some cases show an advantage. 1987-PMI and 1911-PMI showed a statistically significant improvement only when the top 40 and 20 words were counted respectively. That said, in a few cases combining these measures actually hurt results, although never in a statistically significant manner; most often results improved slightly. It is easier to show a change to be statistically significant as more related words are considered, because it provides a more reliable accuracy. This is tested further where we perform evaluation on Paragraphs rather than on SGs. 5.2
Ranking Words by Paragraph
The experiments from Section 5.1 are repeated on Paragraphs – see Table 3. Obviously accuracy at all levels of recall is higher in this evaluation, because there are far more related words in the same Paragraph than in the same SG. Another interesting observation is that the improvement from combining supervised and unsupervised PMI matrix weighting was statistically significant much more often. 1987-PMI showed a statistically significant improvement over PMI when the top 20 or more closest words were used in evaluation. For 1911-PMI the improvement was statistically significant for the top 10 or more closest words. We found improvements of up to 3% when mixing the supervised and unsupervised matrix weighting. Table 3. Evaluation results for identifying related words in the same Paragraphs Measure Top 1 Cosine .256 PMI .624 1987-Cosine .298 1987-PMI .644 1911-Cosine .296 1911-PMI .640
Top 5 .206 .524 .240 .523 .240 .533
Top 10 .173 .466 .208 .470 .209 .478
Top 20 .148 .401 .180 .406 .182 .416
Top 40 .127 .345 .157 .349 .160 .352
Top 80 .110 .287 .138 .291 .141 .295
Once again evaluation on the 1911 Roget’s often performed better than on the 1987 version. It is easier to show statistically significant improvements for Paragraphs than for SGs, because the number of positive candidates grows higher. The data in Table 1 suggest that a word may only have a few other words in the same SG with it, while it will often have dozens of words in the same Paragraph. As a result, when we perform a t-test, each fold contains many more positive examples and so gives better estimate of how much incorporating supervised weighting actually improves these MSRs.
A Supervised Method of Feature Weighting for MSRs
5.3
231
Possible New Word Senses
We have not taken into account the possibility that new or missing senses of words are being discovered. If we look at the highest-ranked word in each list of candidates, we often find that the word appears to be closely related, but sometimes Roget’s labels them as not belonging in that Paragraph or SG. The following are a few of the more obvious examples of closely related words which did not appear in the same Paragraph: invader – invasion; infant – newborns; mafia – mob and evacuation – airlift. Although not all the candidates labeled as unrelated may be as closely related as these pairs, it appears clear that the accuracies we find should be considered as lower bounds on the actual accuracy.
6
Analysis and Conclusion
We have clearly shown that supervised weighting of word-context matrices is a significant improvement over unweighted cosine similarity. Our method of supervised weighting of word-context matrices with PMI was not as effective as unsupervised term weighting with PMI. We found, however, that combining supervised and unsupervised matrix weighting schemes often showed a statistically significant improvement. This was particularly the case when identifying more loosely semantically related words, in the same Paragraph rather than limiting occurrences of related words to the same SG. Never did combining supervised and unsupervised learning actually hurt the results in a statistically significant manner. There are simply are not enough words in the average SGs to prove that incorporating supervised training helps the PMI MSR. This is supported by the fact that when enough data is used – the top 10-20 related words – the evaluation on Paragraphs does show a statistically significant improvement. One surprise was that often weighting the word-context on the 1911 Roget’s Thesaurus performed slightly better than its counterpart weighted with the 1987 version. This is difficult to explain, but the differences between the two trained systems tended to be quite small. This does suggest that the 1911 version of Roget’s provides sufficient data for weighting of these contexts despite its smaller size. This is particularly good news, because the 1987 version is not publicly available, while the 1911 version is. 6.1
Future Work
The long-term motivation for this work is automatic updating of Roget’s Thesaurus with new words. The results we present here suggest that the first step toward that goal has been successful. Next, ranked lists will be used to determine which SGs and Paragraphs are good candidate locations for a word to be added. We applied two version of Roget’s Thesaurus for training our system, but it is quite possible to use other resources, including WordNet. It is also possible to use functions other than PMI for learning matrix weighting. Likelihood ratio tests are known to work well on rare events and should be considered [26].
232
A. Kennedy and S. Szpakowicz
Finally, let us note that we have only used our supervised matrix weighting technique to enhance Cosine similarity and PMI MSRs. Many other measures are available via SuperMatrix, and there are other resources on which supervised matrix weighting could be applied.
Acknowledgments Our research is supported by the Natural Sciences and Engineering Research Council of Canada and the University of Ottawa.
References 1. Hagiwara, M., Ogawa, Y., Toyama, K.: Supervised synonym acquisition using distributional features and syntactic patterns. Journal of Natural Language Processing 16, 59–83 (2005) 2. Broda, B., Jaworski, D., Piasecki, M.: Parallel, Massive Processing in SuperMatrix – a General Tool for Distributional Semantic Analysis of Corpus. In: Proceedings of the International Multiconference on Computer Science and Information Technology, pp. 373–379 (2010) 3. Snow, R., Jurafsky, D., Ng, A.Y.: Semantic Taxonomy Induction from Heterogenous Evidence. In: Proceedings of COLING/ACL 2006, Sydney, Australia (2006) 4. Fellbaum, C. (ed.): WordNet: an Electronic Lexical Database. MIT Press, Cambridge (1998) 5. Turney, P.D., Pantel, P.: From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research 37, 141–188 (2010) 6. Crouch, C.J.: A Cluster-Based Approach to Thesaurus Construction. In: SIGIR 1988: Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 309–320. ACM, New York (1988) 7. Ruge, G.: Automatic Detection of Thesaurus relations for Information Retrieval Applications. In: Foundations of Computer Science: Potential - Theory - Cognition, to Wilfried Brauer on the Occasion of his Sixtieth Birthday, pp. 499–506. Springer, London (1997) 8. Lin, D.: Automatic retrieval and Clustering of Similar Words. In: Proceedings of the 17th International Conference on Computational Linguistics, pp. 768–774. Association for Computational Linguistics, Morristown (1998) 9. Curran, J.R., Moens, M.: Improvements in Automatic Thesaurus Extraction. In: Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX), pp. 59–66 (2002) 10. Yang, D., Powers, D.M.: Automatic Thesaurus Construction. In: Dobbie, G., Mans, B. (eds.) Thirty-First Australasian Computer Science Conference (ACSC 2008). CRPIT, vol. 74, pp. 147–156. ACS, Wollongong (2008) 11. Rychl´ y, P., Kilgarriff, A.: An Efficient Algorithm for Building a Distributional Thesaurus (and other Sketch Engine Developments). In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pp. 41–44. Association for Computational Linguistics, Prague (2007) 12. Weeds, J., Weir, D.: Co-occurrence Retrieval: A Flexible Framework for Lexical Distributional Similarity. Comput. Linguist. 31(4), 439–475 (2005)
13. Yih, W.-t.: Learning term-weighting functions for similarity measures. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, EMNLP 2009, vol. 2, pp. 793–802. Association for Computational Linguistics, Morristown (2009) 14. Hajishirzi, H., Yih, W.-t., Kolcz, A.: Adaptive near-duplicate detection via similarity learning. In: Proceeding of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2010, pp. 419–426. ACM, New York (2010) 15. Connor, M., Roth, D.: Context sensitive paraphrasing with a global unsupervised classifier. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladeniˇc, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 104–115. Springer, Heidelberg (2007) 16. Turney, P., Littman, M.: Unsupervised Learning of Semantic Orientation from a Hundred-Billion-Word Corpus. Technical report NRC technical report ERB-1094, Institute for Information Technology, National Research Council Canada (2002) 17. Etzioni, O., Cafarella, M., Downey, D., Popescu, A.M., Shaked, T., Soderland, S., Weld, D.S., Yates, A.: Unsupervised named-entity extraction from the web: an experimental study. Artif. Intell. 165, 91–134 (2005) 18. Kang, I.H., Kim, G.: Query type classification for web document retrieval. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval, SIGIR 2003, pp. 64–71. ACM, New York (2003) 19. Pantel, P.A.: Clustering by Committee. PhD thesis, University of Alberta (2003) 20. Evert, S.: The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD thesis, Universit¨ at Stuttgart (2004) 21. Piasecki, M., Szpakowicz, S., Broda, B.: Automatic Selection of Heterogeneous Syntactic Features in Semantic Similarity of Polish Nouns. In: Matouˇsek, V., Mautner, P. (eds.) TSD 2007. LNCS (LNAI), vol. 4629, pp. 99–106. Springer, Heidelberg (2007) 22. Broda, B., Derwojedowa, M., Piasecki, M., Szpakowicz, S.: Corpus-based Semantic Relatedness for the Construction of Polish WordNet. In: Calzolari, N., (Conference Chair), Choukri, K., Maegaard, B., Mariani, J., Odjik, J., Piperidis, S., Tapias, D. (eds.) Proceedings of the Sixth International Language Resources and Evaluation (LREC 2008). European Language Resources Association (ELRA), Marrakech (2008) 23. Lin, D.: Dependency-Based Evaluation of MINIPAR. In: Proceedings of the Workshop on the Evaluation of Parsing Systems, First International Conference on Language Resources and Evaluation (1998) 24. Kennedy, A., Szpakowicz, S.: Evaluating Roget’s Thesauri. In: Proceedings of ACL 2008: HLT, pp. 416–424. Association for Computational Linguistics, Morristown (2008) 25. Kirkpatrick, B. (ed.): Roget’s Thesaurus of English Words and Phrases . Longman, Harlow (1987) 26. Dunning, T.: Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics 19(1), 61–74 (1993)
Anomaly-Based Network Intrusion Detection Using Outlier Subspace Analysis: A Case Study David Kershaw1, Qigang Gao1 , and Hai Wang2 1
Faculty of Computer Science, Dalhousie University {kershaw,qggao}@cs.dal.ca 2 Sobey School of Business, St. Mary’s University [email protected]
Abstract. This paper employs SPOT (Stream Projected Outlier deTector) as a prototype system for anomaly-based intrusion detection and evaluates its performance against other major methods. SPOT is capable of processing high-dimensional data streams and detecting novel attacks which exhibit abnormal behavior, making it a good candidate for network intrusion detection. This paper demonstrates that SPOT is effective in distinguishing between normal and abnormal processes in a UNIX System Call dataset.
1 Introduction
Intrusion detection is a field of study which focuses on detecting unwanted behaviours in a computer network. As networked computers are increasingly being used for the storage of sensitive materials, the demand for secure networks has intensified. Today, there is continuing difficulty preventing unwanted behaviours in computer network transactions. There are two major strategies for intrusion detection systems (IDS): misuse-based detection and anomaly-based detection. Misuse-based detection uses well-studied patterns as indicators of intrusive activities. Misuse detection is also known as signature-based detection, since a predefined set of signatures is used as an outline of network attacks. Anomaly-based detection differs from this approach, as it attempts to model a system’s normal behaviour and then statistically determine whether new actions fall within the range of normal, with behaviour outside this range considered possibly harmful. The advantages of anomaly-based network intrusion detection have led to research efforts in recent years to find effective anomaly-based methods for protecting security data, which is often high-dimensional and streaming. SPOT (Stream Projected Outlier deTector) was recently proposed to handle this challenge [8]. This paper aims to apply SPOT [8] as a prototype IDS, taking advantage of its ability to handle high-dimensional streaming data, and comparing its results to another well-known IDS, STIDE (Sequence-based Intrusion Detection Method) [2]. It is posited that SPOT will perform as well as existing methods, and contribute to a unique problem domain: finding abnormalities in high-dimensional streaming data. SPOT is tested against the UNIX System Call dataset from
the University of New Mexico (UNM) [2]. The rest of the paper is organized as follows: first, an overview of anomaly-based intrusion detection is presented through a discussion of some influential papers; second, a summary of the methodology of SPOT; third, the experimental setup and results; and last, a brief conclusion discussing the findings and ideas for future research.
2 Anomaly Based Intrusion Detection on UNIX System Calls
Protecting computer systems from intruders’ attacks has been the goal of much previous research. In order to properly protect against threats, normally an intrusion detection system (IDS) is configured. An IDS is a software and/or hardware system intended to protect against (usually external) penetration. Vulnerabilities are continually being found in desktop software residing on a host system. By attempting to detect abnormal system behaviour via system calls, we can hope to augment a host’s current security level. First, two relevant studies are reviewed. In [2], Forrest et al. defined a unique method of modeling the properties of natural immune systems by analyzing short series of UNIX system calls in common processes. The authors posited that these series can define a ’self’, where abnormalities are detected by previously unseen sequences, i.e. changes in behaviour. These definitions must be flexible enough to allow legitimate activities, such as patching and program installations, while detecting abnormal, possibly harmful activities. The observed set of UNIX system call combinations is consistently small, allowing for the identification of a program’s self-definition. A two-step process is proposed: first, build a database of normal behaviour by scanning the traces and recording the observed sequences; second, examine new traces which could contain new patterns representing abnormal behaviour. The authors conclude that their methods have the potential to operate as an effective online immunity system (STIDE). In [3], Forrest et al. examine four data models used for intrusion detection in system call data. Their methods originate from four different ideologies: 1) Enumeration of Sequences, 2) Frequency-Based Methods, 3) Data Mining Methods, and 4) Finite State Machines, with the goal of comparing their false positive and false negative rates relative to the ideology’s complexity. This study uses live, real-world traces for its datasets. The authors conclude that variations in detection are more indicative of different data streams than of an analysis method’s complexity. Our research attempts to augment these studies by examining the UNM data through subspace analysis using SPOT. It employs methods designed to accommodate real-time, high-dimensional data streams, detecting data in outlying subspaces which could indicate network attacks. Using this method, it is possible to create a normal profile of privileged UNIX processes, detect abnormalities which could indicate potential attacks, and compare the results with STIDE.
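The sequence-enumeration idea in [2] can be illustrated with a short sketch. This is our own simplification for exposition, not the STIDE implementation; the function names and the window-length default are assumptions.

```python
def build_normal_db(traces, window=6):
    """Record every length-`window` sequence of system calls observed in the
    normal traces; the resulting set acts as the program's 'self' definition."""
    db = set()
    for trace in traces:                            # each trace: list of system calls
        for i in range(len(trace) - window + 1):
            db.add(tuple(trace[i:i + window]))
    return db

def mismatch_rate(trace, db, window=6):
    """Fraction of sequences in a new trace never seen during training;
    a high rate suggests abnormal (possibly harmful) behaviour."""
    grams = [tuple(trace[i:i + window]) for i in range(len(trace) - window + 1)]
    return sum(g not in db for g in grams) / len(grams) if grams else 0.0
```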
3 SPOT for Outlier Detection in High Dimensional Data Streams
3.1 Methodology of SPOT
SPOT is an outlier detection approach capable of quickly processing high-dimensional streaming data, which makes it a high-quality candidate for a prototype anomaly-based network IDS [8]. SPOT addresses two major challenges: 1) finding the outlying subspaces which house the projected outliers, and 2) effectively analyzing streaming data. The first challenge results from the exponential growth of the constructed hypercube, which is infeasible to inspect in real time. The second challenge arises because streaming data arrives ordered, allowing only one pass over the data. SPOT makes three major contributions to the domain of high-dimensional outlier detection in data streams:
1. It employs a window-based time model and decaying data summaries which allow the streaming data to be processed in a timely and effective manner.
2. It builds a Sparse Subspace Template (SST), a group of subspaces which are used to detect outliers.
3. It employs a multi-objective genetic algorithm (MOGA) to produce the subspaces used in the construction of the SST.
Fig. 1. Architecture of SPOT [8]
Fig. 1 shows the architecture of SPOT. SPOT can be used for offline learning and online learning. In offline learning, the SST is created, comprising the set of subspaces most likely to contain projected outliers. Its components are the Fixed SST Subspaces (FS), the Unsupervised SST Subspaces (US) and the Supervised SST Subspaces (SS). The FS comprises all the available subspaces, up to a maximum set cardinality. The US uses the MOGA to search for the set of
subspaces most likely to contain the highest number of projected outliers. The SS is a place for domain experts to add subspaces already considered outlying based upon existing domain knowledge, allowing flexibility by directing SPOT and focusing the search. In the online learning stage, the incoming data is processed and the SST is updated appropriately.
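SPOT's decaying data summaries are defined precisely in [8]; the fragment below is only an illustrative sketch of the general idea, assuming a simple exponential decay (the class name and half-life parameter are ours, not SPOT's).

```python
import math

class DecayingCellDensity:
    """Illustrative time-decayed count for one hypercube cell: older points
    contribute exponentially less, so the summary tracks recent stream behaviour."""
    def __init__(self, half_life=1000.0):
        self.decay = math.log(2) / half_life   # per-time-unit decay rate
        self.value = 0.0
        self.last_t = None

    def add(self, t):
        """Register one data point arriving at time t."""
        if self.last_t is not None:
            self.value *= math.exp(-self.decay * (t - self.last_t))
        self.value += 1.0
        self.last_t = t
```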
3.2 SPOT Based Network Intrusion Detection System
An anomaly-based IDS’s function is to effectively model a system’s normal behaviour and detect deviating activities. Here, SPOT and STIDE are implemented as IDSs, and attempt to model the execution of UNIX processes. A model of normal behaviour is established by using the set of system calls a UNIX process generates during its execution. This model is further defined by including order, i.e. by taking sequences of system calls of a certain length. This is known as the window length and is often set to six; however, further research indicates an optimal window length can be determined [6]. A process’s normal execution generates a specific set of system call combinations. Using this collection, a process’s normal behaviour can be modelled, and abnormal behaviour is identified by any sequences outside the normal set. The raw datasets provided by UNM are in the format (pid, syscall), where pid is the process id and syscall is the system call. The system call number represents a unique UNIX system call (e.g. 5 = open). For SPOT, the data must be converted into a multi-dimensional vector-like format. The number of dimensions which make up the converted data is dependent on the size of the chosen sliding window. The window is reset once the process id switches, preserving the sequence information. SPOT is able to map this dataset to multidimensional space and create a normal profile through the relationships between subspaces. Each sequence is mapped, populating the cells in a hypercube. Each subspace’s density is measured by the amount of data mapped to it. Therefore, the hypercube and its associated subspace densities represent the profile of normal behaviour.
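As a concrete illustration of this conversion (the function and the example values are ours, not part of the SPOT implementation):

```python
def traces_to_vectors(records, window=6):
    """Convert (pid, syscall) records into `window`-dimensional vectors of
    consecutive system calls, resetting whenever the process id changes."""
    vectors, buffer, current_pid = [], [], None
    for pid, syscall in records:
        if pid != current_pid:            # new process: reset the sliding window
            buffer, current_pid = [], pid
        buffer.append(syscall)
        if len(buffer) == window:
            vectors.append(tuple(buffer))
            buffer = buffer[1:]           # slide by one call
    return vectors

# e.g. traces_to_vectors([(10, 5), (10, 4), (10, 5), (11, 2)], window=2)
# -> [(5, 4), (4, 5)]
```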
4 Experiment and Results
The UNM datasets have been widely used for testing the effectiveness of different algorithms [3]. The same datasets are employed here to evaluate SPOT’s effectiveness on ordered system call data, and the results are then compared to STIDE. The data is split into training and testing sets. The training data consists solely of normal traces, while the test data consists of a mix of normal and intrusive traces. The intrusive traces represent both real and simulated attacks which were injected into the data. A detector window of length six was chosen, as it has been shown in previous research that a detector window of at least six is necessary for the detection of anomalies [6]. The first step of the offline learning is the construction of the SST. Once the SST is fully constructed, the detection stage begins. During testing, any arriving data point mapped to a cell with a low density receives a high abnormality value. SPOT places these data points in an
“Outlier Repository”, where a user can inspect its outlierness. With an outlierness number associated with each trace, a comparison can be made between normal and intrusive traces. Forrest et al. identified two measurements for evaluating their data modeling methods: false positive percentage, the ratio of normal data classified as anomalous to the total normal data used in testing, and true positive percentage, the ratio of detected intrusions to undetected intrusions in the test data [3]. These measurements are used as the benchmark comparator for system evaluation. For a trace to be considered anomalous, one system call sequence per thousand must be classified as anomalous. SPOT’s sensitivity threshold is a value for the outlierness score; if a data point exceeds this value, it is determined to be anomalous. In STIDE’s case, the threshold value is the LFC size, or locality frame count size. This value represents the size of the frame STIDE employs to detect local anomalies.
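Under these definitions, the per-threshold rates can be computed as in the following sketch. It is our own illustration; the variable names and the encoding of the anomaly criterion are assumptions.

```python
def detection_rates(scored_traces, threshold, anomalous_fraction=1/1000):
    """Classify a trace as anomalous when at least one sequence per thousand
    exceeds the outlierness threshold, then report true/false positive percentages.
    `scored_traces`: list of (is_intrusive, [per-sequence outlierness scores])."""
    tp = fp = intrusive = normal = 0
    for is_intrusive, scores in scored_traces:
        flagged = sum(s > threshold for s in scores)
        anomalous = bool(scores) and flagged / len(scores) >= anomalous_fraction
        if is_intrusive:
            intrusive += 1
            tp += anomalous
        else:
            normal += 1
            fp += anomalous
    true_pos_pct = 100.0 * tp / intrusive if intrusive else 0.0
    false_pos_pct = 100.0 * fp / normal if normal else 0.0
    return true_pos_pct, false_pos_pct
```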
Fig. 2. Overall Average True Positive Results for STIDE and SPOT across all datasets
The results from all datasets are displayed for both true and false positive rates in Fig. 2. This figure represents averages across all datasets. The x-axis shows the thresholds for each data modelling method, while the y-axis shows the percentages. One graph in Fig. 2 compares STIDE’s and SPOT’s true positive rates. It shows that STIDE reaches maximum effectiveness at a threshold of 2, and tapers slowly upwards. SPOT, meanwhile, reaches maximum effectiveness around a threshold of 10-12, surpassing STIDE’s maximum, but is slow to gain ground on STIDE initially. The other graph of Fig. 2 compares the false positive rates. Initially, both STIDE’s and SPOT’s percentages are comparable; however, around a threshold of 10, STIDE’s false positive rate jumps significantly upwards, while SPOT’s rate does not change considerably. Toward the upper end of the threshold range, SPOT’s rate trends upwards toward STIDE’s. Typically, the trade-off between true positive rate and false positive rate must be weighed to determine which method performs best, depending upon the goal of the intrusion detection system.
5 Conclusion
Using the well-studied UNM dataset is an effective means of standardizing and comparing the implementation and results of a chosen data modeling method. SPOT is an ideal data modeling tool for application in this domain. SPOT’s ability to handle high dimensionality allows any desired window size to be specified. Also, SPOT examines each n-dimensional subspace, allowing for the inspection of outlying subspaces within the entire window. This is clearly an advantage over STIDE, which depends on a predetermined window size and does not allow for additional flexibility during its operation. An additional advantage is SPOT’s assignment of statistics for each data point (and subspace) regarding its outlierness. By assigning a specific score, SPOT can discern the degree to which a data point is anomalous, in addition to its inhabited subspaces. This allows the user to specify thresholds for their individual systems. Finally, SPOT can be implemented online, which is necessary in this domain. As SPOT is a relatively new tool, its strengths and weaknesses are still being determined. Applying SPOT to different datasets will provide insight into SPOT’s adaptability. Further research should focus on SPOT’s core abilities to process streaming high-dimensional data in real time in domains where online examination of data is critical.
References
[1] Aggarwal, C.C., Yu, P.S.: Outlier Detection for High Dimensional Data. In: SIGMOD 2001 (2001)
[2] Forrest, S., Hofmeyr, S.A., Somayaji, A., Longstaff, T.A.: A Sense of Self for UNIX Processes. In: Proceedings of the 1996 IEEE Symposium on Security and Privacy, pp. 120–128 (1996)
[3] Forrest, S., Warrender, C., Perlmutter, B.: Detecting Intrusions Using System Calls: Alternative Data Models. In: Proceedings of the 1999 IEEE Computer Society Symposium on Research in Security and Privacy, pp. 133–145 (1999)
[4] Garcia-Teodoro, P., Diaz-Verdejo, J., Macia-Fernandez, G., Vazquez, E.: Anomaly-Based Network Intrusion Detection: Techniques, Systems and Challenges. Computers and Security 28, 18–28 (2009)
[5] Symantec Global Internet Security Threat Report: Trends for 2008, Vol. XIV (2008) (published April 2009)
[6] Tan, K., Maxion, R.: Why 6? Defining the Operational Limits of Stide, an Anomaly-Based Intrusion Detector. In: Proceedings of the 2002 IEEE Symposium on Security and Privacy, pp. 133–145 (2002)
[7] Lee, W., Stolfo, S.J.: A Framework for Constructing Features and Models for Intrusion Detection Systems. ACM Transactions on Information and System Security (TISSEC) 3(4), 227–261 (2000)
[8] Zhang, J.: Towards Outlier Detection for High-Dimensional Data Streams Using Projected Outlier Analysis Strategy. PhD Thesis, Dalhousie University (2008)
Evaluation and Application of Scenario Based Design on Thunderbird Bushra Khawaja and Lisa Fan Department of Computer Science, University of Regina Regina, Saskatchewan, Canada S4S 0A2 {khawajab,fan}@cs.uregina.ca
Abstract. The scenario-based design (SBD) approach has been widely used to improve the user interface (UI) design of interactive systems. In this paper, we show the effectiveness of using the SBD approach to improve the UI design of the Thunderbird email system. First, an empirical evaluation of the system was performed based on user comments. Then, a low-fidelity prototype of the modified interfaces was developed. Finally, the new design interfaces were evaluated using two evaluation methods: a) the GOMS keystroke-level model was used to compare the efficiency of the two interfaces; b) a heuristic evaluation of the system was performed using Nielsen’s usability heuristics. The evaluation results show that the efficiency of accomplishing important and frequently discussed tasks is improved significantly. We conclude that applying the SBD approach to email systems is a promising way to enhance usability. Keywords: Scenario-based Design (SBD), Usability, GOMS Keystroke Level Model (KLM-GOMS), Email Systems, Thunderbird 3 (TB-3).
1 Introduction
The scenario-based design (SBD) methodology is extensively used to improve the user interface (UI) design of interactive systems. In human-computer interaction (HCI), user comments from discussion forums are used to evaluate the UI design of systems. The discussions of real users are very useful as they compare the strengths and weaknesses of the system with other similar systems. A user interaction scenario is a story about people and their activities [4]. A good scenario should include seven characteristic elements, i.e. setting, actors, task goals, plans, actions, events and evaluation [6]. Scenario-based design is a user-centered approach that was developed by Rosson and Carroll over the last ten years. It is an iterative approach that follows the phases of writing problem, activity, information and interaction scenarios based on the user interactions. After each phase, claims analysis is done to highlight the important design features, and the positive and negative implications of these features are analyzed. When there is more than one design alternative, it helps to choose the best design. Real-world and technological metaphors are also brainstormed iteratively before each phase for innovative design ideas. Explaining the SBD approach in
detail is beyond the scope of this paper; however, a couple of references might be helpful for interested readers [4], [5], [6]. During the last decade, the SBD approach has been applied to a number of systems. It was applied in the redesign of a hospital information system [2]. It was also used to improve the design of a digital library of geographical resources [10]. In another work, the phases of SBD were followed to design an interface for a geospatial data infrastructure [1]. In a recent work, three scenario-based methods were applied to develop a web browser system [9]. Most of these works found this approach very effective for improving the user interface design of systems. Email systems face a number of usability problems in terms of efficiency and ease of use due to the increasing bulk of messages. To our knowledge, the SBD method has never been applied to an email system before, even though usability should be the most important concern there. This is what motivated us to apply this method to the TB email system. This study highlights the usability issues in the Thunderbird-3 email system by using the SBD method. The rest of the paper is organized as follows: Section 2 briefly describes the research methodology and experiments by discussing the UI design problems of the existing system and the suggested improvements, shown with interfaces. Section 3 discusses the implications for the new design using two evaluation methods, i.e. the KLM-GOMS and heuristic evaluation. Lastly, Section 4 summarizes the work and states potential future work in this realm.
2 Methodology and Experiments
The discussion forums from the Mozilla Messaging webpage were used as data for writing scenarios [7]. The ‘ideas under construction’ categories and a few threads, such as message pane layout, new message, and address book in new tab, were studied in more detail. The messages that exhibited thoughtful suggestions about UI design and provided details for most of the elements of a scenario were used to write scenarios. A low-fidelity prototype of the modified interfaces was developed. Providing detailed scenarios is beyond the space limitations of this paper. However, the improved features are listed briefly in Table 1 with their problems in the existing design and the suggested solutions in the new design. To better explain the new design ideas, the interfaces for each improved feature are shown in three figures grouped together: the first three features in Fig. 1, the next three in Fig. 2, and the last two in Fig. 3.
Fig. 1. Compact header (top most), attachment bar (centre), blocked content bar (lower most)
Table 1. Considered features with existing design problems and suggested solutions
1. Header: expanded vs. compact (Figure 1, top most)
   Existing design: Expanded by default; the user has to install an add-on to compact it. Subject on the left (may be blank), sender’s address on the right beside the date.
   New design: Compact header suggested to be displayed by default, with a ‘+’ icon to expand. Sender on the left, important tools on the right, i.e. ‘Reply’, ‘Forward’, ‘Junk’ and ‘Delete’.
2. Attachment bar (Figure 1, centre)
   Existing design: Wasted space: displays the whole list of attachments regardless of the number of attachments.
   New design: Left and right scroll arrows if there are more than 5 attachments, and a link ‘More’ with a number indicating how many attachments are not shown.
3. Blocked content bar (Figure 1, lower most)
   Existing design: Wide with wasted space: a lengthy link to always load content on a second line.
   New design: Narrowed to one line, with ‘always show content’ as a button on the same line.
4. New folder creation (Figure 2, centre)
   Existing design: Time consuming: ‘New Folder’ and ‘New Subfolder’ options are provided in the right-click menu. Confusing: the default folder for the main view is named ‘Local Folders’.
   New design: ‘New Folder’ provided under the Inbox; in a pop-up window it asks ‘Create as a subfolder of:’, set to ‘Inbox’ by default. ‘Local Folders’ is suggested to change to ‘Thunderbird’.
5. Folder hierarchy (Figure 2, left most)
   Existing design: Lengthy name for the account folder, i.e. the email address, which creates confusion.
   New design: Shortened to username-account server name, e.g. Dennis-Hotmail.
6. Open message options (Figure 2, right most)
   Existing design: Time consuming, and an irrelevant option is provided in the main menu, i.e. Tools -> Options -> Advanced -> Reading & Display.
   New design: A menu with ‘Open Message’ options is provided in the main tools. Also, an information bubble informs that the options work by double clicking.
7. Tabbed interface (Figure 3)
   Existing design: Difficult to notice a new tab, as the tab bar with the ‘Inbox’ tab is already present on the main page and always stays there. Inconsistent: messages open in tabs, while composing messages and the address book open in new windows.
   New design: ‘Inbox’ is made a main tool instead of a tab. Double clicking a message for the first time pops up a tab bar with a highlighted tab, which makes it more noticeable. Composing messages and the address book are suggested to open in tabs by default.
8. Listing and closing all tabs (Figure 3)
   Existing design: An unclear icon for listing all tabs. Closing tabs is time consuming: a confusing option ‘Close other tabs’ is provided in the right-click menu of the ‘x’ button, and there is no option to close all tabs together.
   New design: A clear, prominent icon at the right most of the tab bar for listing all tabs. A prominent red ‘x’ button to close tabs altogether; a menu button is also provided to choose from ‘Close Tab’, ‘Close Other Tabs’ or ‘Close All Tabs’.
Fig. 2. Folder column pane (left most), new folder creation interface (centre), open message menu (right most) i.e. New Tab, New Window, Existing Windows, Existing Tab, Conversation
Fig. 3. Tabbed interface: tab bar shows the control buttons on each tab, Close and List All Tabs options on the right most of tab bar
3 Results and Implications for New Design
This section presents the promising results obtained using the two evaluation methods: a) KLM-GOMS is used to compare the time required to perform tasks involving the features discussed above using the existing and new design interfaces; b) Nielsen’s heuristic evaluation is performed to measure the usability of the new design ideas.
GOMS Keystroke Level Model (KLM-GOMS). The time chart for the general operators provided by Card et al. [3] is used for the calculation. Table 2 shows the calculated total time required (in seconds) to accomplish five tasks using the existing and new design interfaces.
Table 2. Comparison of total task accomplishment time using KLM-GOMS
Task                                           Existing Design   New Design   Percentage Improvement
1. Replying using ‘Reply’ from header bar        9.70 secs         3.05 secs    68%
2. Creating new folder named ‘Friends’          13.25 secs         9.25 secs    30%
3. To change message opening settings           14.95 secs         5.70 secs    62%
4. Writing a message & locating address book    15.05 secs         5.70 secs    62%
5. Comparing two emails’ contents               36.90 secs        23.3 secs     37%
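For illustration, the kind of calculation behind Table 2 can be sketched as follows. The operator times are commonly cited values from Card et al. [3]; the operator sequence in the example is hypothetical and is not one of the sequences actually used to produce Table 2.

```python
# Commonly cited keystroke-level operator times (seconds) from Card et al. [3].
KLM_TIMES = {"K": 0.28,   # keystroke (average typist)
             "P": 1.10,   # point at a target with the mouse
             "B": 0.10,   # press or release a mouse button
             "H": 0.40,   # home hands between keyboard and mouse
             "M": 1.35}   # mental preparation

def klm_time(operators):
    """Total predicted task time for a sequence of KLM operators."""
    return sum(KLM_TIMES[op] for op in operators)

# A hypothetical "reply from the compact header" sequence:
# think, point at Reply, click (press + release).
print(klm_time(["M", "P", "B", "B"]))   # -> 2.65 seconds
```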
In Table 2, it can be clearly seen from the comparative task accomplishment values that the time required to accomplish tasks using the new design is far less than with the existing design, with an average of 52% less time required for these five tasks. Also, most of the tools provided at hand in the new design improve flexibility and reduce mental effort significantly: for instance, the open message, new folder and compact header tools, the inbox in the main tools, and the control buttons on tabs.
Heuristic Evaluation: Nielsen’s Usability Heuristics. The usability heuristics provided by Jakob Nielsen [8] are very useful for evaluating the UI design of interactive systems. They are general rules of thumb for evaluating the usability of systems. The heuristics were kept in mind during the entire study and were later used to evaluate and illustrate the positive implications of the suggested design. A few important usability heuristics are discussed briefly below for the features described in Section 2.
• Visibility of System Status: All the system messages in TB-3 appear on the screen for just a few seconds, which keeps users uninformed about what is going on. These messages are suggested to stay for at least 20 seconds, for example the messages that are shown while creating a folder and downloading messages.
• User Control and Freedom: Control buttons on tabs make a great difference to tabbing in email systems. As seen in Table 2, task 5, they improve efficiency and give the user a feeling of control. So, email systems should support control buttons on tabs.
• Consistency and Standards: The tabbed interface is made consistent throughout the system in the new design by suggesting that writing messages and the address book should open in tabs. Standards such as the sender name on the header instead of the subject, and the ‘New Folder’ option in the folder column pane, are followed.
• Error Prevention: Rather than users needing to try the confusing ‘Local Folders’ option to create a new folder, it is changed to ‘Thunderbird’ to prevent errors, and a clear ‘New Folder’ option is provided in the folder column for creating a folder.
• Recognition Rather than Recall: To change the default settings for opening messages, users no longer have to memorize a sequence of options as in the existing design. Instead, it can be done using ‘Open Message’ from the main tool bar.
• Flexibility and Efficiency of Use: Improved significantly in the new design. For instance, the ‘Open Message’ and ‘New Folder’ options are provided at hand, ‘Inbox’ is provided in the main tools instead of as a tab (which makes it easier and more flexible to access at any time, for example while writing a message in a tab), and the control buttons on each tab make it more flexible to switch between emails. Tools like Reply and Forward provided on the compact header give users the affordance to perform tasks efficiently with much less mental effort.
4 Conclusion and Future Work
In this paper, we studied the effectiveness of using the scenario-based design (SBD) approach to improve the user interface (UI) design of the Thunderbird email system. Following the phases of the SBD approach, UI design improvements were suggested for a few very important and frequently discussed features. The comparative evaluation
results using the KLM-GOMS show that the new design significantly reduces the mental effort and time required to do the important tasks. For instance, replying, composing messages, locating the address book, comparing emails and creating folders can be done more efficiently using the new design. Moreover, Nielsen’s heuristics, when considered as rules of thumb while redesigning, can be very useful for improvement. The heuristic evaluation results show that the new UI design ideas satisfy most of Nielsen’s usability heuristics, which in turn enhances usability. The empirical evaluation of the two interfaces implies that the new design developed using the SBD approach provides a number of improvements in terms of flexibility, efficiency and ease of use. In the future, we plan to use the SBD approach to evaluate and compare the UI design of a couple of popular email systems. We also envision designing an email system that will keep the strengths of existing systems and overcome their weaknesses as much as possible.
References 1. Aditya, T., Ormeling, F.J., Kraak, M.J.: Advancing a National Atlas-Based Portal for Improved Use of a Geospatial Data Infrastructure: Applying Scenario-Based Development. In: Proceedings of Map Asia (2009) 2. Bardram, J.: Scenario-based Design of Cooperative Systems. In: Group Decision and Negotiation. LNCS, vol. 9(3), pp. 237–250. Springer, Heidelberg (1974) 3. Card, S.K., Moran, T.P., Newell, A.: The Psychology of Human-Computer Interaction. L. Erlbaum, Hillsdale (1983) 4. Carroll, J.M., Rosson, M.B.: Human-Computer Interaction Scenarios as a Design Representation. In: 23rd Hawaii International Conference on System Sciences, Software Track, pp. 555–561. IEEE Computer Society Press, Los Alamitos (1990) 5. Carroll, J.M.: Making Use: Scenario-Based Design of Human-Computer Interactions. MIT Press, Cambridge (2000) 6. Carroll, J.M., Rosson, M.B.: Usability Engineering: Scenario-Based Development of Human-Computer Interaction. Morgan Kaufmann, San Francisco (2002) 7. Community-powered Support for Mozilla Messaging (Online), http://getsatisfaction.com/mozilla_messaging (accessed: November 15, 2010) 8. Nielsen, J.: Ten Usability Heuristics (Online), http://www.useit.com/papers/heuristic/heuristic_list.html (accessed: January 15, 2010) 9. Petkovic, D., Raikundalia, G.K.: An Experience with Three Scenario-Based Methods: Evaluation and Comparison. International Journal of Computer Science and Network Security 9(1), 180–185 (2009) 10. Theng, Y.L., Goh, D.H., Lim, E.P., Liu, Z., Ming, Y., Pang, N.L.S., Wong, P.B.: Applying Scenario-based Design and Claims Analysis to the Design of a Digital Library of Geography Examination Resources. Information Processing and Management 41(1), 23– 40 (2005)
Improving Phenotype Name Recognition Maryam Khordad1 , Robert E. Mercer1 , and Peter Rogan1,2 1
Department of Computer Science 2 Department of Biochemistry The University of Western Ontario, London, ON, Canada {mkhordad,progan}@uwo.ca, [email protected]
Abstract. Due to the rapidly increasing amount of biomedical literature, automatic processing of biomedical papers is extremely important. Named Entity Recognition (NER) in this type of writing has several difficulties. In this paper we present a system to find phenotype names in biomedical literature. The system is based on Metamap and makes use of the UMLS Metathesaurus and the Human Phenotype Ontology. From an initial basic system that uses only these preexisting tools, five rules that capture stylistic and linguistic properties of this type of literature are proposed to enhance the performance of our NER tool. The tool is tested on a small corpus and the results (precision 97.6% and recall 88.3%) demonstrate its performance.
1 Introduction
During the last decade biomedicine has developed tremendously. Every day many biomedical papers are published and a great amount of information is produced. Due to the large number of applications of biomedical data, the need for Natural Language Processing (NLP) systems to process this amount of new information is increasing. Current NLP systems try to extract different kinds of knowledge from the biomedical literature, such as protein–protein interactions [1] [2] [3] [4] [5], new hypotheses [6] [7] [8], relations between drugs, genes and cells [9] [10] [11], protein structure [12] [13] and protein function [14] [15]. In all of these applications, recognizing the biomedical objects, or Named Entity Recognition (NER), is a fundamental step and obviously affects the final result. Over the past years it has turned out that finding the names of biomedical objects in the literature is a difficult task. Some problematic factors are: the existence of millions of entity names, a constantly growing number of entity names, the lack of naming agreement prior to a standard name being accepted, an extreme use of abbreviations, the use of numerous synonyms and homonyms, and the fact that some biological names are complex names that consist of many words, like “increased erythrocyte adenosine deaminase activity”. Even biologists do not agree on the boundaries of the names [16]. Named Entity Recognition in the biomedical domain has been extensively studied and, as a consequence, many methods have been proposed. Some methods like MetaMap [17] and mgrep [18] are generic methods and find all kinds of
entities in the text. Some methods, however, are specialized to recognize particular types of entities like gene or protein names [13] [19], diseases and drugs [9] [20] [21], mutations [22] or properties of protein structures [13]. NER techniques are usually classified into three categories [16]. Dictionary-based techniques like [19] match phrases from the text against some existing dictionaries. Rule-based techniques like [23] make use of some rules to find entity names in the text. And machine learning techniques like [24] transform the NER task into a classification problem. In this paper we focus on phenotype name recognition in the biomedical literature. A phenotype is defined as the genetically-determined observable characteristics of a cell or organism, including the result of any test that is not a direct test of the genotype [25]. The phenotype of an organism is determined by the interaction of its genetic constitution and the environment. Skin color, height and behavior are some examples of phenotypes. We are developing a system that uses existing databases (the UMLS Metathesaurus [26] and the Human Phenotype Ontology (HPO) [27]) to find phenotype names1. Our tool is based on MetaMap [17], which finds phrases and their semantic types. The tool uses these semantic types and some stylistic and linguistic rules to find human phenotype names in the text.
2 Phenotype Name Recognition
The last few years have seen a remarkable growth of NER techniques in the biomedical domain. However, these techniques tend to emphasize finding the names of genes, proteins, diseases and drugs. Although many specialized dictionaries are available, we are not aware of a dictionary which is both comprehensive and ideally suited for phenotype name recognition. For example, the Unified Medical Language System (UMLS) Metathesaurus [26] is a very large, multi-purpose, and multi-lingual vocabulary database that contains more than 1.8 million concepts. These concepts come from more than 100 source vocabularies. The Metathesaurus is linked to the other UMLS Knowledge Sources – the Semantic Network and the SPECIALIST Lexicon. All concepts in the Metathesaurus are assigned to at least one semantic type from the semantic network. However, the semantic network does not contain Phenotype as a semantic type, so it alone is not adequate to distinguish between phenotypes and other objects in text. In addition, some phenotype names do not exist in the UMLS Metathesaurus at all. The Online Mendelian Inheritance in Man (OMIM) [28] is the most important information source about human genes and genetic phenotypes [27]. Over five decades MIM and then OMIM have achieved great success, and now they are used in the daily work of geneticists around the world. Nonetheless, OMIM does not use a controlled vocabulary to describe the phenotypic features
This paper describes linguistic techniques to determine the sequence of words that is a descriptive phrase for a phenotype. A phenocopy is an environmental condition that mimics a phenotype and hence would have the same descriptive phrase as the phenotype name. We are not distinguishing between phenotype and phenocopy.
in its clinical synopsis section, which makes it inappropriate for data mining uses [27]. The Human Phenotype Ontology (HPO) [27] is an ontology that was developed using information from OMIM and is specifically related to human phenotypes. The HPO contains approximately 10,000 terms. Nevertheless, this ontology is not complete and we had several problems finding phenotype names in it. First, some acronyms and abbreviations are not available in the HPO. Second, although the HPO contains synonyms of phenotypes, there are still some synonyms that are not included in the HPO. For example, the HPO contains ENDOCRINE ABNORMALITY, but not ENDOCRINE DISORDER. Third, in some cases adjectives and other modifiers are added to phenotype names, making it difficult to find these phenotype names in the ontology. For example, ACUTE LEUKEMIA is in the HPO, but an automatic system would not suggest that ACUTE MYELOID LEUKEMIA is a phenotype simply by searching in the HPO. Fourth, new phenotypes are being continuously introduced to the biomedical world. The HPO is being constantly refined, corrected, and expanded manually, but this process is not fast enough, nor can the inclusion of new phenotypes be guaranteed.
3 Background
3.1 Named Entity Recognition
Named entities are phrases that contain the name of people, companies, cities, etc., and specifically in biomedical text entities such as genes, proteins, diseases, drugs, or organisms. Consider the following sentence as an example: – The RPS19 gene is involved in Diamond-Blackfan anemia. There are two named entities in this sentence: RPS19 gene and Diamond-Blackfan anemia. Named Entity Recognition (NER) is the task of finding references to known entities in natural language text. An NER technique may consist of some natural language processing methods like part-of-speech (POS) tagging and parsing. Part-of-speech tagging is the process of assigning a part-of-speech or other syntactic class marker to each word in the text [29]. A part-of-speech is a linguistic category of words such as noun, verb, adjective, preposition, etc. which is generally defined by the syntactic or morphological behavior of the word. Parsing is the process of syntactic analysis that recognizes the structure of sentences with respect to a given grammar. Using parsing we can find which groups of words are for example noun phrases and which ones are verb phrases. Complete and efficient parsing is beyond the capability of current parsers. Shallow parsing is an alternative. Shallow parsers decompose each sentence partially into some phrases and after that they find the local dependencies between phrases. They do not analyze the internal structure of phrases. Each phrase is tagged by one of a set of predefined
grammatical tags such as Noun Phrase, Verb Phrase, Prepositional Phrase, Adverb Phrase, Subordinated Clause, Adjective Phrase, Conjunction Phrase, and List Marker [30]. An important syntactic concept that is applied in our tool is the head of a phrase. The head is the central word in a phrase that determines the syntactic role of the whole phrase. For example, in both phrases “low set ears” and “the ears”, ears is the head.
3.2 MetaMap
MetaMap [17] is a widely used program developed by the National Library of Medicine (NLM). MetaMap provides a link between biomedical text and the structured knowledge in the Unified Medical Language System (UMLS) Metathesaurus by mapping phrases in the text to concepts in the UMLS Metathesaurus. To achieve this goal it analyzes the input text in several lexical and semantic steps. First, MetaMap tokenizes the input text. In the tokenization process the input text is broken into meaningful elements, like words. After part-of-speech tagging and shallow parsing using the SPECIALIST Lexicon, MetaMap has broken the text into phrases. Phrases undergo further analysis to allow mapping to UMLS concepts. Each phrase is mapped to a set of candidate concepts and scores are calculated that represent how well the phrase matches the candidates. An optional last step is word sense disambiguation (WSD), which chooses the best candidate with respect to the surrounding text [17]. MetaMap is configurable and there are options for the vocabularies and data models in use, the output format and the algorithmic computations. Human-readable output is one of the output formats. MetaMap’s human-readable output generated from the input text “at diagnosis.” in the sentence “The platelet and the white cell counts are usually normal but neutropenia, thrombopenia or thrombocytosis have been noted at diagnosis.” is shown in Fig. 1. As can be seen, MetaMap found 6 candidates for this phrase and, finally, after WSD it mapped the phrase to the “diagnosis aspect” concept. In the UMLS each Metathesaurus concept is assigned to at least one semantic type. In Fig. 1 the semantic type of each concept is given in the brackets. Semantic types are categorized into groups that are subdomains of biomedicine such as Anatomy, Living Beings and Disorders [31]. These groups are called Semantic Groups (SG). Each semantic type belongs to one and only one SG.
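A rough sketch of how the final mappings can be pulled out of output like that in Fig. 1 is given below. It assumes the human-readable format shown there; the exact field layout may vary across MetaMap versions, and the function name is ours.

```python
import re

def final_mappings(metamap_output):
    """Pull (concept, semantic type) pairs from the Mappings section of
    MetaMap's human-readable output (format as in Fig. 1)."""
    pairs = []
    in_mappings = False
    for line in metamap_output.splitlines():
        if ">>>>> Mappings" in line:
            in_mappings = True
        elif "<<<<< Mappings" in line:
            in_mappings = False
        elif in_mappings:
            # e.g. "  1000 diagnosis (diagnosis aspect) [Qualitative Concept]"
            m = re.match(r"\s*\d+\s+(.*)\[(.+)\]\s*$", line)
            if m:
                pairs.append((m.group(1).strip(), m.group(2)))
    return pairs
```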
3.3 Human Phenotype Ontology (HPO)
An ontology, defined in Artificial Intelligence and related areas, is a structured representation of knowledge in a domain. In fact an ontology is a structure of concepts and the relationships among them. The Human Phenotype Ontology (HPO) [27] is an ontology that tries to provide a standardized vocabulary of phenotypic abnormalities encountered in human disease. The HPO was constructed using information initially obtained from the Online Mendelian Inheritance in
Phrase: "at diagnosis."
>>>>> Phrase
diagnosis
<<<<< Phrase
>>>>> Candidates
Meta Candidates (6):
  1000 Diagnosis [Finding]
  1000 Diagnosis (Diagnosis:Impression/interpretation of study: Point in time:^Patient:Narrative) [Clinical Attribute]
  1000 Diagnosis (Diagnosis:Impression/interpretation of study: Point in time:^Patient:Nominal) [Clinical Attribute]
  1000 diagnosis (diagnosis aspect) [Qualitative Concept]
  1000 DIAGNOSIS (Diagnosis Study) [Research Activity]
   928 Diagnostic [Functional Concept]
<<<<< Candidates
>>>>> Mappings
Meta Mapping (1000):
  1000 diagnosis (diagnosis aspect) [Qualitative Concept]
<<<<< Mappings
Fig. 1. MetaMap output for “at diagnosis”
Man (OMIM) [28] after which synonym terms were merged and the hierarchical structure was created between terms according to their semantics. The hierarchical structure in the HPO represents the subclass relationship. The HPO currently contains over 9500 terms describing phenotypic features.
4 Proposed Method
The development of our system began when we could not find a comprehensive resource for phenotype name recognition. In order to recognize phenotype names (e.g. “thumb duplication”) in the literature, we integrated the available knowledge in the UMLS Metathesaurus and the HPO. By examining the positive and negative results we developed five additional rules. When using them, the performance of our system improved significantly. A block diagram showing our system processing is shown in Fig. 2. The system performs the following steps:
I. MetaMap chunks the input text into phrases and assigns the UMLS semantic types associated with each noun phrase. We used the strict model and the word sense disambiguation embedded in MetaMap.
II. The Disorder Recognizer analyzes the MetaMap output to find phenotypes and phenotype candidates. This part is original to our system and is described in detail in Section 4.1.
III. OBO-Edit [32] is an open source Java program that provides facilities to edit or search ontology files in OBO format. In this step, phenotype candidates from the previous step are searched for in the HPO. Phenotype candidates that are found in the HPO are recognized as phenotypes.
IV. The Result Merger merges the phenotypes found by the Disorder Recognizer and OBO-Edit and produces the output, which is the final list of phenotypes found in the input text.
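A highly simplified sketch of this data flow is shown below. It is our own schematic outline, not the implementation: it omits the acronym, modifier, plural and head rules of Section 4.1, and the set of "problematic" semantic types routed to the HPO check is taken from that section.

```python
DISORDER_TYPES = {"Acquired Abnormality", "Anatomical Abnormality",
                  "Cell or Molecular Dysfunction", "Congenital Abnormality",
                  "Disease or Syndrome", "Experimental Model of Disease",
                  "Finding", "Injury or Poisoning",
                  "Mental or Behavioral Dysfunction", "Neoplastic Process",
                  "Pathologic Function", "Sign or Symptom"}

# Semantic types treated as unreliable on their own (see Section 4.1).
PROBLEMATIC_TYPES = {"Finding", "Disease or Syndrome", "Experimental Model of Disease",
                     "Injury or Poisoning", "Sign or Symptom",
                     "Pathologic Function", "Cell or Molecular Dysfunction"}

def find_phenotypes(noun_phrases, hpo_terms):
    """`noun_phrases`: list of (phrase, semantic type) pairs from MetaMap output;
    `hpo_terms`: set of lower-cased HPO term names."""
    phenotypes, candidates = [], []
    for phrase, sem_type in noun_phrases:          # step II: Disorder Recognizer
        if sem_type not in DISORDER_TYPES:
            continue
        if sem_type in PROBLEMATIC_TYPES:
            candidates.append(phrase)
        else:
            phenotypes.append(phrase)
    confirmed = [p for p in candidates             # step III: HPO lookup (OBO-Edit)
                 if p.lower() in hpo_terms]
    return phenotypes + confirmed                  # step IV: Result Merger
```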
Fig. 2. System block diagram
4.1 Disorder Recognizer
After processing the input text by MetaMap, a semantic type has been assigned to each phrase. The UMLS Semantic Network contains 133 Semantic Types. Unfortunately, Phenotype is not available in these semantic types and it is not easy for non-experts to determine which semantic types are related to phenotypes. These semantic types are categorized into 15 Semantic Groups (SG) [31] that are more general and more comprehensive for non-experts. The Semantic Group Disorders contains semantic types that are close to the meaning of phenotype. This semantic group has been used elsewhere [33] to map terminologies between the Mammalian Phenotype Ontology (MPO)[34] and the Online Mendelian Inheritance in Man (OMIM) [28]. The Semantic Group Disorders contains the following semantic types: Acquired Abnormality, Anatomical Abnormality, Cell or Molecular Dysfunction, Congenital Abnormality, Disease or Syndrome, Experimental Model of Disease, Finding, Injury or Poisoning, Mental or Behavioral Dysfunction, Neoplastic Process, Pathologic Function, Sign or Symptom. Our initial system used MetaMap and the Semantic Group Disorders to recognize phenotypes. However, a number of errors remained with this rudimentary system. After some analysis of these errors, it was decided to apply some post processing steps to overcome the remaining problems: 1. A number of errors were caused by the use of acronyms. MetaMap has the ability to recognize the acronym references but its database does not contain all acronyms. In addition, some acronyms are used for more than one concept and this ambiguity causes problems for MetaMap. Typically, papers indicate the local unambiguous reference for each acronym used at its first usage. Using this knowledge, we create a list of acronym references for each paper using BioText [35] and use this list to process the acronyms found in the remainder of the text. So the first rule is:
Rule 1. Resolve the acronym referencing problem by making and using a list of acronyms occurring in each paper. Several phenotypes are phrases containing more than one biomedical or clinical term. The complete phrase of some of these phenotypes are not available in the UMLS. The UMLS often finds separate concepts for the biomedical and clinical terms in these phrases. Fig. 3 represents the UMLS output for “[The] presented learning disabilities”. There are two separate concepts in the MetaMap output. The first one is “presented” which is assigned to the semantic type [Idea or Concept] and the second one is “learning disabilities” with the semantic type [Mental or Behavioral Dysfunction]. As “presented” is only an adjective for “learning disabilities” in this case, the whole phrase should be considered as one phenotype. So, in these situations the semantic type of the noun phrase head is the most important part and our system should consider the head’s semantic type in order to recognize the semantic type of the whole phrase. So we have the rule: Rule 2. The semantic type of a noun phrase is the semantic type assigned by Metamap to its head. Some phenotypes like “large ventricles” that are not recognized by MetaMap follow a common template. They begin with special modifiers followed by terms that have the Semantic Groups Anatomy or Physiology. This class of phenotypes is mentioned in [31] where a list of 100 special modifiers, having to do with some sort of unusual aspect or dysfunction (like “large”, “defective” and “abnormal”), is given. This list was developed by noticing the modifiers that occur most frequently with MPO terms. For our purposes, we found the list incomplete and we have added three more modifiers found in our small corpus. The three added terms are “missing”, “malformed”, and “underdeveloped”. More modifiers will need to be included. The rule is: Rule 3. If a phrase is “modifier (from the list of special modifiers) + [Anatomy] or [Physiology]” it is a phenotype name. A number of the semantic types in the Semantic Group Disorder include concepts that are not phenotypes, leading to false positives in phenotype name recognition. For example MetaMap assigns “responsible” to the semantic type “Finding”. The word “responsible” is clearly not a phenotype. On the other hand “overgrowth”, which is a phenotype, is assigned to the semantic type “Finding”, too. The problematic semantic groups are: Finding, Disease or Syndrome, Experimental Model of Disease, Injury or Poisoning, Sign or Symptom, Pathologic Function, and Cell or Molecular Dysfunction. Therefore, if a phrase is assigned to these semantic types we cannot be sure that it is a phenotype. We consider the phrases in these semantic types as phenotype candidates that need further analysis. A search for phenotype candidates in the HPO in step III of the process described above confirms whether each phenotype candidate is a phenotype or not. If a phenotype candidate is found in the HPO, it is recognized as a phenotype. While making the candidates list we should consider rules 4 and 5 below. In some cases the phenotype is in plural form but only the singular form is available in the HPO. One example is “deep set eyes”. It is not in the HPO
but “deep set eye” is. So, if the singular form is available in the HPO the plural form is a phenotype. Rule 4. If the single form of a phrase is a phenotype the plural form is a phenotype, too. 6. A phenotype candidate may contain adjectives and adverbs in addition to the phenotype as found in the HPO. In these situations the complete phrase may not be in HPO. So the system will remove the adjectives and adverbs in the phenotype candidate and search for the head of the phrase. Rule 5. If the head of a phenotype candidate phrase is a phenotype, the whole phrase is a phenotype. In summary, the system analyzes all noun phrases, one by one. If the phrase contains an acronym, the reference for the acronym is first resolved using Rule 1. If the phrase matches Rule 3, it is added to the phenotype list, otherwise the semantic type of the phrase is identified by the semantic type of its head according to Rule 2. If the semantic type is in the Semantic Group Disorder, the phrase is recognized as either a phenotype or a phenotype candidate. Phenotype candidates are added to the phenotype candidate list along with their heads and their singular form if they are plural (according to Rules 4 and 5), to be processed in step III. Phrase: "[The] presented learning disabilities" >>>>> Phrase presented learning disabilities <<<<< Phrase >>>>> Candidates Meta Candidates (9): 901 Learning Disabilities [Mental or Behavioral Dysfunction] 882 Learning disability (Learning disability - specialty) [Biomedical Occupation or Discipline] 827 Learning [Mental Process] 827 Disabilities (Disability) [Finding] 743 Disabled (Disabled Persons) [Patient or Disabled Group] 743 Disabled [Qualitative Concept] 660 Presented (Presentation) [Idea or Concept] 627 Present [Quantitative Concept] 627 Present (Present (Time point or interval)) [Temporal Concept] <<<<< Candidates >>>>> Mappings Meta Mapping (901): 660 Presented (Presentation) [Idea or Concept] 901 Learning Disabilities [Mental or Behavioral Dysfunction] <<<<< Mappings
Fig. 3. An example of Rule 1
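A schematic rendering of Rules 3-5 is given below. It is a sketch for exposition only: the modifier list is partial, the `sem_group_of` lookup and the `head` argument stand in for information taken from MetaMap output, and the singularisation is deliberately naive.

```python
SPECIAL_MODIFIERS = {"large", "defective", "abnormal",
                     "missing", "malformed", "underdeveloped"}  # partial list only

def apply_rules(phrase, head, sem_group_of, hpo_terms):
    """Schematic application of Rules 3-5 to one phenotype candidate.
    `sem_group_of` maps a word to its UMLS semantic group; `head` is the
    syntactic head of the phrase; `hpo_terms` is a set of lower-cased HPO names."""
    words = phrase.lower().split()
    # Rule 3: special modifier + [Anatomy] or [Physiology] term
    if (len(words) >= 2 and words[0] in SPECIAL_MODIFIERS
            and sem_group_of(words[-1]) in {"Anatomy", "Physiology"}):
        return True
    # Direct lookup of the full candidate phrase in the HPO
    if phrase.lower() in hpo_terms:
        return True
    # Rule 4: also try the singular form (naive singularisation for illustration)
    if words and words[-1].endswith("s") and " ".join(words)[:-1] in hpo_terms:
        return True
    # Rule 5: fall back to the head of the phrase
    return head.lower() in hpo_terms
```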
5
Evaluation
The system has been evaluated on a corpus containing 120 sentences with 110 phenotype phrases. These sentences are collected from 4 random full text journal articles specialized in human genetics. Not all these sentences contain
Table 1. Results
Method                  Precision   Recall   F-measure
Basic Form                 88.78     74.21     80.84
Applying Only Rule 1       89.38     78.9      83.81
Applying Only Rule 2       97.19     75.91     85.24
Applying Only Rule 3       89.09     76.56     82.35
Applying Only Rule 4       88.9      75.78     81.32
Applying Only Rule 5       89.38     78.9      83.81
Applying All Rules         97.58     88.32     92.71
Table 2. Three sources of errors
Cause of error: MetaMap parser (20% of errors)
  Example: Partial hypoplasia of the corpus callosum. Description: MetaMap finds two separate phrases: “Partial hypoplasia”; “of the corpus callosum”.
  Example: missing vertebrae. Description: MetaMap finds two separate phrases: “missing”; “vertebrae”.
Cause of error: MetaMap WSD (25% of errors)
  Example: learning deficit. Description: [Functional Concept] chosen instead of [Disease or Syndrome].
  Example: triphalangeal thumb. Description: [Gene or Genome] chosen instead of [Congenital Abnormality].
  Example: aplastic anemia. Description: [Gene or Genome] chosen instead of [Disease or Syndrome].
  Example: osteosarcoma. Description: [Gene or Genome] chosen instead of [Neoplastic Process].
  Example: diabetes insipidus. Description: [Functional Concept] chosen instead of [Disease or Syndrome].
Cause of error: Phenotype candidates not in HPO (25% of errors)
  Examples: thumb duplication; thrombopenia; increased erythrocyte adenosine deaminase activity; macrocytosis.
phenotypes. Precision, recall and F-measure are typically used to measure the performance of NER tools. Precision is the percentage of correct entity names in all entity names found and can be seen as a measure of soundness. Recall is the percentage of correct entity names found compared to all correct entity names in the corpus and can be used as a measure of completeness. F-measure is the harmonic mean of equally weighted precision and recall. The performance of differently configured systems is shown in Table 1. The basic form is the integration of UMLS and HPO using none of the rules discussed above. The results
of adding each of the rules are listed in the table. Some errors result from inadequacies in our method, but other errors are caused by incorrect information provided by the systems that we use. Examples of phenotype names not found by our tool as a result of MetaMap mistakes and HPO incompleteness are shown in Table 2. Some errors are the result of an incorrect parse. In some cases MetaMap has true candidates but after WSD a wrong candidate is chosen. Finally, in several examples MetaMap finds reasonable phenotype candidates but they are not found in the HPO. The percentages of total errors that these three sources cause are shown in the table.
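For reference (a restatement added here, not text from the original), with TP, FP, and FN denoting true positive, false positive, and false negative phenotype names, the metrics reported in Table 1 are

  Precision = TP / (TP + FP),   Recall = TP / (TP + FN),
  F-measure = 2 · Precision · Recall / (Precision + Recall).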
6 Summary
Biomedical literature is an important source of information that is growing rapidly. The need for automatic processing of this amount of information is undeniable. One of the basic obstacles to achieving this aim is the recognition of biomedical objects in text. We have presented a system to improve phenotype name recognition. This system integrates two knowledge sources, UMLS and HPO, and MetaMap in an innovative way to find phenotype names in biomedical texts. In essence, our approach applies specific rules to enhance recognition of named entities which originate from specific dictionaries and ontologies. To test the performance of this system, a small corpus has been used, giving recognition results of 97.6% precision and 88.3% recall.

BioMedLEE [36] is a system that extracts a broad variety of phenotypic information from biomedical literature. This system was adapted from MedLEE [37], a clinical information extraction NLP system. To evaluate BioMedLEE, 300 randomly chosen journal titles were used, and BioMedLEE had 64% precision and 77.1% recall. We wanted to compare the performance of our system against this reported performance, but we did not have access to either the software or the corpus used in [36]. In some cases the errors are caused by the tools used, the MetaMap parser and Word Sense Disambiguation function, and by incompleteness of the HPO. Our future aim is to find solutions to these remaining problems and to improve the accuracy of our system. In addition, we plan to build a larger corpus, evaluate the performance of our system more accurately, and compare it to BioMedLEE's performance on this corpus.
References

1. Leroy, G., Chen, H., Martinez, J.D.: A shallow parser based on closed-class words to capture relations in biomedical text. Journal of Biomedical Informatics 36(3), 145–158 (2003)
2. He, X., DiMarco, C.: Using lexical chaining to rank protein-protein interactions in biomedical texts. In: BioLink 2005: Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics, Conference of the Association for Computational Linguistics (2005) (poster presentation)
3. Fundel, K., Küffner, R., Zimmer, R.: RelEx - relation extraction using dependency parse trees. Bioinformatics 23(3), 365–371 (2007)
4. Ng, S.K., Wong, M.: Toward routine automatic pathway discovery from on-line scientific text abstracts. Genome Informatics 10, 104–112 (1999)
5. Yu, H., Zhu, X., Huang, M., Li, M.: Discovering patterns to extract protein-protein interactions from the literature: Part II. Bioinformatics 21(15), 3294–3300 (2005)
6. Swanson, D.R.: Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine 30(1), 7–18 (1986)
7. Hristovski, D., Peterlin, B., Mitchell, J.A., Humphrey, S.M.: Using literature-based discovery to identify disease candidate genes. I. J. Medical Informatics 74(2-4), 289–298 (2005)
8. Hristovski, D., Friedman, C., Rindflesch, T.C., Peterlin, B.: Exploiting semantic relations for literature-based discovery. In: AMIA Annual Symposium Proceedings, pp. 349–353 (2006)
9. Rindflesch, T.C., Tanabe, L., Weinstein, J.N., Hunter, L.: EDGAR: Extraction of drugs, genes and relations from the biomedical literature. In: Pacific Symposium on Biocomputing, vol. 5, pp. 514–525 (2000)
10. Friedman, C., Kra, P., Yu, H., Krauthammer, M., Rzhetsky, A.: GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics (Oxford, England) 17(suppl. 1), S74–S82 (2001)
11. Tanabe, L., Scherf, U., Smith, L.H., Lee, J.K., Hunter, L., Weinstein, J.N.: MedMiner: an Internet text-mining tool for biomedical information, with application to gene expression profiling. BioTechniques 27(6) (1999)
12. Humphreys, K., Demetriou, G., Gaizauskas, R.: Two applications of information extraction to biological science journal articles: enzyme interactions and protein structures. In: Pacific Symposium on Biocomputing, pp. 505–516 (2000)
13. Gaizauskas, R., Demetriou, G., Artymiuk, P.J., Willett, P.: Protein structures and information extraction from biological texts: the PASTA system. Bioinformatics 19(1), 135–143 (2003)
14. Andrade, M.A., Valencia, A.: Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families. Bioinformatics 14(7), 600–607 (1998)
15. Valencia, A.: Automatic annotation of protein function. Current Opinion in Structural Biology 15(3), 267–274 (2005)
16. Leser, U., Hakenberg, J.: What makes a gene name? Named entity recognition in the biomedical literature. Briefings in Bioinformatics 6(4), 357–369 (2005)
17. Aronson, A.R.: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: AMIA Annual Symposium Proceedings, pp. 17–21 (2001)
18. Dai, M., Shah, N.H., Xuan, W., Musen, M.A., Watson, S.J., Athey, B.D., Meng, F.: An efficient solution for mapping free text to ontology terms. In: AMIA Summit on Translational Bioinformatics, San Francisco, CA (2008)
19. Krauthammer, M., Rzhetsky, A., Morozov, P., Friedman, C.: Using BLAST for identifying gene and protein names in journal articles. Gene 259(1-2), 245–252 (2000)
20. Xu, R., Supekar, K., Morgan, A., Das, A., Garber, A.: Unsupervised method for automatic construction of a disease dictionary from a large free text collection. In: AMIA Annual Symposium Proceedings, pp. 820–824 (2008)
21. Segura-Bedmar, I., Martínez, P., Segura-Bedmar, M.: Drug name recognition and classification in biomedical texts: A case study outlining approaches underpinning automated systems. Drug Discovery Today 13(17-18), 816–823 (2008)
22. Horn, F., Lau, A.L., Cohen, F.E.: Automated extraction of mutation data from the literature: application of MuteXt to G protein-coupled receptors and nuclear hormone receptors. Bioinformatics 20(4), 557–568 (2004)
23. Fukuda, K., Tamura, A., Tsunoda, T., Takagi, T.: Toward information extraction: identifying protein names from biological papers. In: Pacific Symposium on Biocomputing, pp. 707–718 (1998)
24. Nobata, C., Collier, N., Tsujii, J.: Automatic term identification and classification in biology texts. In: The 5th NLPRS Proceeding, pp. 369–374 (1999)
25. Strachan, T., Read, A.: Human Molecular Genetics, 3rd edn. Garland Science/Taylor & Francis Group (2003)
26. Humphreys, B.L., Lindberg, D.A., Schoolman, H.M., Barnett, G.O.: The Unified Medical Language System: an informatics research collaboration. J. Am. Med. Inform. Assoc. 5(1), 1–11 (1998)
27. Robinson, P.N., Mundlos, S.: The Human Phenotype Ontology. Clinical Genetics 77(6), 525–534 (2010)
28. McKusick, V.: Mendelian Inheritance in Man and Its Online Version, OMIM. The American Journal of Human Genetics 80(4), 588–604 (2007)
29. Jurafsky, D., Martin, J.H.: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition, 2nd edn. Prentice Hall, Englewood Cliffs (2008)
30. Shatkay, H., Feldman, R.: Mining the biomedical literature in the genomic era: an overview. J. Comput. Biol. 10(6), 821–855 (2003)
31. McCray, A.T., Burgun, A., Bodenreider, O.: Aggregating UMLS Semantic Types for Reducing Conceptual Complexity. Proceedings of Medinfo 10(pt 1), 216–220 (2001)
32. Day-Richter, J., Harris, M.A., Haendel, M., Obo, T.G.O., Lewis, S.: OBO-Edit - an ontology editor for biologists. Bioinformatics 23(16), 2198–2200 (2007)
33. Burgun, A., Mougin, F., Bodenreider, O.: Two approaches to integrating phenotype and clinical information. In: AMIA Annual Symposium Proceedings, pp. 75–79 (2009)
34. Smith, C., Goldsmith, C.A., Eppig, J.: The Mammalian Phenotype Ontology as a tool for annotating, analyzing and comparing phenotypic information. Genome Biology 6(1), R7+ (2004)
35. Schwartz, A.S., Hearst, M.A.: A simple algorithm for identifying abbreviation definitions in biomedical text. In: Pacific Symposium on Biocomputing, pp. 451–462 (2003)
36. Chen, L., Friedman, C.: Extracting phenotypic information from the literature via natural language processing. Medinfo 11(Pt 2), 758–762 (2004)
37. Friedman, C., Alderson, P.O., Austin, J.H., Cimino, J.J., Johnson, S.B.: A general natural-language text processor for clinical radiology. Journal of the American Medical Informatics Association 1(2), 161–174 (1994)
Classifying Severely Imbalanced Data

William Klement1, Szymon Wilk2, Wojtek Michalowski3, and Stan Matwin4,*

1 Thomas Jefferson Medical College, PA, USA, [email protected]
2 Poznan University of Technology, Poland, [email protected]
3 Telfer School of Management, Uni. of Ottawa, Canada, [email protected]
4 SITE, University of Ottawa, Canada, [email protected]
Abstract. Learning from data with severe class imbalance is difficult. Established solutions include: under-sampling, adjusting classification threshold, and using an ensemble. We examine the performance of combining these solutions to balance the sensitivity and specificity for binary classifications, and to reduce the MSE score for probability estimation. Keywords: Classification, Class Imbalance, Sampling, Ensembles.
1 Introduction
In medical domains, severe class imbalance is common and is difficult to cope with. For example, classifying head injury patients to assess their need for CT scans, examining mammogram images to detect breast cancer, and predicting heart failures are all tasks that deal with severely imbalanced data because, usually, there are fewer patients who suffer from an acute condition (positives) than not (negatives). Other domains face this problem also; fraud detection [7], anomaly detection, information retrieval [13], and detecting oil spills [12] are a few examples. When faced with severe class imbalance, most machine learning methods struggle to achieve a balanced performance. The distinction between the problem of severe class imbalance and the problem of a small minority class is crucial. Often, the problem of insufficient data occurs in conjunction with severe class imbalance, and dealing with both problems is a major difficulty. The paper presents an experimental evaluation of selected methods used by researchers to counter class imbalance. Our experiment examines combining three techniques: under-sampling, classification threshold selection, and using an ensemble of classifiers (by averaging their predicted probabilities) to assist the Naive Bayes method to overcome the imbalance problem. The Naive Bayes method is favored because it computes probabilities and thus allows for the assessment of probability estimation. It is common for medical practitioners to rely on probabilistic estimates when making a diagnosis. Initially, this study
* Is affiliated with the Institute of Computer Science, Polish Academy of Sciences.
was presented in [10] (unarchived publication) with preliminary results using an ensemble of exactly ten classifiers. This paper presents similar results with extensive experiments and calculates the number of members in the ensemble based on the class imbalance in the data. This paper also examines the performance with respect to probability scores. Based on our study, our recommendations have successfully been used in [1]. Our results show that combining under-sampling, threshold selection, and ensemble learning is effective in achieving a balance in classification performance on both classes simultaneously in binary domains. Furthermore, the results suggest that adjusting the classification threshold reduces the mean squared errors of probability estimates computed by the Naive Bayes classifier. After a brief review of classification under class imbalance, a description of our experiment design and results follows, and we close with concluding remarks.
2 Classification with Severe Class Imbalance
In the presence of class imbalance, Provost [15] attributes the struggle of machine learning methods to the maximization of accuracy, and to the assumption that a classifier will operate on the same distribution as the training data. Thus, a classifier is likely to predict the majority class [6]. In medical decision-making domains, the minority class is usually the critical class, which we wish to predict with higher sensitivity at little or no cost in specificity. Therefore, accuracy is not only skewed by the imbalance, but is also an inappropriate measure. From the many proposed solutions, we only review those relevant to this paper due to space constraints. However, a comprehensive review of this problem can be found in [9].

Most solutions rely on either adjusting the data for balancing or on modifying the learning algorithm to become sensitive to the imbalance. Data sampling relates to the identification of the correct distribution for a learning algorithm, and the appropriate sampling strategy for a given data set [15,3]. Sampling modifies the class distribution in the training data by increasing or decreasing the frequency of one class. Respectively, these are known as over-sampling and under-sampling. While the former is based on generating instances in the minority class (by duplication or by sophisticated data generation schemes), the latter relies on removing examples in the majority class. However, under-sampling is shown to outperform over-sampling [6], and over-sampling can lead to over-fitting [2]. Alternative solutions modify the learning algorithm to address class imbalance. Cost-based learning is among such techniques, where instances in the minority class are assigned misclassification costs different from those in the majority class [4,5,14,18]. However, determining the cost of one class over the other (particularly in medical domains) is a difficult problem, which we avoid in this paper. We have addressed this issue in the cost-based sampling study published in [11]. Finally, adjusting the probability estimation or adjusting the classification threshold can also help counter the problem [15].
3 Measures against Class Imbalance
To counter the effect of class imbalance, Provost [15] calls for adjusting the decision threshold when the classifier produces probability estimates rather than labels [8], a case in which classifications are made by imposing a threshold on the probabilities. In this study, the threshold selection (TS) is based on maximizing the F-measure, which represents the harmonic mean of precision and recall. While recall represents the true positive rate, precision is the fraction of true positives among those instances predicted as positive by the classifier. Thus, maximizing the F-measure for the minority class leads to maximizing precision and recall, and the performance on both classes is expected to be balanced.

Our under-sampling approach (US) balances the training data by keeping all points in the minority class while randomly selecting, without replacement, an equal proportion of points from the majority class. If the data contains n points and n+ of those are in the minority class, then US selects a balanced sample of s = 2n+/n percent of n without replacement. Effectively, this selects all n+ minority points and a random sample of the majority class of an equal proportion. However, random under-sampling can potentially eliminate information from the training set by excluding instances from the majority class. To counter this potential harm, we construct an ensemble model (EN) consisting of m Naive Bayes (NB) classifiers constructed from various data samples consisting of the entire minority class and an equal, random subset of the majority class. Effectively, members in this ensemble have similar expertise on the minority class but various skills on the majority class. The ensemble of m models is then combined by averaging their predicted probabilities. In our experiment, the number m is based on the class imbalance ratio. For instance, if n+ (the positives) is less than n− (the negatives), then m is the closest integer value to n/n+.
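As an illustration of how US, EN, and TS fit together, here is a minimal Python/scikit-learn sketch. It is not the authors' implementation (their models were built with Weka's Naive Bayes); the function names are ours, and the F-measure-maximizing threshold would normally be chosen on training data only.

import numpy as np
from sklearn.naive_bayes import GaussianNB

def undersample_ensemble(X, y, rng=np.random.default_rng(0)):
    # US + EN: train m Naive Bayes models, each on all minority (positive) points
    # plus an equally sized random subset of the majority class, without replacement.
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    m = max(1, round(len(y) / len(pos)))          # ensemble size, roughly n / n+
    models = []
    for _ in range(m):
        sub = rng.choice(neg, size=len(pos), replace=False)
        idx = np.concatenate([pos, sub])
        models.append(GaussianNB().fit(X[idx], y[idx]))
    return models

def ensemble_proba(models, X):
    # Combine members by averaging their predicted positive-class probabilities.
    return np.mean([nb.predict_proba(X)[:, 1] for nb in models], axis=0)

def best_f_threshold(proba, y):
    # TS: pick the classification threshold that maximizes the F-measure.
    best_t, best_f = 0.5, -1.0
    for t in np.unique(proba):
        pred = proba >= t
        tp = np.sum(pred & (y == 1))
        if tp == 0:
            continue
        precision = tp / np.sum(pred)
        recall = tp / np.sum(y == 1)
        f = 2 * precision * recall / (precision + recall)
        if f > best_f:
            best_t, best_f = t, f
    return best_t

In this sketch, applying all three functions in sequence corresponds to USTSEN, dropping best_f_threshold gives USEN, and an ensemble with a single member reduces to US.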
4 Experiments and Results
Our experiment aims to identify which combinations of TS, US, and EN can help the Naive Bayes method (NB) achieve a balanced classification performance on imbalanced data. By "balanced" we mean equally high performance, as far as possible, on both classes. NB is used because of its ability to produce probabilities. This feature enables us to assess the performance of probability estimates for our models. All our models are trained and tested using the Weka [17] software. We examine the performance of all learning models (there are eight) which combine the TS, US, and EN techniques. The first is NB, which is a single Naive Bayes classifier trained on the original imbalanced data with none of these techniques. This model helps establish a baseline performance. The TS model is a single Naive Bayes whose classification threshold is adjusted to maximize the F-measure. US is a single Naive Bayes classifier trained on a sample of the training set obtained as described in Section 3. USTS is the last single Naive Bayes model and combines both the US and TS techniques. USEN is an ensemble of m Naive Bayes learners where m is defined in Section 3, using under-sampling prior to training
Table 1. The data contain n points, n+ are positive, s is a percentage of n for under-sampling, and m is the number of members of the constructed ensemble

Data   n      n+    s = 2n+/n  m = n/n+  Description
MedD   409    48    21         9         Undisclosed prospective medical data
SPECT  267    55    41         5         SPECT images [16]
Adult  40498  9640  47.6       4.2       Adult data [16]
DIS    3772   58    3          65        Thyroid disease data [16]
8HR    2534   160   12.6       16        Ozone Level Detection [16]
HPT    155    32    41.3       3         Hepatitis Domain [16]
HYP    3163   151   9.5        21        Hypo-thyroid disease [16]
1HR    2536   73    5.8        35        Ozone Level Detection [16]
SIKEU  3163   293   18.5       11        Sick-Euthyroid data [16]
SIK    3772   231   12.2       66        Sick Thyroid disease [16]
each member in the ensemble. Finally, the USTSEN model combines all three techniques together, i.e., USTSEN is an ensemble similar to USEN but with the addition of TS to each member in the ensemble. The remaining combinations, EN and TSEN, are omitted because, in addition to space limitations, they failed to produce performance different from NB and TS, respectively.

Our experiment includes ten binary classification data sets listed in Table 1. They are mostly obtained from the UCI repository [16], with the exception of the MedD data. The latter is a prospectively collected medical dataset that describes an acute patient condition. Unfortunately, we are unable to disclose any details for the MedD data due to intellectual property and privacy issues. The models NB, TS, US, USTS, USEN, and USTSEN are each tested with 10-fold cross-validation runs executed 1000 times. In each run, we record (as a percentage) the sensitivity, the specificity, the mean squared error (MSE), and the area under the ROC curve (AUC). The latter shows little to no change in value, and therefore we omit its results due to space limitations. As a summary, we present the average and standard deviations over the one thousand runs for the remaining metrics. It is important to note that from a medical perspective, the focus of performance is on sensitivity first and on specificity second.

The sensitivity and specificity results are shown in Table 2. Values in bold indicate higher sensitivity than specificity. In the top part of the table, the NB and TS models suffer from the overwhelming negatives. In most cases, their specificities are much higher than their sensitivities. The US model shows a clear improvement in sensitivity while compromising little specificity. However, the USTS, USEN, and USTSEN models achieve a clear advantage and consistently produce higher sensitivity with reasonable specificity. Moreover, the ensemble models USEN and USTSEN show lower standard deviations than US and USTS. If we consider sensitivity and specificity, TS alone fails to counter the imbalance, but when combined with US (USTS) the performance improves. So what is TS good for? Consider the MSE results shown in Table 3, where bold values are the lowest. TS achieves lower MSE scores, indicating better probability estimates, which are particularly useful in medical domains. They represent
Table 2. Average Sensitivity / Average Specificity

Data    NB                     TS                     US
MedD    66.8±0.7 / 91.8±0.3    60.3±3.5 / 92.5±1.0    82.3±3.1 / 86.4±0.8
SPECT   76.3±0.3 / 79.6±0.6    66.3±2.6 / 86.3±1.0    80.2±1.9 / 70.0±1.7
Adult   51.6±0.1 / 93.3±0.0    77.8±0.5 / 82.3±0.3    60.1±0.3 / 90.7±0.1
DIS     45.1±3.5 / 96.7±0.1    60.0±3.5 / 95.8±0.2    83.8±2.7 / 61.9±5.0
8HR     85.0±0.3 / 66.5±0.2    48.7±1.4 / 89.8±0.4    85.5±0.5 / 64.1±0.4
HPT     70.1±1.7 / 87.3±0.8    68.1±3.7 / 87.2±1.7    80.0±3.7 / 81.4±2.1
HYP     77.4±0.7 / 98.9±0.1    77.9±1.4 / 98.8±0.1    93.3±1.5 / 96.5±0.3
1HR     81.8±1.1 / 70.5±0.2    51.7±1.3 / 91.3±0.3    83.3±1.4 / 67.9±0.6
SIKEU   89.6±0.4 / 83.6±0.2    65.6±1.5 / 96.4±0.3    92.5±0.2 / 68.9±1.1
SIK     77.6±0.6 / 93.7±0.1    59.1±1.4 / 98.0±0.2    89.7±0.8 / 82.3±1.0

Data    USTS                   USEN                   USTSEN
MedD    87.7±3.1 / 81.7±1.7    82.2±2.2 / 86.4±0.4    88.6±1.8 / 83.9±0.6
SPECT   77.9±3.9 / 70.3±5.5    80.0±1.6 / 69.9±1.0    78.1±1.8 / 71.3±2.1
Adult   91.1±0.3 / 70.0±0.4    60.1±0.1 / 90.7±0.0    90.2±0.2 / 71.4±0.2
DIS     77.4±3.2 / 83.4±3.7    82.8±1.2 / 62.9±0.9    80.4±1.2 / 83.5±0.5
8HR     85.6±1.6 / 62.6±2.6    85.5±0.3 / 64.0±0.2    85.3±0.4 / 64.4±0.3
HPT     79.8±4.4 / 78.0±3.3    80.7±2.9 / 81.1±1.2    80.8±3.0 / 79.8±1.4
HYP     96.7±1.1 / 94.8±0.4    94.1±0.7 / 96.6±0.1    96.9±0.6 / 95.5±0.1
1HR     87.9±3.2 / 60.2±2.9    83.3±0.9 / 67.8±0.2    84.6±1.1 / 65.9±0.3
SIKEU   88.2±1.1 / 85.2±1.0    92.5±0.1 / 68.9±0.4    90.3±0.5 / 82.7±0.4
SIK     87.3±1.2 / 86.6±1.0    89.5±0.3 / 82.9±0.2    88.2±0.4 / 86.4±0.2
Table 3. Average Mean Squared Error (MSE)

Data    NB        TS        US        USTS      USEN      USTSEN
MedD    8.2±0.1   8.0±0.4   11.6±0.5  13.5±0.8  11.3±0.2  13.0±0.3
SPECT   17.7±0.2  14.8±0.4  22.8±0.9  22.2±2.2  22.4±0.4  20.9±0.9
Adult   13.9±0.0  12.8±0.0  13.3±0.1  14.7±0.1  13.2±0.0  14.7±0.0
DIS     3.7±0.0   4.1±0.1   27.4±3.6  15.7±2.4  23.9±0.4  13.6±0.2
8HR     31.7±0.1  15.0±0.7  33.7±0.4  33.2±1.0  33.3±0.1  32.1±0.3
HPT     13.7±0.5  13.8±0.7  16.4±1.3  17.4±1.4  15.1±0.7  15.7±0.8
HYP     1.8±0.0   1.8±0.0   3.0±0.2   3.7±0.2   2.7±0.0   3.4±0.0
1HR     28.5±0.2  10.2±0.9  30.9±0.6  32.8±1.2  30.0±0.2  31.4±0.3
SIKEU   11.2±0.1  5.8±0.1   22.8±0.9  12.9±0.6  22.5±0.3  12.7±0.2
SIK     5.3±0.0   3.6±0.1   13.4±0.7  11.0±0.6  12.7±0.1  10.4±0.1
an estimate of how likely it is that a patient belongs to the positive class, in this case the minority class. Models using under-sampling (US, USTS, USEN, USTSEN) produce higher MSE scores. They are able to counter the class imbalance for classification but not necessarily for probability estimation. In addition, USEN and USTSEN show lower MSE deviations than US and USTS. Although under-sampling increases the standard deviation due to the random exclusion of data points, the construction of an ensemble model seems to provide a remedy.
5 Conclusions
Combining under-sampling with threshold selection while using a voted ensemble successfully shifts the focus of Naive Bayes to the minority class. This combination builds an effective model when dealing with severe class imbalance. Sampling increases performance deviations but the ensemble provides a remedy. Adjusting the classification threshold alone fails to counter the imbalance for classification, but it succeeds for probability estimation by reducing MSE. Future experiments may include other models, e.g., decision trees or rule-based methods.
References

1. Błaszczyński, J., Deckert, M., Stefanowski, J., Wilk, S.: Integrating Selective Pre-processing of Imbalanced Data with Ivotes Ensemble. In: Szczuka, M., Kryszkiewicz, M., Ramanna, S., Jensen, R., Hu, Q. (eds.) RSCTC 2010. LNCS, vol. 6086, pp. 148–157. Springer, Heidelberg (2010)
2. Chawla, N.V., Japkowicz, N., Kolcz, A. (eds.): ICML 2003 Workshop on Learning from Imbalanced Data Sets (2003)
3. Chawla, N.V., Cieslak, D.A., Hall, L.O., Joshi, A.: Automatically countering imbalance and its empirical relationship to cost. Data Mining and Knowledge Discovery 17(2), 225–252 (2008)
4. Domingos, P.: MetaCost: A general method for making classifiers cost-sensitive. In: KDD 1999, pp. 155–164 (1999)
5. Drummond, C., Holte, R.C.: Exploiting the cost (in)sensitivity of decision tree splitting criteria. In: ICML 2000, pp. 239–246 (2000)
6. Drummond, C., Holte, R.C.: Severe Class Imbalance: Why Better Algorithms Aren't the Answer. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS (LNAI), vol. 3720, pp. 539–546. Springer, Heidelberg (2005)
7. Fawcett, T., Provost, F.: Adaptive fraud detection. Data Mining and Knowledge Discovery (1), 291–316 (1997)
8. Flach, P.A., Matsubara, E.T.: A Simple Lexicographic Ranker and Probability Estimator. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 575–582. Springer, Heidelberg (2007)
9. He, H., Garcia, E.A.: Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering 21(9), 1263–1284 (2009)
10. Klement, W., Wilk, S., Michalowski, W., Matwin, S.: Dealing with Severely Imbalanced Data. In: ICEC 2009 Workshop, PAKDD 2009 (2009)
11. Klement, W., Flach, P., Japkowicz, N., Matwin, S.: Cost-based Sampling of Individual Instances. In: Canadian AI 2009, pp. 86–97 (2009)
12. Kubat, M., Holte, R.C., Matwin, S.: Machine learning for the detection of oil spills in satellite radar images. Machine Learning (30), 195–215 (1998)
13. Lewis, D.D., Catlett, J.: Heterogeneous uncertainty sampling for supervised learning. In: ICML 1994, pp. 179–186 (1994)
14. Margineantu, D.: Class probability estimation and cost-sensitive classification decisions. In: Elomaa, T., Mannila, H., Toivonen, H. (eds.) ECML 2002. LNCS (LNAI), vol. 2430, pp. 270–281. Springer, Heidelberg (2002)
15. Provost, F.: Learning with Imbalanced Data Sets 101. Invited paper for the AAAI 2000 Workshop on Imbalanced Data Sets (2000)
16. Hettich, S., Blake, C.L., Merz, C.J.: UCI Repository of machine learning databases. University of California, Irvine, Dept. of Information and Computer Sciences (1998), http://www.ics.uci.edu/~mlearn/MLRepository.html
17. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
18. Zadrozny, B., Langford, J., Abe, N.: Cost-Sensitive Learning by Cost-Proportionate Example Weighting. In: IEEE ICDM 2003 (2003)
Simulating Cognitive Phenomena with a Symbolic Dynamical System

Othalia Larue

GDAC Research Laboratory - Computer Science Department
Université du Québec à Montréal
C.P. 8888, succursale Centre-ville, Montréal, QC, H3C 3P8
[email protected]
Abstract. We present a new tool: a symbolic dynamical approach to the simulation of cognitive processes. The Complex Auto-Adaptive System is a symbolic dynamical system implemented in a multi-agent system. We describe our methodology and prove our claims by presenting simulation experiments on the Stroop and Wason tasks. We then explain our research plan: the integration of an emotion model in our system, the implementation of a higher-level control organization, and the study of its application to cognitive agents in general. Keywords: Cognitive Simulation, Cognitive architecture, Symbolic dynamism.
1 Introduction
Complex Auto-Adaptive System (CAAS) [1] is a multi-agent system that models the dynamics of knowledge activation and suppression as the system performs tasks or solves problems. Our aim is to introduce CAAS and motivate it as a simulation tool for cognitive scientists. There are already simulations of cognitive processes in neural networks [2]; however, the opaque nature of neural processing makes it difficult to understand how the psychological processes emerge from the neural dynamics. Contrary to neural networks, and similarly to what the system presented here does in part, production systems [3] use a symbolic approach for the simulation of cognitive processes, making it easier to interpret activity in the system; but the sequentiality inherent to fixed mechanisms for rule selection limits the simulation of cognitive processes. CAAS implements a symbolic dynamical system [4], which we believe to be a hybrid alternative since it allows the use of discrete-symbolic representations but exhibits dynamical change in time. In CAAS, symbolic dynamics refers to the dynamical, real-time interaction of populations of minimal symbolic agents, out of which emerges a continuously changing geometrical representation of the environment. We cannot claim the fine-grained neurological plausibility of some neural networks, but CAAS is neurocognitively plausible at the higher (functional) level of gross neurological structure and it does bring to our simulations the readability that neural network simulations lack. Thanks to the similarities between
the mesoscopic dynamical activity of brains and CAAS dynamical activity, our simulations can also account for phenomena that are reproduced in a limited way in production systems.
2 Methodology
CAAS is implemented in a multi-agent system featuring a reactive agent architecture. Agents in CAAS are assigned to five organisations depending on their roles in the system: frontal agents, structuring agents, morphological agents, analysis agents, and PSE agents (Pattern of Specific Agents), each having the same basic structure (a communication module, a type and a state defined by its level of activation). The frontal agents extract knowledge from inputs to the system. Their state of activation is determined by the presence of information as input usable by the software or library they use, and by the structuring agents, who send top-down perceptual control messages to them. The structuring agents bear the knowledge of the system (its knowledge base). Each structuring agent is assigned as its role one term (knowledge item) from the ontology (general or specific to the task). States of activation are determined by the amount of messages exchanged between agents and organisations. The morphological agents generate statistics about communications between the structuring agents and regulate the activation of the structuring agents. The analysis agents produce graphs of the communications between the agents with those statistics. Finally, the PSE agents orient the organisation of communication between the structuring agents. Each PSE agent is assigned a global shape (morphology) as its role.

CAAS functions by comparing geometrical representations of the activity of its symbol-bearing Structuring agents with geometrical representations that serve as its goals in order to alter the activation of the agents of the Structuring organisation. The Structuring organisation, however, is not only determined by the Frontal agents, but also by the PSE agents, which get Morphological agents to send activity-promoting or inhibiting signals to the Structuring agents according to which goal or goals are currently active (given to the system by its user). The cognitive control system located in the DLPFC is, in our system, the PSE organisation. The information processed in the posterior brain regions is represented in our system by the regulating messages sent from the Morphological organisation to the Structuring organisation through the goal-oriented behaviour of the PSE organisation. With the following experiments, the Stroop and Wason tasks, we demonstrate the cognitive plausibility of our system in action, as well as its limitations.
3 Experiments and Results
3.1 Stroop Task
The Stroop task is a standard psychology test that measures interference between competing cognitive processes. We have already implemented a classical
Stroop task, and provide an accurate simulation of the effects of time and of working memory variations on cognitive control when "weakening" the system (by reducing the messages of regulation sent by the Morphological agents to the Structuring agents). We observe disorders similar to those reported in the literature concerned with the functional understanding of cognitive disorders and differences in individuals. Such experiments are a clear demonstration of our previous claim about the ability of the system to correctly reproduce the mesoscopic dynamical activity. Here, in the Stroop task, failure in the interaction between the Morphological agents and the Structuring agents and weakening of regulation messages in the system mirror the failure of the response conflict monitoring system, executed in the anterior cingulate cortex (ACC), and of the cognitive control system located in the dorsolateral prefrontal cortex (DLPFC). The response conflict monitoring system detects conflicts due to interference between processes. The cognitive control system modifies information processing in posterior brain regions to reduce the conflicts thus found.

3.2 Wason Task
The Wason task is a common selection task in the psychology of reasoning, testing the subject's ability to search for counter-examples to test a hypothesis. In order to further define the limits of the present system, and what we will need to add to the system to upgrade it to an efficient simulation tool, we implemented this test. Our first experiment is a classic version of the Wason card selection task [7], where subjects are asked which card(s) they believe must be turned in order to verify a conditional statement such as "If there is an A on one side of the card, then there is a 3 on the other side". Otherwise, we used the same experimental setting as that of the Stroop task, using a one-way link in the ontology to specify rules to the system. Currently able to perform only Modus Ponens (affirming the antecedent) and not Modus Tollens (denying the consequent), the system unsurprisingly reproduced the error that most humans make: asked which card(s) it would turn to verify the rule, it correctly answered "A" but never answered "7" as logic mandates. To help the system see that, we made the negation explicit. In the second experiment, we thus added four elements to the ontology: "P", "Q", "notP", "notQ", and linked the Structuring agents as follows: "A"-"P"; "3"-"Q"; "D"-"notP"; "7"-"notQ". Furthermore, in order to make the need to verify the contraposed rule explicit in the system's "Verification" goal, we added a one-way link from "notQ" to "notP". Using the same input as in the previous experiment, the system was now able to answer correctly that cards "A" and "7" should be turned. This method was very artificial, and, as we will point out in our plan of research, we would like to give the system the ability to create/abstract this type of link itself.
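As an aside, the selection logic that these ontology links encode can be rendered in a few lines of Python (this is an illustration added here, not part of CAAS; all names are ours):

# Forward (Modus Ponens) link: P -> Q is the rule itself; the link notQ -> notP
# added in the second experiment makes the contrapositive (Modus Tollens) explicit.
links = {"P": "Q", "notQ": "notP"}
cards = {"A": "P", "3": "Q", "D": "notP", "7": "notQ"}

# A card must be turned iff its visible side is the antecedent of some link,
# since only then can the hidden side falsify the rule.
to_turn = [card for card, side in cards.items() if side in links]
print(to_turn)  # ['A', '7']; with only the P -> Q link the answer is just ['A']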
4 Plan for Research
Our first research approach was to identify the strengths and limitations of the present system. The method we employed was the simulations of various
cognitive tasks (see section above). The Stroop task [5] simulations proved the validity of the approach, but also showed that we are not currently able to incorporate emotional materials in our Stroop task simulations. The Wason task [7] emphasized another missing aspect of our current approach: a higher-level organization, with which we would not have to resort to the artificial manipulations used in our simulations. CAAS was primarily developed for engineering purposes. But we believe that its properties are particularly suitable to cognitive simulations, and we are thus adapting it for this purpose by the incorporation of emotions and reasoning, important dimensions in cognitive processes. Such additions are also compatible with a parallel thread of our research: implementation of CAAS in cognitive agents.

Therefore, a first research topic is the addition of emotions in CAAS. One way of doing this could be the addition of emotional structuring agents related to other semantic Structuring agents. Emotional structuring agents could be a way to implement in our system the type of neuromodulation involved in human emotions. Neuromodulation could thus be implemented in the system by variations in the regulation messages sent to these agents.

A second research topic is the development of the Higher Level Organization (HLO). Our system needs to be further developed in order to have the ability to construct a knowledge base and generate reasoning models [6] (similar perhaps to Johnson-Laird's mental models). Reasoning models could be generated at a level on top of the PSE agents. Using information provided by analysis agents to PSE agents, the HLO could refine the mental models. These reasoning models could be new goals sent to the PSE agents, which the HLO would be able to link to task-specific goals. In the processing of a Wason task, we would be able to witness the construction of a wrong model which could be corrected by the presence of new information in the environment of the system (i.e., a change in the wording of the task). A new higher-level organisation would help to abstract a style of configuration among agents (its shape) and transformations upon them (for example, negation). We would thus be able to propose new simulations in cognitive psychology and the psychology of reasoning.
References

1. Camus, M.: Morphology Programming with an Auto-Adaptive System. In: ICAI 2008 (2008)
2. Eliasmith, C.: Dynamics, control, and cognition. In: Robbins, P., Aydede, M. (eds.) Cambridge Handbook of Situated Cognition. CUP, Oxford (2009)
3. Roelofs, A.: Goal-referenced selection of verbal action. Psychological Review 110, 88–125 (2003)
4. Dale, R., Spivey, M.J.: From apples and oranges to symbolic dynamics. JETAI 26, 317–342 (2005)
5. Stroop, J.R.: Studies of interference in serial verbal reactions. J. of Exp. Psy. 18, 643–662 (1935)
6. Johnson-Laird, P., Byrne, R.: Deduction. Lawrence Erlbaum Associates, Mahwah
7. Wason, P.C.: Natural and contrived experience in a reasoning problem. In: Foss, B.M. (ed.) New Horizons in Psychology. Penguin, NY (1966)
Finding Small Backdoors in SAT Instances

Zijie Li and Peter van Beek

Cheriton School of Computer Science
University of Waterloo
Waterloo, Ontario, Canada N2L 3G1
Abstract. Although propositional satisfiability (SAT) is NP-complete, state-of-the-art SAT solvers are able to solve large, practical instances. The concept of backdoors has been introduced to capture structural properties of instances. A backdoor is a set of variables that, if assigned correctly, leads to a polynomial-time solvable sub-problem. In this paper, we address the problem of finding all small backdoors, which is essential for studying value and variable ordering mistakes. We discuss our definition of sub-solvers and propose algorithms for finding backdoors. We experimentally compare our proposed algorithms to previous algorithms on structured and real-world instances. Our proposed algorithms improve over previous algorithms for finding backdoors in two ways. First, our algorithms often find smaller backdoors. Second, our algorithms often find a much larger number of backdoors.
1 Introduction
Propositional satisfiability (SAT) is a core problem in AI. The applications are numerous and include software and hardware verification, planning, and scheduling. Even though SAT is NP-complete in general, state-of-the-art SAT solvers can solve large, practical problems with thousands of variables and clauses. To explain why current SAT solvers scale well in practice, Williams, Gomes, and Selman [13,14] propose the concept of weak and strong backdoors to capture structural properties of instances. A weak backdoor is a set of variables for which there exists a value assignment that leads to a polynomial-time solvable sub-problem. For a strong backdoor, every value assignment should lead to a polynomial-time solvable sub-problem. In this paper, we address the problem of finding all small backdoors in SAT instances. A small backdoor is a backdoor such that no proper subset is also a backdoor. This problem is important for studying problem hardness, which is generally represented as the time used or the number of nodes extended by a SAT solver. In addition, identifying all small backdoors is a first step to investigating how value and variable ordering mistakes affect the performance of backtracking algorithms—the ultimate goal of our research. A variable ordering heuristic can make a mistake by selecting a variable not in the appropriate backdoor. A value ordering heuristic can make a mistake by assigning the backdoor variable a value that does not lead to a polynomial sub-problem.
Backdoors are defined with respect to sub-solvers, which in turn can be defined algorithmically or syntactically. Algorithmically defined sub-solvers are polynomial-time techniques of current SAT solvers, such as unit propagation. Syntactically defined sub-solvers are polynomial-time tractable classes, such as 2SAT and Horn. The size of backdoors with respect to purely syntactically defined sub-solvers is relatively large. On the other hand, it is possible that a simplified sub-problem is polynomial-time solvable before an algorithmically defined sub-solver finds a solution. Therefore, we propose a sub-solver that first applies unit propagation, and then checks polynomial-time tractable classes. We propose both systematic and local search algorithms for finding backdoors. The systematic search algorithms are guaranteed to find all minimal sized backdoors but are unable to handle large instances. Kilby, Slaney, Thiébaux, and Walsh [7] propose a local search algorithm to find small weak backdoors. Building on their work, we propose two local search algorithms for finding small backdoors. Our first algorithm incorporates our definition of sub-solver with Kilby et al.'s algorithm. Our second algorithm is a novel local search technique. We experiment on large real-world instances, including the instances from SATRace 2008, to compare our proposed algorithms to previous algorithms. Our algorithms based on our proposed sub-solvers can find smaller backdoors and significantly larger numbers of backdoors than previous algorithms.
2 Background
In this section, we review the necessary background in propositional satisfiability and backdoors in SAT instances. We consider propositional formula in conjunctive normal form (CNF). A literal is a Boolean variable or its negation. A clause is a disjunction of literals. A clause with one literal is called a unit clause and the literal in the unit clause is called a unit literal. A propositional formula F is in conjunctive normal form if it is a conjunction of clauses. Given a propositional formula in CNF, the problem of determining whether there exists a variable assignment that makes the formula evaluate to true is called the propositional satisfiability problem or SAT. Propositional satisfiability is often solved using backtracking search. A backtracking search for a solution to a SAT instance can be seen as performing a depth-first traversal of a search tree. The search tree is generated as the search progresses and represents alternative choices that may have to be examined in order to find a solution or prove that no solution exists. Exploring a choice is also called branching and the order in which choices are explored is determined by a variable ordering heuristic. When specialized to SAT solving, backtracking algorithms are often referred to as being DPLL-based, in honor of Davis, Putnam, Logemann, and Loveland, the authors of one of the earliest works in the field [1]. Let F denote a propositional formula. We use the value 0 interchangeably with false and the value 1 interchangeably with true. The notation F [v = 0] represents a new formula, called the residual formula, obtained by removing all
clauses that contain the literal ¬v and deleting the literal v from all clauses. Similarly, the notation F [v = 1] represents the residual formula obtained by removing all clauses that contain the literal v and deleting the literal ¬v from all clauses. Let aS be a set of assignments. The residual formula F [aS ] is obtained by cumulatively reducing F by each of the assignments in aS .

Example 1. For example, the formula, F = (x ∨ ¬y) ∧ (x ∨ y ∨ z) ∧ (y ∨ ¬z ∨ w) ∧ (¬w ∨ ¬z ∨ v) ∧ (¬v ∨ u), is in CNF. Suppose x is assigned false. The residual formula is given by, F [x = 0] = (¬y) ∧ (y ∨ z) ∧ (y ∨ ¬z ∨ w) ∧ (¬w ∨ ¬z ∨ v) ∧ (¬v ∨ u).

As is clear, a CNF formula is satisfied only if each of its clauses is satisfied and a clause is satisfied only if at least one of its literals is equivalent to true. In a unit clause, there is no choice and the value of the literal is said to be forced. The process of unit propagation repeatedly assigns all unit literals the value true and simplifies the formula (i.e., the residual formula is obtained) until no unit clause remains or a conflict is detected. A conflict occurs when implications for setting the same variable to both true and false are produced.

Example 2. Consider again the formula F [x = 0] given in Example 1. The unit clause (¬y) forces y to be assigned 0. The residual formula is, F [x = 0, y = 0] = (z) ∧ (¬z ∨ w) ∧ (¬w ∨ ¬z ∨ v) ∧ (¬v ∨ u). In turn, the unit clause (z) forces z to be assigned 1. Similarly, the assignments w = 1, v = 1, and u = 1 are forced.

Williams, Gomes, and Selman [13] formally define weak and strong backdoors. The definitions rely on the concept of a sub-solver A that, given a formula F , in polynomial time either rejects the input or correctly solves F . A sub-solver can be defined either algorithmically or syntactically. For example, a DPLL-based SAT solver can be modified to be an algorithmically defined sub-solver by using just unit propagation, and returning "reject" if branching is required, "unsatisfiable" if a contradiction is encountered, and "satisfiable" if a solution is found. Examples of tractable syntactic classes include 2SAT, Horn, anti-Horn, and RHorn formulas. A formula is 2SAT if every clause contains at most two literals, Horn if every clause has at most one positive literal, anti-Horn if every clause has at most one negative literal, and renamable Horn (RHorn) if it can be transformed into Horn by a uniform renaming of variables. A weak backdoor is a subset of variables such that some value assignment leads to a polynomial-time solvable sub-problem.

Definition 1 (Weak Backdoor). A nonempty subset S of the variables is a weak backdoor in F for a sub-solver A if there exists an assignment aS to the variables in S such that A returns a satisfying assignment of F [aS ].

Example 3. Consider once again the formula F [x = 0] given in Example 2. After unit propagation every variable has been assigned a value and the formula F is satisfied. Hence, x is a weak backdoor in F with respect to unit propagation.

A strong backdoor is a subset of variables such that every value assignment leads to a polynomial-time solvable sub-problem.
Table 1. Summary of previous experimental studies, where DPLL means the sub-solver used was defined algorithmically based on a DPLL solver; otherwise the sub-solver was defined syntactically using the given tractable class

                      Sub-solvers         Instance domains
Williams et al. [13]  DPLL                structured
Interian [6]          2SAT, Horn          random 3SAT
Dilkina et al. [2]    DPLL, Horn, RHorn   graph coloring, planning, game theory, automotive configuration
Paris et al. [9]      RHorn               random 3SAT, SAT competition
Kottler et al. [8]    2SAT, Horn, RHorn   SAT competition, automotive configuration
Samer & Szeider [11]  Horn, RHorn         automotive configuration
Ruan et al. [10]      DPLL                quasigroup completion, graph coloring
Kilby et al. [7]      DPLL                random 3SAT
Gregory et al. [4]    DPLL                planning, graph coloring, quasigroup
Dilkina et al. [3]    DPLL                planning, circuits
Definition 2 (Strong Backdoor). A nonempty subset S of the variables is a strong backdoor in F for a sub-solver A if for all assignments aS to the variables in S, A returns a satisfying assignment or concludes unsatisfiability of F [aS ].

A minimal backdoor for an instance is a backdoor S such that for every other backdoor S′, |S| ≤ |S′|. A small backdoor refers to a backdoor S such that no proper subset of S is also a backdoor. A minimal backdoor can be viewed as a global minimum, and a small backdoor can be viewed as a local minimum.
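As a concrete illustration of the residual formula and unit propagation described in this section (not code from the paper; the clause-list encoding, with signed integers as literals, is our choice):

def reduce_formula(clauses, var, value):
    # Residual formula F[var = value]: drop clauses satisfied by the assignment
    # and delete the falsified literal from the remaining clauses.
    lit_true = var if value else -var
    residual = []
    for clause in clauses:
        if lit_true in clause:
            continue                       # clause satisfied, drop it
        residual.append([l for l in clause if l != -lit_true])
    return residual

def unit_propagate(clauses):
    # Repeatedly assign unit literals and simplify; return (residual, assignment),
    # or None if a conflict (an empty clause) is derived.
    assignment = {}
    while True:
        if any(len(c) == 0 for c in clauses):
            return None
        units = [c[0] for c in clauses if len(c) == 1]
        if not units:
            return clauses, assignment
        lit = units[0]
        assignment[abs(lit)] = lit > 0
        clauses = reduce_formula(clauses, abs(lit), lit > 0)

Encoding Example 1 with x, y, z, w, v, u numbered 1 to 6, F becomes [[1, -2], [1, 2, 3], [2, -3, 4], [-4, -3, 5], [-5, 6]]; calling unit_propagate(reduce_formula(F, 1, False)) assigns every remaining variable and leaves no clauses, which is exactly the weak-backdoor check of Example 3 for the set {x}.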
3 Related Work
In this section, we review previous work on algorithms for finding weak and strong backdoors in SAT and their experimental evaluation (see Table 1). In terms of algorithms, Williams, Gomes, and Selman [13,14] present a systematic algorithm which searches every subset of variables for a backdoor. We modify this algorithm to find minimal backdoors. Interian [6] and Kilby, Slaney, Thiébaux, and Walsh [7] propose a local search algorithm for finding backdoors. Our algorithms build on Kilby et al.'s. We discuss this all in detail in Section 4. In terms of experimental results, Dilkina, Gomes, and Sabharwal [2] show that strong Horn backdoors can be considerably larger than strong backdoors with respect to DPLL sub-solvers. Kottler, Kaufmann, and Sinz [8] compare 2SAT, Horn and RHorn and find that RHorn usually results in smaller backdoors. Samer and Szeider [11] compare Horn and RHorn strong backdoors and find as well that RHorn gives smaller backdoors. In general, previous experimental results show that backdoors with respect to syntactically defined sub-solvers have larger sizes than backdoors with respect to algorithmically defined sub-solvers. In contrast to previous work, which considers DPLL, 2SAT, Horn, and RHorn sub-solvers independently, we combine syntactic and algorithmic sub-solvers into a single sub-solver. We also propose improved algorithms and experimentally evaluate our proposals on larger, more varied instances.
Table 2. Algorithms for finding weak and strong backdoors

Exact     Our exact algorithm for finding minimal weak backdoors in satisfiable instances;
Strong    Our exact algorithm for finding minimal strong backdoors in unsatisfiable instances;
Kilby     Kilby et al.'s [7] local search algorithm for finding small weak backdoors;
KilbyImp  Kilby et al.'s [7] algorithm that incorporates our definition of sub-solver;
Tabu      Our proposed local search algorithm for finding small weak backdoors.
4 Algorithms for Finding Backdoors
In this section, we introduce how we define sub-solvers and describe several algorithms for finding backdoors (see Table 2). In our proposed framework, we define the sub-solver both algorithmically and syntactically. Specifically, given a partial assignment to a subset of variables S, we first apply unit propagation and then check the following conditions to see if the resulting formula F belongs to a polynomial-time tractable class:

1. if F is satisfied;
2. if F is 2SAT;
3. if F is satisfied after assigning 0 (false) to every unassigned variable;
4. if F is satisfied after assigning 1 (true) to every unassigned variable.
If one of the above conditions is true, then S is a backdoor set. The first two conditions are trivial. The third condition covers (a superset of) Horn formulas, while the last condition covers (a superset of) anti-Horn formulas. If F is a Horn formula, it can be satisfied by assigning 0 to all variables unless it has unit clauses with a single positive literal. However, after unit propagation, F is guaranteed to have at least two unassigned literals in each clause. A similar reasoning applies if F is anti-Horn. In specifying our algorithms, we make use of the following low-level procedures, where F is a CNF formula and v is a Boolean value.

isSatisfied(F)   return true iff F is already satisfied;
is2SAT(F)        return true iff F is a 2SAT formula;
isSat2SAT(F)     return true iff F is a satisfiable 2SAT formula;
setVal(F, v)     return true iff F is satisfied after assigning v to every unassigned variable.
return true iff F is already satisfied; return true iff F is a 2SAT formula; return true iff F is a satisfiable 2SAT formula; return true iff F is satisfied after assigning v to every unassigned variable.
Exact Algorithms: Exact and Strong
We describe exact algorithms, which are suitable for small instances with small backdoors. Algorithm Exact takes as input a formula F and finds all backdoors of size at most k by performing an exhaustive search. We run algorithm Exact with k = 1, 2, . . . , n − 1, until minimal backdoors are found. The algorithm calls procedure expand(V, S, k), which explores the variables of F in a depth-first manner. Given a set of variables V , a set of minimal backdoors
S, and a positive integer k, the procedure returns true iff V is a backdoor and as a side-effect, S is updated. Given a value assignment to the variables in V , V is a backdoor if there is no conflict after unit propagation and the resulting formula F is in one of the polynomial-time tractable classes. The procedure recursively calls itself with one more variable added to V and k − 1.

Algorithm: Exact(F, k)
  S ← ∅;
  for i ← 0 to n − 1 do
    expand({xi}, S, k);
  return S;

Procedure: expand(V, S, k)
  foreach value assignment aV of V do
    if unit propagation of aV does not result in conflicts then
      if isSatisfied(F) ∨ isSat2SAT(F) ∨ setVal(F, 0) ∨ setVal(F, 1) then
        S ← S ∪ V;
        return true;
  if k ≤ 1 then return false;
  j ← index of the last variable in V;
  for i ← (j + 1) to n − 1 do
    expand(V ∪ {xi}, S, k − 1);
  return false;
The Exact algorithm can easily be modified to give algorithm Strong, which finds minimal strong backdoors in unsatisfiable instances. The idea is that if every value assignment to a set of variables V results in conflicts during unit propagation, then we are able to conclude the unsatisfiability of the instance. Thus, V is added to the list of strong backdoors.

4.2 Local Search Algorithms: Kilby, KilbyImp, and Tabu
The exact algorithms based on depth-first search are complete, but do not scale up to instances with larger backdoors. Here we discuss local search algorithms. In the local search algorithms each search state s is a backdoor, and the cost of a node s is the cardinality of the backdoor s. Kilby et al. [7] propose algorithm Kilby for finding small weak backdoors using local search. Given a formula F , the DPLL solver Satz-rand is first used to solve F , recording the set W of branching literals and the solution M . The set W is an initial backdoor as Satz-rand is able to solve F [W ] without branching. Then, algorithm Kilby takes the inputs F , W , and M to find small backdoors. The set B is the current smallest backdoor. The algorithm has three constants: RestartLimit, which controls the number of restarts, a technique for escaping from local minima; IterationLimit, which controls the amount of search between restarts; and CardMult, which defines the neighbors of the current candidate backdoor W . In each iteration, the algorithm randomly selects from M a set Z of |W |×CardMult literals that are not in W . The set Z of literals is appended to W , and procedure minWeakBackdoor is called to reduce the set W ∪ Z of literals into a small backdoor, which is the next search state.
Finding Small Backdoors in SAT Instances
275
Algorithm: Kilby(F, W, M ) S ← ∅, B ← W ; √ RestartLimit ← 2; RestartCount ← 0; IterationLimit ← n × 3; CardMult ← 2; while RestartCount < RestartLimit do RestartCount ← RestartCount + 1; W ← B; for i ← 0 to IterationLimit do Z ← |W | × CardMult literals chosen randomly from M \ W ; W ← minWeakBackdoor(F , W ∪ Z); if |W | ≤ |B| then S ← S ∪ W ; if |W | < |B| then B ← W ; RestartCount ← 0; return S; Procedure: minWeakBackdoor(F, I)
1
W ← ∅; while I = ∅ do Choose literal l ∈ I; I ← I \ {l}; if DPLL applied to F [W ∪ I] requires branching then // The following if statement is added in our sub-solver; if ¬is2SAT(F ) ∧ ¬setVal(F, 0) ∧ ¬setVal(F, 1) then W ← W ∪ {l}; return W ;
Kilby et al. use a simple sub-solver, which applies Satz-rand’s unit propagation. We modify their algorithm Kilby to use the more sophisticated sub-solver we define. Algorithm KilbyImp is the local search algorithm that results from adding Line 1 in procedure minWeakBackdoor. One further difference is that our DPLL solver is Minisat, where Minisat has a powerful pre-processor. We also propose a novel algorithm Tabu, which uses local search techniques, including Tabu Search, a best improvement strategy, and auxiliary local search. The search state W is the current candidate backdoor, and tabuList is a list of previously visited search states. The tabu tenure is set to 30 to prevent our Tabu from revisiting the last 30 search states. When the tabu list is full, the oldest state is replaced by the new state. The procedure searchNeighbors(W, S, M ) evaluates the neighborhood of W and updates W with the best improving neighbor not in tabuList. The while loop stops if no new small backdoors have been found in the last RestartLimit iterations. The procedure localImprovement(S, M ) is an auxiliary local search over the neighborhood of newly found small backdoors. The procedure searchNeighbors(W, S, M ) explores all IterationLimit neighbors of the current backdoor W to find a best non-tabu candidate backdoor. This is in contrast to Algorithm Kilby, which selects the first neighbor s encountered in the neighborhood of s without considering the cost of s ; i.e., |s |. The value of minCost is the minimal size of backdoors in Neighbor . If minCost is no larger than the size of the current smallest backdoor, then all the backdoors in Neighbor of size minCost are added to the list of small backdoors S. A small backdoor of size minCost is randomly selected from Neighbor to be the
276
Z. Li and P. van Beek
next search state. When minCost is larger than the size of the current smallest backdoor, the search can escape from local minima by making worse moves. If every non-tabu candidate backdoor in Neighbor has a larger size than the current smallest backdoor, the search moves to a best candidate backdoor from Neighbor . Algorithm: Tabu(F, W, M ) W ← minWeakBackdoor(F , W ); preSize ← |S|; RestartLimit ← 2; RestartCount ← 0; tabuList ← ∅; while RestartCount < RestartLimit do RestartCount ← RestartCount + 1; cost ← searchNeighbors(W , S, M ); if cost = 0 then break; tabuList ← tabuList ∪ W ; if |S| > preSize then RestartCount ← 0; preSize ← |S|; tabuList ← ∅; localImprovement(S, M ); return S; Procedure: searchNeighbors(W, S, M ) √ IterationLimit ← n × 2; CardMult ← 2; Neighbor ← ∅, Cost ← ∅; for i ← 0 to IterationLimit do Z ← |W | × CardMult literals chosen randomly from M \ W ; W ← minWeakBackdoor(F , W ∪ Z); if W ∈ tabuList then Neighbor ← Neighbor ∪ W ; Cost ← Cost ∪ |W |; if |Neighbor | = 0 then return 0; minCost ← min(Cost); if minCost ≤ current smallest backdoor size then S ← S ∪ {B ∈ Neighbor | |B| = minCost}; W ← select a backdoor from Neighbor with size minCost randomly; return minCost ;
The procedure localImprovement(S, M) is an auxiliary local search that attempts to find more minimal backdoors by replacing variables in s. The inspiration for the procedure is the observation that some variables appear in most backdoors and some backdoor sets only differ from each other by one variable. Procedure: localImprovements(S, M ) foreach new backdoor B ∈ S, B ∈ tabuList do tabuList ← tabuList ∪ B; foreach literal l ∈ {M \ B} do B ← minWeakBackdoor(F , B ∪ l); if |B| ≤ current minimum backdoor size then S ← S ∪ B;
Finding Small Backdoors in SAT Instances
277
Table 3. Size, percentage, and number of minimal backdoors found by the Exact algorithm when applied to small real-world instances with n variables and m clauses Instance n m BD size (%) # BDs grieu-vmpc-s05-24s 576 49478 3 (0.52%) 143 een-tip-sat-texas-tp-5e 17985 153 1 (0.01%) 2 anomaly 48 182 1 (2.08%) 2 medium 116 661 1 (0.86%) 5 huge 459 4598 2 (0.44%) 89 bw large.a 459 4598 2 (0.44%) 89 bw large.b 1087 13652 2 (0.18%) 7
5
Experimental Evaluation
In this section, we describe experiments on structured and real-world SAT instances to compare the algorithms shown in Table 2. The set of satisfiable test instances consists of planning instances from SATLIB [5] and all but six of the satisfiable real-world instances from SAT-Race 2008 (the instances excluded were those that Minisat was unable to solve within the competition time limit). The set of unsatisfiable test instances is from the domain of automotive configuration [12]. The instances were all pre-processed with Minisat, which can sometimes greatly reduce the number of clauses. The experiments were run on the Whale cluster of the SHARCNET system (www.sharcnet.ca). Each node of the cluster is equipped with four Opteron CPUs at 2.2 GHz and 4.0 GB memory. 5.1
Experiments on Finding Weak Backdoors
Algorithm Exact is able to find all minimal backdoors for instances with small backdoors (see Table 3). The sizes of minimal backdoors in the blocks world instances are smaller than those reported by Dilkina et al. [3] who report percentages between 1.09% to 4.17% even though they used clause learning in addition to unit propagation. The reason is that our sub-solver not only applies unit propagation, but also tests for polynomial-time syntactic classes. Systematic algorithms do not scale up to instances with larger backdoors, though. We also compared the small backdoors found by the local search algorithms, Kilby, KilbyImp, and Tabu. With different initial solutions as inputs, the local search algorithms were run repeatedly until a cutoff time was reached. Only the smallest backdoors found by the algorithms were recorded. The cutoff time was set to 3 hours for instances with fewer than 10,000 variables (see Table 4) and 15 hours for larger instances (see Table 5). For each instance, the algorithm that found the smallest backdoors among the three local search algorithms is highlighted, with the largest number of backdoors used to break ties. When the cutoff time was reached, we waited for the algorithms to finish the current iteration. Because Tabu takes longer to complete one iteration than Kilby and KilbyImp, the time when Tabu found small backdoors in some SATRace 2008 instances was a little longer than 15 hours. The longest time recorded
278
Z. Li and P. van Beek
Table 4. Size, percentage, and number of small backdoors found by the local search algorithms within a cutoff of 3 hours when applied to real-world instances with n variables (n < 10, 000) and m clauses Instance n SAT Competition 2002 apex7 gr rcs w5.shuffled 1500 dp10s10.shuffled 8372 bart11.shuffled 162 SAT-Race 2005 and 2008 grieu-vmpc-s05-24s 576 grieu-vmpc-s05-27r 729 simon-mixed-s02bis-01 2424 simon-s02b-r4b1k1.2 2424 Blocks world planning bw large.c 3016 bw large.d 6325 Logistics planning logistics.a 828 logistics.b 843 logistics.c 1141 logistics.d 4713
m
Kilby KilbyImp Tabu BD size (%) # BDs BD size (%) # BDs BD size (%) # BDs
11136 77 (5.13%) 1 47 (3.13%) 8557 9 (0.11%) 10520 9 (0.11%) 675 15 (9.26%) 4190 14 (8.64%) 49478 71380 13793 13811 50237 131607 3116 3480 5867 16588
3 4 8 8
(0.52%) (0.55%) (0.33%) (0.33%)
4 (0.13%) 6 (0.10%) 20 16 26 25
(2.42%) (1.90%) (2.28%) (0.53%)
143 710 566 394
3 4 8 7
(0.52%) (0.55%) (0.33%) (0.29%)
1934 3 (0.10%) 790 5 (0.08%) 147 1688 18 39
20 15 25 22
(2.42%) (1.78%) (2.19%) (0.47%)
4 53 (3.53%) 42885 9573 9 (0.11%) 59399 2903 14 (8.64%) 45044 143 3 (0.52%) 143 660 4 (0.55%) 3271 566 8 (0.33%) 10440 3 7 (0.29%) 16 15 3 (0.10%) 69 6 (0.10%) 6675 9789 387 61
24 16 28 28
15 640
(2.90%) 584257 (1.90%) 7634 (2.45%) 424467 (0.59%) 36610
was 168 seconds after the 15-hour cutoff time. It is possible that Kilby and KilbyImp would have found smaller backdoors during this leeway. Although Tabu takes longer in one iteration than Kilby and KilbyImp, Tabu is sometimes able to find a larger number of backdoors in the given time, and for instances that have small backdoors of size less than 10, a remarkably larger number. For many more of these real-world instances, KilbyImp outperformed Kilby and Tabu in finding small backdoors. Both Kilby and KilbyImp select the first candidate backdoor encountered. The Tabu algorithm searches the entire neighborhood for the best improvement, which can be too expensive when the backdoor size and the total number of variables are large. Williams et al. [13] experimented on practical instances with fewer than 10,000 variables and showed that such instances had relatively small backdoors. We extend their result to the SAT-Race 2008 instances, which have a huge number of variables and clauses. The SAT-Race 2008 instances have backdoors that consist of hundreds of variables. However, the backdoor size is usually less than 0.5% of the total number of variables. Thus, our results agree with Williams et al. that practical instances generally have small tractable structures. 5.2
Experiments on Finding Strong Backdoors
In previous work [11,2], unsatisfiable SAT benchmarks from automotive configuration [12] were used in the experiments. Among the 84 unsatisfiable instances, Minisat concludes the unsatisfiability of 71 instances after pre-processing. We applied the Strong algorithm to find minimal strong backdoors for the remaining 13 instances (see Table 6). The sizes of minimal strong backdoors range from 1 to 3, which are smaller than the sizes reported in [11,2]. We found smaller
Finding Small Backdoors in SAT Instances
279
Table 5. Size, percentage, and number of small backdoors found by the local search algorithms within a cutoff of 15 hours when applied to real-world instances with n variables (n > 10, 000) and m clauses. An entry of timeout indicates that the local search algorithm failed to find any small backdoor within the cutoff time. Instance ibm-2002-04r-k80 ibm-2002-11r1-k45 ibm-2002-18r-k90 ibm-2002-20r-k75 ibm-2002-22r-k75 ibm-2002-22r-k80 ibm-2002-23r-k90 ibm-2002-29r-k75 ibm-2004-01-k90 ibm-2004-1 11-k80 ibm-2004-23-k100 ibm-2004-23-k80 ibm-2004-29-k55 ibm-2004-3 02 3-k95 mizh-md5-47-3 mizh-md5-47-4 mizh-md5-47-5 mizh-md5-48-2 mizh-md5-48-5 mizh-sha0-35-3 mizh-sha0-35-4 mizh-sha0-36-1 mizh-sha0-36-3 mizh-sha0-36-4 post-c32s-gcdm16-22 velev-fvp-sat-3.0-b18 velev-vliw-sat-4.0-b4 velev-vliw-sat-4.0-b8 een-tip-sat-nusmv-t5.B een-tip-sat-vis-eisen narain-vpn-clauses-8 palac-sn7-ipc5-h16 palac-uts-l06-ipc5-h34 schup-l2s-motst-2-k315 simon-s03-w08-15
Kilby n m BD size (%) # BDs 104450 238773 252 (0.24%) 10 156626 290625 307 (0.20%) 3 175216 370661 360 (0.21%) 3 151202 319192 319 (0.21%) 4 191166 399095 453 (0.24%) 4 203961 427792 499 (0.25%) 1 222291 469900 537 (0.24%) 2 64686 258748 81 (0.13%) 11 64699 201260 148 (0.23%) 2 262808 565220 696 (0.27%) 4 207606 481764 524 (0.25%) 2 165606 379170 465 (0.28%) 2 37714 123699 67 (0.18%) 16 73525 169473 1297 (1.76%) 1 65604 153650 179 (0.27%) 1 65604 153778 184 (0.28%) 2 65604 153896 181 (0.28%) 2 66892 157184 203 (0.30%) 1 66892 157466 189 (0.28%) 6 48689 115548 258 (0.53%) 1 48689 115631 237 (0.49%) 1 50073 120102 261 (0.52%) 1 50073 120212 249 (0.50%) 1 50073 120279 237 (0.47%) 1 129652 88631 12 (0.01%) 133 35853 968394 228 (0.64%) 3 520721 13348080 timeout 521179 13378580 timeout 61933 42043 109 (0.18%) 6 18607 12801 8 (0.04%) 6087 1461772 4572347 timeout 114548 218043 10 (0.01%) 46 187667 606674 10 (0.01%) 152 507145 590065 timeout 132555 269328 233 (0.18%) 26
KilbyImp BD size (%) # BDs 154 (0.15%) 53 282 (0.18%) 7 331 (0.19%) 6 275 (0.18%) 17 424 (0.22%) 3 466 (0.23%) 4 534 (0.24%) 1 58 (0.09%) 26 87 (0.13%) 5 648 (0.25%) 1 455 (0.22%) 1 441 (0.27%) 1 52 (0.14%) 21 238 (0.32%) 2 179 (0.27%) 1 190 (0.29%) 1 181 (0.28%) 2 203 (0.30%) 1 189 (0.28%) 6 254 (0.52%) 2 237 (0.49%) 1 261 (0.52%) 1 260 (0.52%) 4 237 (0.47%) 1 12 (0.01%) 133 212 (0.59%) 1 timeout timeout 88 (0.14%) 35 8 (0.04%) 16466 timeout 10 (0.01%) 46 10 (0.01%) 152 timeout 115 (0.09%) 31
Tabu BD size (%) # BDs 184 (0.18%) 2 344 (0.22%) 2 496 (0.28%) 1 384 (0.25%) 1 551 (0.29%) 2 605 (0.30%) 1 624 (0.28%) 2 59 (0.09%) 1 93 (0.14%) 8 732 (0.28%) 1 618 (0.30%) 4 550 (0.33%) 1 49 (0.13%) 6381 251 (0.34%) 1 265 (0.40%) 4 232 (0.35%) 2 235 (0.36%) 1 289 (0.43%) 1 238 (0.36%) 1 238 (0.49%) 1 210 (0.43%) 1 219 (0.44%) 1 209 (0.42%) 5 220 (0.44%) 1 11 (0.01%) 126 227 (0.63%) 1 933 (0.18%) 1 timeout 92 (0.15%) 14318 8 (0.04%) 36941 timeout 10 (0.01%) 1533 10 (0.01%) 102 timeout 152 (0.12%) 4
backdoors because we applied a systematic search algorithm, and we defined sub-solvers both syntactically and algorithmically.
6
Conclusion
We presented exact algorithms for finding all minimal weak backdoors in satisfiable instances and all minimal strong backdoors in unsatisfiable instances. Building on Kilby et al.’s local search algorithm Kilby, we described our improved local search algorithms KilbyImp and Tabu for finding small weak backdoors. We empirically evaluated the algorithms on structured and real-world SAT instances. The experimental results show that our algorithms based on our proposed sub-solvers can find smaller backdoors and significantly larger numbers of backdoors than previous algorithms. In future work, we intend to use our
280
Z. Li and P. van Beek
Table 6. Size and number of minimal strong backdoors found by the Strong algorithm when applied to automotive configuration instances with n variables and m clauses Instance C168 FW SZ 128 C202 FS RZ 44 C210 FS RZ 23 C210 FS SZ 103 C210 FW RZ 57 C210 FW SZ 128 C220 FV SZ 65
n 1698 1750 1755 1755 1789 1789 1728
BD BD m size # 5425 3 6 6199 2 26 5778 3 17 5775 2 3 7405 2 4 7412 1 3 4496 1 2
Instance C168 FW SZ 66 C202 FW SZ 87 C210 FS RZ 38 C210 FW RZ 30 C210 FW SZ 106 C210 FW UT 8630
n 1698 1799 1755 1789 1789 2024
BD BD m size # 5401 1 3 8946 3 90 5763 2 4 7426 3 16 7417 2 3 9721 1 2
algorithms for finding backdoors to study value and variable ordering mistakes and their effect on the performance of backtracking algorithms.
References 1. Davis, M., Logemann, G., Loveland, D.: A machine program for theorem proving. Commun. ACM 5(7), 394–397 (1962) 2. Dilkina, B., Gomes, C.P., Sabharwal, A.: Tradeoffs in the complexity of backdoor detection. In: Bessi`ere, C. (ed.) CP 2007. LNCS, vol. 4741, pp. 256–270. Springer, Heidelberg (2007) 3. Dilkina, B., Gomes, C.P., Sabharwal, A.: Backdoors in the context of learning. In: Kullmann, O. (ed.) SAT 2009. LNCS, vol. 5584, pp. 73–79. Springer, Heidelberg (2009) 4. Gregory, P., Fox, M., Long, D.: A new empirical study of weak backdoors. In: Stuckey, P.J. (ed.) CP 2008. LNCS, vol. 5202, pp. 618–623. Springer, Heidelberg (2008) 5. Hoos, H.H., St¨ utzle, T.: SATLIB: An online resource for research on SAT. In: Gent, I.P., Maaren, H.v., Walsh, T. (eds.) SAT 2000, pp. 283–292. IOS Press, Amsterdam (2000) 6. Interian, Y.: Backdoor sets for random 3-SAT. Paper presented at SAT 2003 (2003) 7. Kilby, P., Slaney, J., Thi´ebaux, S., Walsh, T.: Backbones and backdoors in satisfiability. In: Proc. of AAAI, pp. 1368–1373 (2005) 8. Kottler, S., Kaufmann, M., Sinz, C.: Computation of Renameable Horn Backdoors. In: Kleine B¨ uning, H., Zhao, X. (eds.) SAT 2008. LNCS, vol. 4996, pp. 154–160. Springer, Heidelberg (2008) 9. Paris, L., Ostrowski, R., Siegel, P., Sais, L.: Computing Horn strong backdoor sets thanks to local search. In: Proc. of ICTAI, pp. 139–143 (2006) 10. Ruan, Y., Kautz, H., Horvitz, E.: The backdoor key: A path to understanding problem hardness. In: Proc. of AAAI, pp. 124–130 (2004) 11. Samer, M., Szeider, S.: Backdoor trees. In: Proc. of AAAI, pp. 363–368 (2008) 12. Sinz, C., Kaiser, A., K¨ uchlin, W.: Formal methods for the validation of automotive product configuration data. AI EDAM 17(1), 75–97 (2003) 13. Williams, R., Gomes, C., Selman, B.: Backdoors to typical case complexity. In: Proc. of IJCAI, pp. 1173–1178 (2003) 14. Williams, R., Gomes, C., Selman, B.: On the connections between backdoors and heavy-tails on combinatorial search. Paper presented at SAT 2003 (2003)
Normal Distribution Re-Weighting for Personalized Web Search Hanze Liu and Orland Hoeber Department of Computer Science, Memorial University St. John’s, N.L, Canada {hl5458,hoeber}@mun.ca
Abstract. Personalized Web search systems have been developed to tailor Web search to users’ needs based on their interests and preferences. A novel Normal Distribution Re-Weighting (NDRW) approach is proposed in this paper, which identifies and re-weights significant terms in vector-based personalization models in order to improve the personalization process. Machine learning approaches will be used to train the algorithm and discover optimal settings for the NDRW parameters. Correlating these parameters to features of the personalization model will allow this re-weighting process to become automatic.
1
Introduction
Web search is an essential tool for today’s Web users. Web search systems, such as Google, Yahoo! and Bing have been introduced to the public users and achieved great success. However, traditional search engines share a fundamental problem: they commonly return the same search results to different users under the same query, ignoring the individual search interests and preferences between users. This problem has hindered conventional Web search engines in their efforts to provide accurate search results to the users. To address the problem, personalized Web search has been introduced as a way to learn the individual search interests and preferences of users, and use this information to tailor the Web search to meet each user’s specific information needs [6]. Personalized Web search employs personalization models to capture and represent users’ interests and preferences, which are usually stored in the form of term vectors (see [2][6] for a review of vector-based models for personalized search). High-dimensional vectors are used to represent each user’s interest in specific terms that might be present in the search results. These vectors are then used to provide a personalized re-ranking of the search results. In this research, we focus on improving personalized Web search through refining such vector-based personalization models. The goal in our research is to develop methods to automatically identify and re-weight the significant terms in the target model. This approach is inspired
M.Sc. Supervisor.
C. Butz and P. Lingras (Eds.): Canadian AI 2011, LNAI 6657, pp. 281–284, 2011. c Springer-Verlag Berlin Heidelberg 2011
282
H. Liu and O. Hoeber
by Luhn’s seminal work in automatic text processing [5], in which he suggests that the “resolving power” of significant terms follows a normal distribution placed over a term list ranked by the frequency of term occurrence. In other words, Luhn suggests that the mid-frequency terms are more content bearing than either common terms or rare terms, and so are better indicators for the subject of the text. This idea has been widely utilized in the fields of automatic text summarization [8] and Web search converge testing [1]. However, to the best of our knowledge, it has not been explored in the literature of personalized Web search. In the following sections, we will demonstrate how we could borrow Luhn’s idea to improve the vector-based models used in personalized Web search.
2
Normal Distribution Re-Weighting (NDRW)
The first step in the NDRW approach is to rank the terms in the vector-based personalization model according to their frequency, resulting in a term histogram as illustrated in Fig. 1. Luhn’s suggestion is that high-frequency and low-frequency terms are not valuable. By placing a normal distribution curve over top of the term histogram, we can assign a significance value to each term, reducing the weight of the terms near the two ends, and giving more weight to the valuable terms in the middle range. To calculate the term significance (T S) value for each term, we employ the following formula: T S(i) = normdist(s ∗ ri ) = √
1 2πσ 2
2
e−(s∗ri −μ)
/2σ2
(1)
where ri is the rank of a given term i, and s is a predetermined step size between any two adjacent terms along the x-axis. There are three parameters in this function that affect the shape of the normal distribution curve, and therefore the T S value for a given term. μ is the mean of the distribution; it decides the location of the centre of the normal distribution curve. σ 2 is the variance of the distribution; it describes how concentrated the distribution curve is around the mean. The step size s affects the steepness of the distribution curve given a
Fig. 1. NDRW re-weights the terms using a normal distribution curve
Normal Distribution Re-Weighting for Personalized Web Search
283
constant variance. Once appropriate parameters are chosen for μ, σ 2 and s which specify the location and shape of the normal distribution curve, T S values can be calculated for each term and used to re-weight the personalization model. miSearch [3] is an existing vector-based personalized Web search system that is used as the baseline system in this research. The novel feature of this personalization system is that it maintains multiple topic profiles for each user to avoid the noise which normally exists within single-profile personalization models. The topic profiles in miSearch are term vectors in which terms are extracted from the clicked result documents and weighted by term frequency. We have implemented the NDRW approach within this system to re-weight the terms in the topic profiles, and have been able to improve the accuracy of the ranked search results list by carefully choosing the NDRW parameters. However, an important part of this research is to automatically determine these parameters based on features within the target vector-based model. The process by which we plan to achieve this is outlined in the remainder of this paper.
3
Automatic Algorithm for NDRW
In order to develop the automatic algorithm for NDRW, we plan to employ a supervised machine learning scheme. There are three main steps in this plan: preparing the training data and test data for the learning process, defining the evaluation metrics to guide the learning, and training the optimum parameters and the algorithm. Twelve queries were selected from the TREC 2005 Hard Track [7] for previous evaluations on the baseline miSearch system [3]. We will continue to use this test collection as the training data for our experiments. These queries were intentionally chosen because of their ambiguity. For each query, 50 search results have been collected and judged for relevance. The value of the personalization approach will be decided based on whether the relevant documents can be moved to the top of the search results list. For the test data, we will select another 12 ambiguous queries from this test collection and provide relevance judgements on the documents retrieved. We will use average precision (AP) measured over the top-10 and top-20 documents as the evaluation metric. In order to facilitate the experiments, a test program will be implemented to automatically apply NDRW to the target personalization models with associated test queries, and directly output the resulting AP values, given a set of NDRW parameters. To train the optimum parameters for each set of training data, Particle Swarm Optimization (PSO) [4] will be employed. The test program mentioned above will play the role of the fitness function in the PSO. The fitness value will be calculated by 60% of the top-10 AP value plus 40% of the top-20 AP value. Each particle contains three parameters (μ, σ2 and s), and the optimum parameters are achieved when particles converge to the global best fitness value for a given set of training data.
284
H. Liu and O. Hoeber
After gathering the optimum parameters for each set of training data, it may be possible to discover relationships between the optimum parameters and the features within the corresponding personalization models. Furthermore, by analyzing these relationships, we may be able to establish general rules for choosing the NDRW parameters. With the established rules, the algorithm for automatically choosing the parameters can be constructed. We can then verify its quality by applying it to the test data, measuring the degree to which the AP is improved and how close the parameters are to the optimal parameters for each test query.
4
Conclusion and Future Work
In order to improve personalized Web search, we proposed a novel Normal Distribution Re-Weighting (NDRW) approach to identify and re-weight significant terms in vector-based personalization models. Currently, we are working on the main task of this research, which is to develop an automatic algorithm for choosing NDRW parameters based on the features of the target model. In the future, we plan to conduct user evaluations to measure the benefit of using the NDRW technique for improving personalized Web search in realistic settings.
References 1. Dasdan, A., D’Alberto, P., Kolay, S., Drome, C.: Automatic retrieval of similar content using search engine query interface. In: Proceedings of the ACM Conference on Information and Knowledge Management, pp. 701–710 (2009) 2. Gauch, S., Speretta, M., Chandramouli, A., Micarelli, A.: User profiles for personalized information access. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) Adaptive Web 2007. LNCS, vol. 4321, pp. 54–89. Springer, Heidelberg (2007) 3. Hoeber, O., Massie, C.: Automatic topic learning for personalized re-ordering of web search results. In: Sn´ aˇsel, V., Szczepaniak, P.S., Abraham, A., Kacprzyk, J. (eds.) Advances in Intelligent Web Mastering - 2. Advances in Intelligent and Soft Computing, vol. 67, pp. 105–116. Springer, Heidelberg (2010) 4. Kennedy, J., Eberhart, R.: Particle Swarm Optimization. In: Proceedings of IEEE International Conference on Neural Networks, vol. IV, pp. 1942–1948 (1995) 5. Luhn, H.P.: The automatic creation of literature abstracts. IBM Journal of Research and Development 2, 159–165 (1958) 6. Micarelli, A., Gasparetti, F., Sciarrone, F., Gauch, S.: Personalized Search on the World Wide Web. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) Adaptive Web 2007. LNCS, vol. 4321, pp. 195–230. Springer, Heidelberg (2007) 7. National Institute of Standards and Technology. TREC 2005 Hard Track, http: //trec.nist.gov/data/t14 hard.html 8. Shen, D., Chen, Z., Yang, Q., Zeng, H., Zhang, B., Lu, Y., Ma, W.: Web-page classification through summarization. In: Proceedings of the International ACM/SIGIR Conference on Research and Development in Information Retrieval, pp. 242–249 (2004)
Granular State Space Search Jigang Luo and Yiyu Yao Department of Computer Science, University of Regina Regina, Saskatchewan, Canada S4S 0A2 {luo226,yyao}@cs.uregina.ca
Abstract. Hierarchical problem solving, in terms of abstraction hierarchies or granular state spaces, is an effective way to structure state space for speeding up a search process. However, the problem of constructing and interpreting an abstraction hierarchy is still not fully addressed. In this paper, we propose a framework for constructing granular state spaces by applying results from granular computing and rough set theory. The framework is based on an addition of an information table to the original state space graph so that all the states grouped into the same abstract state are graphically and semantically close to each other.
1
Introduction
State space search is widely used for problem solving in artificial intelligence. Hierarchical problem solving using an abstraction hierarchy is one of the most popular approaches to speed up state space search [1, 5, 6]. One major issue that impacts the search efficiency of an abstraction hierarchy is backtracking [3]. Many methods have been proposed and investigated for constructing a good abstraction hierarchy that has as few backtrackings as possible [2, 3]. However, in the existing methods the semantic information of states is not explicitly used. Granular computing is an emerged field of study dealing with problem solving at multiple levels of granularity. The triarchic theory of granular computing [8– 10] provides a conceptual model for thinking, problem solving and information processing with hierarchical structures, of which an abstraction hierarchy is an example. Rough set theory provides a systematic and semantically meaningful way to granulate a universe with respect to an information table [4]. Based on results from rough set theory and granular computing, in this paper we propose a framework for constructing abstraction hierarchies by using semantic information. An information table is used to represent semantic information about states. Consequently, a better abstraction hierarchy can be constructed such that states in an abstract state are close to each other both semantically and graphically. This may not only prevent backtrackings but also lead to a better understanding of a problem.
2
Abstraction Hierarchy
A state space can be modeled by a graph G = (S, E), where S is a nonempty set of states and E is a nonempty set of edges connecting the states. An edge C. Butz and P. Lingras (Eds.): Canadian AI 2011, LNAI 6657, pp. 285–290, 2011. c Springer-Verlag Berlin Heidelberg 2011
286
J. Luo and Y. Yao
(s1 , s2 ) is in E if s1 can be transformed to s2 by one operator in a problem solving process. A solution to a problem is an edge path from state sstart to state sgoal , where sstart is the start state and sgoal is a goal state of the problem. An abstraction of a state space (S, E) is another state space (S , E ) such that the following two conditions are satisfied: 1) There exists a partition on S, blocks of the partition and the states in S are one-to-one mapped. If state s belonging to a partition block is mapped to a state s , then s is the pre-image of s and s is the image of s. The set of all pre-images of s is the pre-image set of s ; 2) (s1 , s2 ) is in E if and only if there is an edge in E that connects s1 ’s pre-image to s2 ’s pre-image. An abstraction of a state space may also have an abstraction, so there is a series of abstractions such that one is more abstract than the previous one. The state space and the series of abstractions form an abstraction hierarchy for state space. In an abstraction hierarchy the original state space is the level 0 state space, the first abstraction is the level 1 state space, and so on. The search through an abstraction hierarchy is to first search for a solution in the highest level state space, then in the lower level state spaces one by one until a solution in the level 0 state space is found. An abstraction hierarchy may speed up search for the following reasons: 1) states in a higher level state space are fewer than those in the original state space, search in a higher level state space is relatively faster than search in the original state space; 2) once a solution in higher level state space is found, it can serve as a guide to find a solution in the original state space, one only needs to search in the pre-image set of an abstract state that belongs to the abstract solution [3, 7]. An abstraction hierarchy does not always speed up search due to backtracking [3]. A good abstraction hierarchy should satisfy two requirements: 1) backtrackings are reduced to as few as possible; 2) all states in the same pre-image set are semantically close. The first requirement guarantees the efficiency for search process and the second requirement helps give a better understanding of the state space. In the next section we propose a granular computing approach to construct abstraction hierarchies that satisfy both requirements.
3
A Granular Computing Model for Constructing Abstraction Hierarchies
In existing methods for constructing an abstraction hierarchy, one typically uses the structural information about a state space. States that are close to each other according to their distance in the state space graph are grouped to form abstract states. Such a granulation may not necessarily reflect the semantic closeness of different states. To resolve this problem, we propose a new framework by combining attribute-oriented granulation of rough set theory for generating semantically meaningful abstraction and graph-oriented verification for selecting graphically meaningful abstraction. The proposed model has three main components. They are explained in this section.
Granular State Space Search
287
Attribute-Oriented Granulation Information tables [4] are an important knowledge representation method in granular computing. In an information table, a set of attributes is used to describe a set of objects. Definition 1. An information table is a tuple (U, At, {Va |a ∈ At}, {Ia |a ∈ At}), where U is a finite nonempty set of objects, At is a finite nonempty set of attributes, Va is a nonempty set of values for a ∈ At, and Ia : U −→ Va is an information function for a ∈ At. Each information function Ia is a total function that maps an object of U to exactly one value in Va . An information table can be conveniently presented in a table form. In an information table for representing a state space, U is the set of states, At is the set of attributes for describing the states, Va is the set of values that the attribute a could possess, and Ia determines the value for every state’s attribute a. If Ia (s) = k we say that the state s’s value for a is k. We adopt the three-disk Hanoi problem from Knoblock [3] as an example to illustrate the main ideas. For the three-disk Hanoi problem, there are three pegs: peg1, peg2, peg3 and three disks: A, B, C. Disk A is bigger than disk B and disk B is bigger than disk C. A disk can be put on any peg, and the top disk on one peg can be moved to the top of another peg. There is a constraint that no disk could have a bigger disk on its top. At the beginning all the disks are on peg1, we need find a way to move all the disks to peg3. We can construct an information table for the 27 states in the problem as (U, At, {Va |a ∈ At}, {Ia |a ∈ At}), where U is the set of all 27 states, At has three attributes A, B, C that represent the positions of disks A, B and C, respectively. VA has three values 1, 2, 3 that indicate A is on peg1, peg2 and peg3, respectively. For the same reason VB and VC also have three values 1, 2, 3. Ia is a function that maps a state to a value. Ia (s) = n if and only if in the state s, disk a is on peg pegn. This information table can be presented in a table form by Fig. 1(i). Columns are labeled by attributes, rows are labeled by state names. A row is a description of a state. For example, the row s3 describes the state that A is on peg1, B is on peg1 and C is on peg3. Definition 2. Suppose U is a universe of a domain, a Granule of U is a nonempty subset of U , a Granulation of U is a partition of U , that is, it is a set of granules of U such that the intersection of any two granules is empty and the union of all the granules is U . For two granulations of U , G and G , G is a refined granulation of G or G is a coarsened granulation of G, written G G , iff for every g ∈ G, there is a g ∈ G such that g ⊆ g . If G0 G1 G2 · · · Gn (n ≥ 1), we say G0 G1 G2 · · · Gn is an (n + 1)level Granulation Hierarchy, G0 is the level 0 granulation, G1 is the level 1 granulation and so on. If a universe is described by an information table, a granulation can be constructed by using a subset of attributes [4]. Let F ⊆ At be a subset of attributes. According to F , we can define an equivalence relation RF as: xRF y ⇔ ∀a ∈ F (Ia (x) = Ia (y))
288
J. Luo and Y. Yao
s1
s3
s2
s6
s8
s9
s∗1
s5
s7
s4
s23
s18
s∗3 s17
s16
s∗2
s22
s24
s∗6 s11 s21
s13
s∗8
s25
s27
s14
s∗5 s15
s10
s12
s19
s20
(b) {A,B} Induced Abstraction
(a) The Original State Space
state s∗1 s∗2 s∗3 s∗4 s∗5 s∗6 s∗7 s∗8 s∗9
s1
s2
state s1 s2 s3
s3
(c) {A} Induced Abstraction state s1 s2 s3 s4 s5 s6 s7 s8 s9
A 1 1 1 1 1 1 1 1 1
s∗9 s∗7
s∗4
s26
A 1 2 3
A 1 1 1 2 2 2 3 3 3
B 1 2 3 1 2 3 1 2 3
(d) {A} Induced Information (e) {A,B} Induced Information Table Table B 1 1 1 2 2 2 3 3 3
C 1 2 3 1 2 3 1 2 3
state s10 s11 s12 s13 s14 s15 s16 s17 s18
A 2 2 2 2 2 2 2 2 2
B 1 1 1 2 2 2 3 3 3
C 1 2 3 1 2 3 1 2 3
state s19 s20 s21 s22 s23 s24 s25 s26 s27
A 3 3 3 3 3 3 3 3 3
(i) An Information Table for Three-Disk Hanoi Problem Fig. 1. State Space and Information Table
B 1 1 1 2 2 2 3 3 3
C 1 2 3 1 2 3 1 2 3
Granular State Space Search
289
That is, two objects are equivalent if and only if they have the same values on all attributes in F . The equivalence class containing x is given by [x]RF = {y ∈ U |xRF y}. The partition U/R = {[x]RF |x ∈ U } induced by RF is a granulation of U and every equivalence class is a granule. Let G be the granulation of a state space induced by an attribute set F . The set of attributes At − F consists of all attributes that are not used. We delete all the columns in the information table that correspond to attributes in At − F , and delete duplicated rows to get a new information table, we take the rows as new states. In this way we obtain all the abstract states. Every abstract state corresponds to an equivalence class or a granule. We then create an abstraction by connecting these abstract states by edges as long as there are edges between two states in pre-image sets of two abstract states. As this abstraction is created by the granulation G induced by F , we call it the F induced abstraction. For example, Fig. 1(e) is the {A, B} induced information table and Fig. 1(b) is {A, B} induced abstraction. Backtracking-free Abstractions Selection An abstraction of a state space is called backtracking-free abstraction if a solution found in the abstract state space can be refined into a solution in the original space. That is, it is not necessary to backtrack to another abstract state in the higher level once we reach a lower level. Granulation based on a subset of attributes does not consider graphical information of a state space. An attribute-oriented abstraction may not necessarily be a backtracking-free abstraction. Thus, graphical information should be used to select graphically meaningful, i.e., backtracking-free, abstractions. One can analyze the edges of state space and select only the subsets of attributes that can induce backtrackingfree abstractions. Recall that [x]RF is an abstract state consisting of many states in a lower level. An incoming state in [x]RF is a state that has an incoming edge connected from a state outside [x]RF , an outgoing state of [x]RF is the state that has an outgoing edge connected to a state outside [x]RF . To avoid backtracking, we can select the subsets of attributes that induce abstractions satisfying the condition: for any equivalence class [x]RF , any pair of incoming and outgoing states are connected only by states in [x]RF . Constructing Abstraction Hierarchies Suppose G1 is the granulation induced by a subset of attributes A1 and G2 is the granulation induced by another subset of attributes A2 . If A1 ⊃ A2 , then G1 is a refined granulation of G2 . If A0 ⊃ A1 ⊃ A2 · · · ⊃ An , the granulations induced by A0 , A1 , · · · , An form an (n + 1)-level granulation hierarchy. It provides an (n + 1)-level abstraction hierarchy. If every granulation in an (n + 1)-level abstraction is a backtracking-free abstraction, we obtain a backtracking-free abstraction hierarchy. By combining attribute-oriented granulation and backtracking-free abstraction selection, one can easily construct a backtracking-free abstraction hierarchy.
290
J. Luo and Y. Yao
Take the three-disk Hanoi problem as an example, Fig. 1(a) is the original state space; since every edge is bidirectional, we did not draw the arrow of an edge. From all possible attribute-oriented granulations we select all the subsets of attributes that induce backtracking-free abstractions. They are {A}, {A, B}. As {A} ⊂ {A, B}, we can create a three level abstraction hierarchy as shown by Fig. 1, the level 0 (Fig. 1(a)) is the original state space, the level 1(Fig. 1(b)) is induced by {A, B}, the level 2(Fig. 1(c)) is induced by {A}.
4
Conclusion
In this paper, we propose a granular computing model for constructing abstraction hierarchies. Our model introduces an information table for describing states and combining states into abstract state. This approach guarantees that all the pre-images of the same abstract state are semantically close, and every abstract state is semantically meaningful. An abstraction hierarchy constructed by our model not only avoids backtracking, but also gives a better understanding of a problem.
References 1. Bacchus, F., Yang, Q.: Downward refinement and the efficiency of hierarchical problem solving. Artificial Intelligence 71, 43–100 (1994) 2. Holte, R.C., Mkadmi, T., Zimmer, R.M., MacDonald, A.J.: Speeding up problem solving by abstraction: a graph oriented approach. Artificial Intelligence 85, 321– 361 (1996) 3. Knoblock, C.A.: Generating Abstraction Hierarchies: An Automated Approach to Reducing Search in Planning. Kluwer Academic Publishers, Boston (1993) 4. Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasonging about Data. Kluwer Academic Publishers, Boston (1991) 5. Sacerdoti, E.D.: Planning in a hierarchy of abstraction spaces. Artificial Intelligence 5, 115–135 (1974) 6. Shell, P., Carbonell, J.: Towards a general framework for composing disjunctive and iterative macro-operators. In: Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pp. 596–602 (1989) 7. Yang, Q., Tenenberg, J.D.: Abtweak: Abstracting a nonlinear, least commitment planner. In: Proceedings of the Eighth National Conference on Artificial Intelligence, pp. 204–209 (1990) 8. Yao, Y.Y.: Artificial intelligence perspectives on granular computing. In: Pedrycz, W., Chen, S.H. (eds.) Granular Computing and Intelligent Systems. Springer, Berlin (2011) 9. Yao, Y.Y.: A unified framework of granular computing. In: Pedrycz, W., Skowron, A., Kreinovich, V. (eds.) Handbook of Granular Computing, pp. 401–410. Wiley, New York (2008) 10. Yao, Y.Y.: Granular computing: past, present and future. In: 2008 IEEE International Conference on Granular Computing, pp. 80–85 (2008)
Comparing Humans and Automatic Speech Recognition Systems in Recognizing Dysarthric Speech Kinfe Tadesse Mengistu and Frank Rudzicz University of Toronto, Department of Computer Science 6 King’s College Road Toronto, Ontario, Canada {kinfe,frank}@cs.toronto.edu
Abstract. Speech is a complex process that requires control and coordination of articulation, breathing, voicing, and prosody. Dysarthria is a manifestation of an inability to control and coordinate one or more of these aspects, which results in poorly articulated and hardly intelligible speech. Hence individuals with dysarthria are rarely understood by human listeners. In this paper, we compare and evaluate how well dysarthric speech can be recognized by an automatic speech recognition system (ASR) and na¨ıve adult human listeners. The results show that despite the encouraging performance of ASR systems, and contrary to the claims in other studies, on average human listeners perform better in recognizing single-word dysarthric speech. In particular, the mean word recognition accuracy of speaker-adapted monophone ASR systems on stimuli produced by six dysarthric speakers is 68.39% while the mean percentage correct response of 14 na¨ıve human listeners on the same speech is 79.78% as evaluated using single-word multiple-choice intelligibility test. Keywords: speech recognition, dysarthric speech, intelligibility.
1
Introduction
Dysarthria is a neurogenic motor speech impairment which is characterized by slow, weak, imprecise, or uncoordinated movements of the speech musculature [1] resulting in unintelligible speech. This impairment results from damage to neural mechanisms that regulate the physical production of speech and is often accompanied by other physical handicaps that limit interaction with modalities such as standard keyboards. Automatic speech recognition (ASR) can, therefore, assist individuals with dysarthria to interact with computers and control their environments. However, the deviation of dysarthric speech from the assumed norm in most ASR systems makes the benefits of current speaker-independent (SI) speech recognition systems unavailable to this population of users. C. Butz and P. Lingras (Eds.): Canadian AI 2011, LNAI 6657, pp. 291–300, 2011. c Springer-Verlag Berlin Heidelberg 2011
292
K.T. Mengistu and F. Rudzicz
Although reduced intelligibility is one of the distinguishing characteristics of dysarthric speech, it is also characterized by highly consistent articulatory errors [1]. The consistency of errors in dysarthric speech can, in principle, be exploited to build an ASR system specifically tailored to a particular dysarthric speaker since ASR models do not necessarily require intelligible speech as long as consistently articulated speech is available. However, building a speaker-dependent (SD) model trained of spoken data from an individual dysarthric speaker is practically infeasible due to the difficulty of collecting large enough amount of training data from a dysarthric subject. Therefore, a viable alternative is to adapt an existing SI model to the vocal characteristics of a given dysarthric individual. The purpose of this study is to compare na¨ıve human listeners and speakeradapted automatic speech recognition (ASR) systems in recognizing dysarthric speech and to investigate the relationship between intelligibility and ASR performance. In earlier studies, it has been shown that ASR systems may outperform human listeners in recognizing impaired speech [2–4]. However, since intelligibility is typically a relative rather than an absolute measure [5], these results do not necessarily generalize. Intelligibility may vary depending on the size and type of vocabulary used, the familiarity of the listeners with the intended message or the speakers, the quality of recording (i.e. the signal-to-noise ratio), and the type of response format used. Yorkston and Beukelman [6] compared three different types of response formats: transcription, sentence completion, and multiple choice. In transcription, listeners were asked to transcribe the word or words that have been spoken. In sentence completion, listeners were asked to complete sentences from which a single word had been deleted. In the multiple choice format, listeners selected the spoken word from a list of phonetically similar alternatives. Their results indicated that transcription was associated with lowest intelligibility scores, while multiple choice tasks were associated with the highest scores. This clearly shows that listeners’ performance can vary considerably depending on the type of response format used. Therefore, when comparing human listeners and an ASR system, the comparison should be made on a level ground; i.e., both should be given the same set of alternative words (foils) from which to choose. In other words, it would be unfair to compare an ASR system and a human listener without having a common vocabulary, and since the innate vocabulary of our participants is unknown (but may exceed 17,000 base words [7]), we opt for a small common vocabulary. Hence, the multiple choice response format is chosen in this paper.
2 2.1
Method Speakers
The TORGO database consists of 15 subjects, of which eight are dysarthric (five males, three females), and seven are non-dysarthric control subjects (four males, three females) [8]. All dysarthric participants have been diagnosed by a
Humans vs. ASR Systems in Recognizing Dysarthric Speech
293
speech-language pathologist according to the Frenchay Dysarthria Assessment [9] to determine the severity of their deficits. According to this assessment, four speakers (i.e., F01, M01, M02, and M04) are severely dysarthric, one speaker (M05) is moderately-to-severely dysarthric, and one subject (F03) is moderately dysarthric. Two subjects (M03 and F04) have very mild dysarthria and are not considered as dysarthric in this paper as their measured intelligibility is not substantially different from the non-dysarthric speakers in the database. 2.2
Speech Stimuli
Three hours of speech are recorded from each subject in multiple sessions in which an average of 415 utterances are recorded from each dysarthric speaker and 800 from each control subject. The single-word stimuli in the database include repetitions of English digits, the international radio alphabets, the 20 most frequent words in the British National Corpus (BNC), and a set of words selected by Kent et al. to demonstrate phonetic contrasts [5]. The sentence stimuli are derived from the Yorkston-Beukelman assessment of intelligibility [10] and the TIMIT database [11]. In addition, each participant is asked to describe in his or her own words the contents of a few photographs that are selected from standardized tests of linguistic ability so as to include dictation-style speech in the database. A total of 1004 single-word utterances were selected from the recordings of the dysarthric speakers and 808 from control speakers for this study. These consist of 607 unique words. Each listener is presented with 18% of the data (singleword utterances) from each dysarthric subject where 5% of randomly selected utterances are repeated for intra-listener agreement analysis resulting in a total of 180 utterances from the six dysarthric individuals. In addition, a total of 100 single-word utterances are selected from three male and three female control subjects comprising about 6% of utterances from each speaker. Altogether, each participant listens to a total of 280 speech files which are presented in a random order. Inter-listener agreement is measured by ensuring that each utterance is presented to at least two listeners. 2.3
Listeners
Fourteen native North American English speakers who had no previous familiarity with dysarthric speech and without hearing or vision impairment were recruited as listeners. The listening task consisted of a closed-set multiple-choice selection in which listeners were informed that they would be listening to a list of single-word utterances spoken by individuals with and without speech disorders in a random order. For every spoken word, a listener was required to select a word that best matched his/her interpretation from among a list of eight alternatives. Four of the seven foils were automatically selected from phonetically similar words in the pronunciation lexicon, differing from the true word in one or two phonemes. The other three foils were generated by an HMM-based speech
294
K.T. Mengistu and F. Rudzicz
recognizer trained on the entire data to produce an N-best list such that the first three unique words different from the target word are selected. Listeners were allowed to replay prompts as many times as they want.
3
Intelligibility Test Results
For each listener, the percentages of correct responses out of the 180 dysarthric prompts and 100 non-dysarthric prompts were calculated separately. The correct percentages were then averaged across the 14 listeners to compute the mean recognition score of na¨ıve human listeners on dysarthric and non-dysarthric speech. Accordingly, the mean recognition score of human listeners is 79.78% for stimuli produced by dysarthric speakers and 94.4% for stimuli produced by control speakers. Figure 1 depicts the recognition score of the 14 na¨ıve listeners on stimuli produced by dysarthric and control speakers.
Fig. 1. Word recognition score of 14 na¨ıve human listeners
To measure the intelligibility of stimuli produced by a speaker, the responses of all listeners for the stimuli produced by that speaker are collected together and the percentage of correct identifications is computed. Accordingly, for severely dysarthric speakers, the intelligibility score ranged from 69.05% – 81.88% with the mean score being 75.2%. Speaker M05, who is moderately-to-severely dysarthric, had 87.88% of his words correctly recognized, and the moderately dysarthric speaker F03 had 90% of her words recognized correctly. These results are presented in Figure 2.
Humans vs. ASR Systems in Recognizing Dysarthric Speech
295
Fig. 2. Intelligibility score of six dysarthric speakers as rated by 14 na¨ıve human listeners
On average, listeners agreed on common utterances between 72.2% and 81.6% of the time with the mean inter-listener agreement being 77.2%. The probability of chance agreement here is 12.5% since there are 8 choices per utterance. Intra-listener reliability is measured as the proportion of times that a listener identifies the same word across two presentations of the same audio prompt. The mean intra-listener agreement across all listeners is 88.5%, with the lowest being 79.6% and the highest being 96.3% (listeners 7 and 10).
4 4.1
ASR Experiments and Results Data Description
The speaker-independent (SI) acoustic models are built using a subset of the TORGO database consisting of over 8400 utterances recorded from six dysarthric speakers, two speakers with very mild dysarthria, and seven control subjects. The SI models are trained and evaluated using the leave-one-out method; i.e., data from one speaker are held out for evaluation while all the remaining data from the other speakers are used for training. The held-out data from the test speaker is divided into an evaluation-set and an adaptation-set. The evaluation-set consists of all unique single-word stimuli spoken by the test dysarthric speaker (described in Section 2.2) while the remaining data are later used as adaptation-set to adapt a SI acoustic model to the vocal characteristics of a particular dysarthric speaker.
296
4.2
K.T. Mengistu and F. Rudzicz
Acoustic Features
We compare the performance of acoustic models based on Mel-Frequency Cepstral Coefficients (MFCCs), Linear Predictive Coding-based Cepstral Coefficients (LPCCs), and Perceptual Linear Prediction (PLP) coefficients with various feature parameters, including the use of Cepstral Mean Subtraction (CMS) and, the use of the 0th order cepstral coefficient as the energy term instead of the log of the signal energy. The use of CMS was found to be counterproductive in all cases. This is because single-word utterances are very short and CMS is only useful for utterances longer than 2–4 seconds [12]. The recognition performance of the baseline SI monophone models based on MFCC and PLP coefficients with the 0th order cepstral coefficient are comparable (39.94% and 39.5%) while LPCC-based models gave the worst baseline recognition performance of 34.33%. Further comparison on PLP and MFCC features on speaker-adapted systems showed that PLP-based acoustic models outperformed MFCC-based systems by 2.5% absolute. As described in [13], PLP features are more suitable in noisy conditions due to the use of different non-linearity compression; i.e., the cube root instead of the logarithm on the filter-bank output. The data used in these experiments consist of considerable background noise and other type of noise produced by the speakers due to hyper-nasality and breathy voices. These aspects may explain why PLP performed better than MFCCs and LPCCs in these experiments. The rest of the experiments presented in this paper are based on PLP acoustic features. PLP incorporates the known perceptual properties of human hearing, namely critical band frequency resolution, pre-emphasis with an equal loudness curve, and the power law model of hearing. A feature vector containing 13 cepstral components, including the 0th order cepstral coefficient and the corresponding delta and delta-delta coefficients comprising 39 dimensions, is generated every 15 ms for dysarthric speech and every 10 ms for non-dysarthric speech. 4.3
Speaker-Independent Baseline Models
The baseline SI systems consist of 40 left-to-right, 3-state monophone hidden Markov models and one single-state short pause (sp) model with 16 Gaussian mixture components per state. During recognition, the eight words that are used as alternatives for every spoken test utterance during the listening experiments are formulated as an eight-word finite-state grammar which is automatically parsed into the format required by the speech recognizer. The pronunciation lexicon is based on the CMU pronunciation dictionary1 . All ASR experiments are performed using the Hidden Markov Model Toolkit (HTK) [14]. The mean recognition accuracy of the baseline SI monophone models using PLP acoustic features on single-word recognition where eight alternatives are provided for each utterance is 39.5%. The poor performance of the SI models in recognizing dysarthric speech is not surprising since data from each dysarthric speaker deviates considerably from the training data. Word-internal triphone 1
1 http://www.speech.cs.cmu.edu/cgi-bin/cmudict
models show little improvement over the baseline monophone models for the dysarthric data in our database. Hence, we use the monophone models as our baseline in the rest of the experiments.
4.4 Acoustic and Lexical Model Adaptation
To improve recognition accuracy, the SI models are tailored to the vocal characteristics of each dysarthric subject. Here we use a 3-level cascaded adaptation procedure. First we use maximum likelihood linear regression (MLLR) adaptation followed by maximum a posteriori (MAP) estimation to adapt each SI model to the vocal characteristics of a particular dysarthric subject. We then analyze the pronunciation deviations of each dysarthric subject from the canonical form and build an associated speaker-specific pronunciation lexicon that incorporates their particular pronunciation behavior. Using the adaptation data from a particular speaker, we perform a two-pass MLLR adaptation. First, a global adaptation is performed, which is then used as an input transformation to compute more specific transforms using a regression class tree with 42 terminals. We then carry out 2 to 5 consecutive iterations of MAP adaptation, using the models that have been transformed by MLLR as the priors and maximizing the posterior probability using prior knowledge about the model parameter distribution. This process resulted in a 25.81% absolute (43.07% relative) improvement. Using speaker-dependent (SD) pronunciation lexicons, constructed as described in [15], during recognition improved the word recognition rate further by an average of 3.18% absolute (8.64% relative). The SD pronunciation lexicons consist of multiple pronunciations for some words that reflect the particular pronunciation pattern of each dysarthric subject. In particular, we listened to 25% of the speech data from each dysarthric subject and carefully analyzed the pronunciation deviations of each subject from the norm; i.e., the desired phoneme sequence, as determined by the CMU pronunciation dictionary, was compared against the actual phoneme sequences observed, and the deviations were recorded. These deviant pronunciations were then encoded into the generic pronunciation lexicon as alternatives to existing pronunciations [15]. Figure 3 depicts the performance of the baseline and speaker-adapted (SA) models on dysarthric speech. In total, the cascaded approach of acoustic and lexical adaptation improved the recognition accuracy significantly by 28.99% absolute (47.94% relative) over the baseline, yielding a mean word recognition accuracy of 68.39%. For non-dysarthric speech, the mean word recognition accuracy of the SI baseline monophone models is 71.13%. After acoustic model adaptation, the mean word recognition accuracy rises to 88.55%.
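As a rough illustration of the MAP step in this cascade, the sketch below interpolates each Gaussian mean between its MLLR-transformed prior and the adaptation-data statistics. The occupancy inputs, the value of the prior weight tau, and the restriction to mean updates are simplifying assumptions of the sketch, not the exact HTK procedure (which also updates variances and uses a regression-class tree).

```python
# Simplified MAP mean update used after MLLR: each mean is pulled from its
# (MLLR-transformed) prior toward the adaptation data, weighted by occupancy.
import numpy as np

def map_update_means(prior_means, frames, gammas, tau=10.0):
    """prior_means: (M, D) MLLR-adapted means; frames: (T, D) adaptation frames;
    gammas: (T, M) per-frame Gaussian occupation probabilities; tau: MAP weight."""
    occ = gammas.sum(axis=0)                # (M,) total occupancy per Gaussian
    weighted = gammas.T @ frames            # (M, D) occupancy-weighted frame sums
    return (tau * prior_means + weighted) / (tau + occ)[:, None]
```

Repeating this update for 2 to 5 iterations, with the occupancies re-estimated each time, mirrors the consecutive MAP passes described above.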
5 Discussion of Results
Fig. 3. ASR performance on dysarthric speech

When we compare the performance of the speaker-adapted ASR systems with the intelligibility rating of the human listeners on dysarthric speech, we observe that in most cases human listeners are more effective at recognizing dysarthric speech. However, an ASR system recognized more stimuli produced by speaker F01 than the human listeners. Figure 4 summarizes the results.
Fig. 4. Human listeners vs. ASR system recognition scores on dysarthric speech
Humans are typically robust at speech recognition even at very low signal-to-noise ratios [16]. This may partially explain their relatively high performance here. Dysarthric speech contains not only distorted acoustic information due to imprecise articulation but also undesirable acoustic noise due to improper breathing that severely degrades ASR performance. Due to the remarkable ability of human listeners to separate and pay selective attention to the different sound sources in a noisy environment [17], the acoustic noise due to improper breathing has less impact on human listeners than on ASR systems. For instance, the audible noise produced by breathy voices and hyper-nasality is strong enough to confuse ASR systems, while human listeners can easily ignore it. This suggests that noise resilience is an area that should be investigated further to improve ASR performance on dysarthric speech. Furthermore, approaches to deal with other features of dysarthric speech such as stuttering, prosodic
disruptions, and inappropriate intra-word pauses are areas for further investigation in order to build an ASR system whose performance is comparable to that of human listeners in recognizing dysarthric speech. Although there appears to be some relationship between intelligibility ratings and ASR performance, the latter is especially affected by the level of background noise and the involuntary noise produced by the dysarthric speakers. The impact of hyper-nasality and breathy voice appears to be more severe for ASR systems than for the intelligibility ratings among human listeners on single-word utterances. F01, for instance, is severely dysarthric, but the ASR performs better than the human listeners because most of the errors in her speech could be offset by acoustic and lexical adaptation. M04, on the other hand, who is also severely dysarthric, was relatively more intelligible but was the least well understood by the corresponding speaker-adapted ASR system, since this speaker is characterized by breathy voice, prosodic disruptions, and stuttering.
6 Concluding Remarks
In this paper we compared naïve human listeners and speaker-adapted automatic speech recognition systems in recognizing dysarthric speech. Since intelligibility may vary widely depending on the type of stimuli and response format used, our basis of comparison is designed so that the human listeners and the ASR systems are compared on an equal footing. Here, we use a multiple-choice format from a closed set of eight alternatives, where the same set of alternatives is provided for every single-word utterance to both the ASR systems and the human listeners. Although there is one case in which a speaker-adapted ASR system performed better than the aggregate of human listeners, in most cases the human listeners are more effective in recognizing dysarthric speech than ASR systems. However, the mean word recognition accuracy of the speaker-adapted ASR systems (68.39%) relative to the baseline of 39.5% is encouraging. Future work ought to concentrate on improved methods to deal with breathy voice, stuttering, prosodic disruptions, and inappropriate pauses in dysarthric speech to further improve ASR performance.

Acknowledgments. This research project is funded by the Natural Sciences and Engineering Research Council of Canada and the University of Toronto.
References
1. Yorkston, K.M., Beukelman, D.R., Bell, K.R.: Clinical Management of Dysarthric Speakers. Little, Brown and Company (Inc.), Boston (1988)
2. Carlson, G.S., Bernstein, J.: Speech recognition of impaired speech. In: Proceedings of RESNA 10th Annual Conference, pp. 103–105 (1987)
3. Stevens, G., Bernstein, J.: Intelligibility and machine recognition of deaf speech. In: Proceedings of RESNA 8th Annual Conference, pp. 308–310 (1985)
4. Sharma, H.V., Hasegawa-Johnson, M., Gunderson, J., Perlman, A.: Universal access: speech recognition for talkers with spastic dysarthria. In: Proceedings of INTERSPEECH 2009, pp. 1451–1454 (2009) 5. Kent, R.D., Weismer, G., Kent, J.F., Rosenbek, J.C.: Toward phonetic intelligibility testing in dysarthria. Journal of Speech and Hearing Disorders 54, 482–499 (1989) 6. Yorkston, K.M., Beukelman, D.R.: A comparison of techniques for measuring intelligibility of dysarthric speech. Journal of Communication Disorders 11, 499–512 (1978) 7. Goulden, R., Nation, P., Read, J.: How large can a receptive vocabulary be? Applied Linguistics 11, 341–363 (1990) 8. Rudzicz, F., Namasivayam, A., Wolff, T.: The TORGO database of acoustic and articulatory speech from speakers with dysarthria. Language Resources and Evaluation (2011) (in press) 9. Enderby, P.: Frenchay Dysarthria Assessment. International Journal of Language & Communication Disorders 15(3), 165–173 (1980) 10. Yorkston, K.M., Beukelman, D.R.: Assessment of Intelligibility of Dysarthric Speech. C.C. Publications Inc., Tigard (1981) 11. Zue, V., Seneff, S., Glass, J.R.: Speech database development at MIT: TIMIT and beyond. Speech Communication 9(4), 351–356 (1990) 12. Alsteris, L.D., Paliwal, K.K.: Evaluation of the Modified Group Delay Feature for Isolated Word Recognition. In: Proceedings of International Symposium on Signal Processing and Applications, pp. 715–718 (2005) 13. Hermansky, H.: Perceptual Linear Predictive (PLP) Analysis of Speech. Journal of the Acoustical Society of America 87(4), 1738–1752 (1990) 14. Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X.A., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., Woodland, P.: The HTK Book. Revised for HTK Version 3.4, Cambridge University Engineering Department (2006) 15. Mengistu, K.T., Rudzicz, F.: Adapting Acoustic and Lexical Models to Dysarthric Speech. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing (2011) (in press) 16. Lippmann, R.: Speech recognition by machines and humans. Speech Communication 22(1), 1–15 (1997) 17. Bregman, A.S.: Auditory Scene Analysis: The Perceptual Organization of Sound. MIT Press, Cambridge (1990)
A Context-Aware Reputation-Based Model of Trust for Open Multi-agent Environments

Ehsan Mokhtari1, Zeinab Noorian1, Behrouz Tork Ladani2, and Mohammad Ali Nematbakhsh2

1 University of New Brunswick, Canada
2 University of Isfahan, Iran
{ehsan.mokhtari,z.noorian}@unb.ca, {ladani,nematbakhsh}@eng.ui.ac.ir
Abstract. In this paper we propose a context-aware reputation-based trust model for multi-agent environments. Due to the lack of a general method for recognizing and representing the notion of context, we propose a functional ontology of context for evaluating trust (FOCET) as the building block of our model. In addition, a computational reputation-based trust model based on this ontology is developed. Our model benefits from powerful reasoning facilities and the capability of adjusting the effect of context on trust assessment. Simulation results show that an appropriate context weight results in the enhancement of the total profit in open systems.
1 Introduction
A wide range of open distributed systems, including e-business, peer-to-peer systems, web services, pervasive computing environments and the semantic web, are built in open, uncertain environments. The building blocks for constructing these systems are autonomous agents that act and interact flexibly and intelligently. In the absence of legal enforcement procedures in these environments, trust is of central importance in establishing mutual understanding and confidence among participants. There are different approaches to trust modeling, including socio-cognitive, game-theoretical, security-oriented, modal-logic and other operational approaches [10]. The constituent element for evaluating trust in real and virtual societies is reputation. Reputation refers to a perception that an agent has of others' intentions and norms [9]. In large open multi-agent systems where interactions are infrequent, it is not always possible to evaluate the trustworthiness of peers based only on direct experiences. Thereby, the social dimension of agents is developed to gather reputation information from other members of the society. Trust and reputation are context-dependent notions [9],[16]. That is, satisfactory interaction outcomes in a particular context would not necessarily assure high-quality interaction results with the same transaction partner in different contexts [10]. Nevertheless, most existing trust models have neglected this issue and evaluate trust regardless of the negotiated contexts.
Many efforts have been made to model the notion of context with different approaches. The main approaches to context modeling can be categorized as Key-Value, Markup, Graphical, Object-Oriented, Logic-Based and Ontology-Based modeling [15]. Strang et al. [15] have shown that the most promising assets for context modeling can be found in the ontology category. Several models which claim to use context as one of their elements in trust evaluation have been developed [13],[2],[9],[8],[7]. However, the existing context-aware trust models suffer from the lack of a functional and applicable method for context recognition and representation. Moreover, they fail to consider reputation values together with the contexts in which those values were earned. In this paper we propose a context-aware reputation-based trust model for multi-agent environments to address such deficiencies. We propose a functional ontology of context for evaluating trust named FOCET. It provides agents with an extensible ontology to adopt different context elements with different importance levels pertaining to their subjective requirements and environmental conditions. Based on these principles, a computational model of trust is developed which aggregates several parameters to derive the trustworthiness of participants. We begin with a description of related work in context modeling and context-aware trust evaluation. Subsequently, we provide a detailed presentation of the proposed ontology of context and discuss the relevant context-reasoning issues. In the following sections, we describe the constituent elements of our trust and reputation management mechanisms and then present the corresponding computational trust model. The evaluation framework and experimental results are discussed in the subsequent sections. Finally, we conclude by explaining some of the open problems and possible future work in this field.
2 Related Works
There is a wide variety of reputation-based trust models in the literature [1],[9],[13],[7]. Reputation mechanisms have been widely used in online electronic commerce systems. Online reputation mechanisms (e.g., those on eBay and Amazon Auctions [6],[14]) are probably the most widely used ones. Reputation in these models is a single global value representing a user's overall trustworthiness. These trust models are not well suited for dynamic environments where providers offer different types of services with different satisfaction degrees. Reputation is clearly a context-dependent quantity [9]. However, only a few trust models consider context as a determinant factor in their trust evaluation. Liu et al. [8] introduced a reputation model that incorporates time and context in the form of the services presented by each entity. Zhu et al. [7] considered environmental influences on agents' behaviors as context. They presented a solution to cope with the fair reputation evaluation problem for agents who are in bad areas. Context is applied by Essin [16] as a sub-factor in determining the action valuation and the subject's stake in the action. Blaze et al. [2] proposed a system named PolicyMaker in which the set of local policies is assumed as the context under which trust is evaluated. Ray et al. [13] present a formalized graph-based context model
for evaluating trust. They simply describe the context as a set of keywords and propose a new data structure named the context graph for context representation. Wang et al. [18] proposed a context-aware computational trust model for multi-agent systems. They consider m types of context information and exploit a Bayesian network for trust estimation. However, features like the influence of reputation in one context on similar contexts are missed. Since the most promising assets for context modeling can be found in ontological methods [15], the approaches of [13] and [18] lack an efficient presentation of context and reasoning capabilities. Our work differs from others in a number of ways. We present an ontological context model which is built on an extensible core ontology description called FOCET. Using an ontology as the cornerstone of the context model, reasoning capabilities can be comprehensively applied to context data. Also, the proposed model provides means for agents to subjectively use different context features. Further, in this model agents are able to assign different weights to their selected context features in order to adjust the effect of context on trust evaluation under different environmental circumstances.
3 Ontology of Context
The notion of context has been defined by different authors [15]. Dey [4] has defined context within a comprehensive approach. He defines context as any information that is required to characterize the situation of an entity. An entity could be a person, place, or object that is considered relevant to the interaction between a user and an application, including the user and the application themselves [4]. In the proposed model, we have derived effective input factors for context recognition and representation based on Dey's [4] definition of context. Inspired by SOUPA [3], we introduce the main key concepts that influence context in trust evaluation and represent them in a structured form through the ontology. That is, we have developed the Functional Ontology of Context for Evaluating Trust (FOCET), which represents the core ontology for context in trust applications. FOCET contains eight main categories: Environment, Culture, Spatial Factors, Temporal Factors, History, Subject, User Profile and Policy (Figure 1). These main categories can be elaborated by adding sub-ontologies as their extensions [3]. We briefly describe each category of the FOCET core in the following:
Fig. 1. Representation of FOCET dimensions
Environment: An environment provides the conditions under which agents exist [11]. In other words, the environment defines the world states and properties in which the participants operate. For example, in this paper we consider three different types of environments: 1) high-risk, 2) low-risk and 3) mid-risk. A detailed description is given in Section 7.

Culture: Culture includes the natural, essential, inherited and permanent features of an agent. Agents having different cultures may have different priorities and preferences toward the same service. Thus, various aspects of culture such as language, nationality and morality can be integrated in the Culture dimension of FOCET for the trust evaluation process.

Spatial factors: Context is dependent on the agent's space and location. In order to identify the agent's spatial properties, characteristics such as location and vicinity are considered as spatial factors. This dimension may be beneficial in particular conditions in which agents prefer to communicate mostly with participants in a certain vicinity.

Temporal factors: Time is a vital aspect for the human understanding and classification of context because most statements are related over the temporal dimension [21]. Features like an agent's age, life cycle and its communication time are considered as temporal factors. For example, some participants may consider older agents more trustworthy than younger ones.

History: Each participant maintains its previous interaction records and observations in the History dimension of FOCET. This might include environment dynamics, population tendency and interaction outcomes [10].

Subject: This feature addresses diverse aspects of the transaction criteria. It comprises the detailed descriptions of the providers' identity, interaction contents and utilities.

Policy: A policy is a deliberate plan of actions to guide decisions and achieve rational outcomes. Policy can guide actions toward those that are most likely to achieve a desired outcome. A policy may consist of security, setup, communication, relationship and event-handling policies. As an example, the trust threshold to commit a transaction could be adaptively adjusted based on observation of the environmental circumstances.

Many concepts of the context ontology are semantically interrelated. We propose a set of public rules for FOCET based on common existing knowledge about the concepts. These rules will be integrated with the ontology in order to be available to all agents in the society. Using public rules, agents will benefit from an automatic reasoning process to refine and complement initial context data. For example, if provider P, which offers a pickup&delivery service, is located in Fredericton, Canada, we could deduce that the timezone for this provider is 4:00 hours behind GMT. Also, P is restricted by federal and provincial business laws; hence, it would not be able to commit to the delivery of certain goods which are prohibited by the government of New Brunswick or the federal government of Canada. Figure 2 exhibits sample public rules written in CLIPS. Aside from the public rules, each agent can use its private knowledge and reasoning capability to improve its decision-making process. This private knowledge can be presented by
a set of private inference rules which the agent applies to FOCET. Each agent will define its own private rules based on its own knowledge using a predefined framework. For instance, one can infer for the same provider P that cross-continent delivery services would take longer using this provider than a provider located in Toronto, due to the fact that there are no direct flights to most of the main cities in the world from the Fredericton airport.

Fig. 2. Sample public inference rule in FOCET
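The toy sketch below mimics the effect of such a public rule (the paper's actual rules are written in CLIPS and attached to the ontology, as in Fig. 2); the fact keys and the timezone table are assumptions made for illustration only.

```python
# Toy illustration of FOCET-style public rules: forward-chaining new facts
# (timezone, jurisdiction) from an agent's raw context data.
TIMEZONES = {("Canada", "Fredericton"): "GMT-4"}

def apply_public_rules(context):
    """Derive additional facts from a dict of context facts."""
    derived = dict(context)
    loc = (context.get("country"), context.get("city"))
    if loc in TIMEZONES and "timezone" not in derived:
        derived["timezone"] = TIMEZONES[loc]
    if context.get("country") == "Canada":
        # Providers are bound by federal and provincial trade regulations.
        derived.setdefault("jurisdiction", ["federal", "provincial"])
    return derived

facts = {"provider": "P", "service": "pickup&delivery",
         "country": "Canada", "city": "Fredericton"}
print(apply_public_rules(facts))
```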
4 Reputation Component
In this model, reputation can be categorized according to information sources [10],[6]. Direct reputation refers to previous direct experiences with particular agents, and indirect reputation refers to information about the target agent that is received from other sources of information, such as advisors or trusted third parties. Reputation is clearly a context-dependent concept [9],[10]. That is, the high reputation of a provider for a particular service does not necessarily cascade to the other services it offers. For example, provider P's high reputation in an inventory service should not affect its reputation in a delivery service. However, in most models [5],[20],[17],[19] reputation information is communicated regardless of the negotiated contexts. In these models, a consumer agent evaluates the influence degree of recommending agents simply based on their deviations from its own opinions. The proposed context-aware trust model takes a different approach and provides a consumer agent with a mechanism to examine different aspects of the recommending agents' negotiated contexts and to evaluate their influence degree based upon their degree of similarity and relevancy to the prospective context of the future transaction. In this model we assume the recommending agents are honest, but they might have different influence degrees in different transaction contexts. The reputation component employs FOCET as the constitutional element for the representation, transmission, storage and retrieval of context information. It incorporates every reputation value within a set of context data presented by FOCET features.
5 Trust Component
As aforementioned, each provider may offer different kinds of services in different contexts. These contexts might be totally different or have some features in common. To measure the effect of a particular context on another, we define two individual metrics: 1) the Weight Matrix (WM), which includes the importance level of each dimension of context, and 2) the Relevancy Matrix (RM), which indicates the similarity degree of each feature in the first context with the corresponding one in the second context.
5.1 Weight Matrix
The context dimensions in FOCET might have different importance levels depending upon the application's requirements. For example, user profile and policy concepts are of central importance in an e-auction system, while cultural characteristics may not be applicable at all. Therefore, we define the WM to handle this issue. WM is a 1 × n matrix, where n is the number of FOCET concepts and β ∈ [0, 1] refers to the importance degree of the corresponding concept:

WM = [β1, β2, ..., βn]    (1)
5.2 Relevancy Matrix
Aside from the heterogeneity problem of participants, even homogeneous participants who contextually model reputation information might use different terminologies and conventions to represent context concepts. This issue may result in divergent perceptions of the same concept. For example, one may use delivery to describe a particular service while another uses shipping instead. Although both services express the same concept, the terms standing for them are different. To rectify this, we introduce a Relevancy Matrix (RM) to measure the similarity degree of a particular context's features with the corresponding ones in another context. To calculate RM, we exploit the WordNet [12] API to measure the conceptual and semantic relation between different elements. RM is an n × 1 matrix, where n is the number of FOCET features and υ ∈ [0, 1] signifies the similarity level (Equation 2):

RM = [υ1, υ2, ..., υn]    (2)
5.3 Context Effect Factor
Given the WM and RM matrices, we can measure the influence of participants' experiences in different contexts on the prospective transaction context Ecxt with provider P. We call this measure the Context Effect Factor CEF(Par,Hcxt,Ecxt,P), where Par ∈ C ∪ R denotes a particular participant in the community and Hcxt refers to a typical previous context in which the participant has negotiated with P. It can be computed as follows:

CEF_{(Par,Hcxt,Ecxt,P)} = WM \times RM = \frac{\sum_{i=1}^{n} \left[ (1 - \beta_i) + \beta_i \times \upsilon_i \right]}{n}    (3)
Here, n is the number of realized features in FOCET; therefore, CEF(Par,Hcxt,Ecxt,P) would be a scalar value in the range of [0, 1]. To clarify, suppose that consumer C intends to interact with provider P in context T. C would calculate the influence degree of each previous interaction context H experienced with P against the context T as CEF(C,H,T,P).
5.4 Trust Dynamics
Evidently, trust information might lose its credibility as time progresses. This is because transaction peers might change their behaviors over time. In such cases, despite the honesty of recommending agents in providing reputation information, their information might not be credible. Thus, in order to capture the risk of dynamicity in agent behavior, we should consider the recent information of participants more important than the old. In so doing, consumer agents subjectively specify a degree of decay λ, 0 ≤ λ ≤ 1, based on their policies, in order to reduce the influence of old reputation information adaptively. We formulate a time-dependent influence degree of participants, CEF′(Par,Hcxt,Ecxt,P), as follows:

CEF′(Par,Hcxt,Ecxt,P) = e^(−λΔt) × CEF(Par,Hcxt,Ecxt,P)    (4)
Where Δt indicates the elapsed time since the previous interaction took place, which can be determined from the Temporal Factors concept in FOCET.
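A small sketch of Equations (3) and (4) follows; the example weights, relevancies, decay rate and elapsed time are invented values used only to exercise the formulas.

```python
# Sketch of Eq. (3) (context effect factor) and Eq. (4) (time decay).
# wm holds the importance weights beta_i, rm the relevancy values v_i.
import numpy as np

def cef(wm, rm):
    wm, rm = np.asarray(wm, float), np.asarray(rm, float)
    return np.mean((1.0 - wm) + wm * rm)            # Eq. (3)

def cef_decayed(wm, rm, delta_t, lam=0.1):
    return np.exp(-lam * delta_t) * cef(wm, rm)     # Eq. (4)

wm = [0.9, 0.5, 0.0, 0.7]      # importance of four realized FOCET features
rm = [1.0, 0.6, 0.3, 0.8]      # similarity to the prospective context
print(cef(wm, rm), cef_decayed(wm, rm, delta_t=30))
```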
6 Computational Model
In the proposed context-aware trust model, we define two individual metrics to evaluate the trustworthiness of potential transaction partners: 1) Direct Trust DT(C,P,Ecxt), which is merely based on consumer C's direct experiences with provider P in context Ecxt, and 2) Indirect Trust IT(R,P,Ecxt), which derives trust based on the reputation reports of the recommending agents R. We can formalize DT(C,P,Ecxt) as follows:

DT_{(C,P,Ecxt)} = \frac{\sum_{Hcxt_i \in (C,P)} CEF'_{(C,Hcxt_i,Ecxt,P)} \times v_i}{n \times |(C,P)|}    (5)
Where (C,P) is the collection of previous contexts Hcxt of consumer C with provider P and v_i represents the rating value of Hcxt_i. Also, n indicates the number of context elements in FOCET. In addition, IT(R,P,Ecxt) can be formulated as:

IT_{(R,P,Ecxt)} = \frac{\sum_{R_j \in R} \sum_{Hcxt_i \in (R_j,P)} CEF'_{(R_j,Hcxt_i,Ecxt,P)} \times v_i}{n \times |R| \times \sum_{j=1}^{|R|} |(R_j,P)|}    (6)
Where R is the set of recommender agents which provide indirect reputation data.
Consumer agents may assign different significance levels ω to the DT(C,P,Ecxt) and IT(R,P,Ecxt) components based upon their policies. Therefore, a linear weighted combination of these values is exploited to build the final trust value. Thus, the overall trust τ(C,P,Ecxt) can be calculated as:

τ(C,P,Ecxt) = ω × DT(C,P,Ecxt) + (1 − ω) × IT(R,P,Ecxt)    (7)
(7)
Simulation Setting
We have implemented a multi-agent environment consisting of 50 service providers, each of which provides 50 randomly selected services from a service pool containing 100 services. The providers deliver a high quality of service for a number of their services, depending on the environmental circumstances. Also, they might change their behavior by presenting various quality-of-service rates for the same service during the simulation. There are 100 agents who are able to act both as consumers and as advisors. The simulation is run for 500 days and the system is monitored during this period. Each consumer C is able to initiate one business transaction with a provider P in context H per day. The trustworthiness of P is calculated by C using its past direct reputation records of P regarding H, along with the indirect reputation data gathered from the advisors R about P regarding context H. C will commit a transaction if the calculated trust level for P exceeds the expected trust threshold of C. This trust threshold is subjectively determined based on the policy dimension of FOCET exploited by each consumer. C will achieve a gain proportionate to the value of the committed transaction if the provider satisfies the expected quality of service; otherwise, the consumer will suffer the same amount of loss. In order to examine the efficiency of our approach, we consider three different types of environments: 1) low-risk, 2) mid-risk and 3) high-risk environments. In the low-risk environment, the majority of providers offer a satisfactory quality of service and the values of the transactions are low. Therefore, failure in delivering the expected quality of service would not result in a significant loss for consumers. In such an environment, consumers have a low trust threshold for committing transactions with providers. That is, they will initiate business interactions with providers with a minimum level of confidence about their quality of service. On the other hand, since the high-risk environment is mostly populated by low-quality service providers, the chance of consumers dealing with unqualified providers increases substantially. In this environment, the values of the services offered by providers are mostly high; thus, a failure in delivering the expected quality of service will result in a significant loss for consumers. This inherent characteristic of the high-risk environment requires consumers to have a high trust threshold for initiating transactions with providers. In the mid-risk environment, high-quality and low-quality service providers are almost uniformly distributed. Business transactions in this environment usually have average values as well.
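The decision step of this simulation can be sketched as follows; the threshold values per environment and the gain/loss bookkeeping are assumptions of the sketch, since the paper derives the actual threshold from the Policy dimension of FOCET.

```python
# Schematic consumer decision step: commit the transaction only when the
# estimated trust exceeds the environment-dependent threshold.
THRESHOLDS = {"low-risk": 0.3, "mid-risk": 0.5, "high-risk": 0.7}

def transact(trust, value, provider_delivers, environment):
    if trust < THRESHOLDS[environment]:
        return 0.0                                   # transaction not committed
    return value if provider_delivers else -value    # gain, or an equal loss

profit = transact(trust=0.72, value=100.0, provider_delivers=True,
                  environment="high-risk")
print(profit)
```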
8 Experimental Results and Analysis
In this section we analyze the effect of context in different types of environments. We examine the functionality of this approach when consumers have different preferences in using context in their evaluations.
8.1 Low-Risk Environment
As can be observed in Figure 3, increasing the context weight in low-risk environments results in a decrease in the gain of consumers. Since most of the providers offer high-quality services in these environments, most of the transactions lead to a gain. To put this in perspective, suppose that provider P, which usually offers a high quality of service, has not presented satisfactory service in just a few transactions of its delivery service to Europe at a certain time, due to particular reasons such as inclement weather. However, in a low-risk environment P would probably provide a high-quality delivery service in other contexts, such as delivery to North America. When the context has less influence in trust evaluation, consumer C propagates P's good reputation in other contexts to this one. Therefore, having just a few bad reputation records of P would not prevent C from carrying out further delivery transactions with P. However, as the context weight increases in C's trust evaluation process, it would abort future delivery transactions, which would probably be associated with a considerable gain, just because of P's few bad reputation records in this context. As a result, the total number of committed transactions and their associated gain would be reduced.
Fig. 3. The influence of different context weight in a low-risk environment
8.2 High-Risk Environment
In a high-risk environment, as the context weight increases, the propagation of a provider's reputation from one context to another decreases proportionately. That is, consumers with high context weights rely only on the reputation information obtained from similar contexts in their trustworthiness evaluation. As such, future transactions in a specific context H with a particular provider who has a bad reputation in other contexts fairly similar to H would be avoided. This feature shows its value specifically in a high-risk environment, where the transaction value is high and failures to deliver a high quality of service are
associated with a huge loss. Through the analytical approach of the context-aware model, consumers with reasonable context weights can cautiously avoid interacting with unqualified providers and increase their total profit substantially (Figure 4).
Fig. 4. The influence of different context weights in a high-risk environment
8.3 Mid-Risk Environment
Figure 5 demonstrates the effect of the context weight in a mid-risk environment. In this environment some providers present high-quality services while the others provide low-quality ones. The effect of context in mid-risk environments is highly dependent on the ratio of high- to low-quality service providers. That is, if high-quality providers are dominant in the environment, a consumer with a low context weight may benefit from more successful interaction results compared with a consumer whose context influence is high. On the contrary, the basis of context-aware trust evaluation enables consumers with a high context weight to achieve large gains in this environment when low-quality providers take over the community.
Fig. 5. The influence of different context weights in a mid-risk environment
9 Conclusion and Future Work
In this paper we presented a context-aware reputation-based trust model for multi-agent systems. This model benefits from a functional ontology of context for evaluating trust (FOCET) along with a computational model of trust based on that. FOCET is an extensible core ontology for context recognition and representation which contains eight main categories: Environment, Culture, Spatial
Factors, Temporal Factors, History, Subject, User Profile and Policy. Exploiting proper public and private inference rules in FOCET, we are able to complement raw context data by adding facts deduced from existing known ones. Using the weight and relevancy matrices, we are able to scale (up or down) the effect of context in evaluating trust. There are a number of future directions in which we could extend our model. Uncertainty is an important factor which is usually discussed in context and trust modeling. Uncertainty can emanate from information acquired from agents and their environments as well as from the ontology inference rules. We intend to extend our model to support uncertainty in context reasoning and trust evaluation. Furthermore, since there is no authority in open environments to force agents to perform honestly, they should be provided with a means to form a trust network of trusted peers. Using such a trust network, the indirect reputation data would be more reliable. Experimental results demonstrate that various context weights have different effects on the total profit in different types of environments. Building an adaptive context-aware trust model to adjust this optimum context weight would be another milestone. Moreover, incorporating other sources of trust, namely credentials, competence and capability, with reputation and context would result in a comprehensive trust model for a wide range of real-world applications.
References 1. Abdul-Rahman, A., Hailes, S.: A distributed trust model. In: Proceedings of the 1997 Workshop on New Security Paradigms, pp. 48–60. ACM, New York (1997) 2. Blaze, M., Feigenbaum, J., Lacy, J.: Decentralized trust management. In: Proceedings of the 1996 IEEE Symposium on Security and Privacy, SP 1996. IEEE Computer Society, Los Alamitos (1996) 3. Chen, H., Perich, F., Finin, T., Joshi, A.: Soupa: Standard ontology for ubiquitous and pervasive applications. In: International Conference on Mobile and Ubiquitous Systems: Networking and Services, pp. 258–267 (2004) 4. Dey, A.K.: Understanding and using context. Personal Ubiquitous Comput. 5, 4–7 (2001) 5. Shadbolt, N.R., Huynh, T.D., Jennings, N.R.: An integrated trust and reputation model for open multi-agent systems. In: AAMAS, pp. 119–154 (2006) 6. Jsang, A., Ismail, R., Boyd, C.: A survey of trust and reputation systems for online service provision. Decision Support Systems 43(2), 618–644 (2007); Emerging Issues in Collaborative Commerce 7. Lei, Z., Nyang, D., Lee, K., Lim, H.: Computational intelligence and security 8. Liu, J., Issarny, V.: Enhanced reputation mechanism for mobile ad hoc networks. pp. 48–62 (2004) 9. Mui, L.: computational Models of Trust and Reputation: Agents, Evolutionary Games, and Social Networks. PhD thesis, Massachusetts Institute of Technology (2003) 10. Noorian, Z., Ulieru, M.: The state of the art in trust and reputation systems: a framework for comparison. J. Theor. Appl. Electron. Commer. Res. 5, 97–117 (2010)
11. Odell, J.J., Van Dyke Parunak, H., Fleischer, M., Brueckner, S.A.: Modeling Agents and Their Environment. In: Giunchiglia, F., Odell, J.J., Weiss, G. (eds.) AOSE 2002. LNCS, vol. 2585, pp. 16–31. Springer, Heidelberg (2003) 12. Pedersen, T., Patwardhan, S., Michelizzi, J.: Wordnet:similarity: measuring the relatedness of concepts. In: Demonstration Papers at HLT-NAACL 2004, HLT-NAACL–Demonstrations 2004, pp. 38–41. Association for Computational Linguistics (2004) 13. Ray, I., Ray, I., Chakraborty, S.: An interoperable context sensitive model of trust. Journal of Intelligent Information Systems 14. Resnick, P., Zeckhauser, R., Swanson, J., Lockwood, K.: The value of reputation on ebay: A controlled experiment. Experimental Economics, 79–101 (2006) 15. Strang, T., Linnhoff-Popien, C.: A context modeling survey. In: Workshop on Advanced Context Modelling, Reasoning and Management, UbiComp 2004 - The Sixth International Conference on Ubiquitous Computing, Nottingham, England (2004) 16. Viljanen, L.: Towards an Ontology of Trust. In: Katsikas, S.K., L´ opez, J., Pernul, G. (eds.) TrustBus 2005. LNCS, vol. 3592, pp. 175–184. Springer, Heidelberg (2005) 17. Jennings, N.R., Luck, M., Teacy, W.T.L., Patel, J.: Travos: Trust and reputation in the context of inaccurate information sources. Journal of Autonomous Agents and Multi-Agent Systems (2006) 18. Wang, Y., Li, M., Dillon, E., Cui, L.g., Hu, J.j., Liao, L.j.: A context-aware computational trust model for multi-agent systems. In: IEEE International Conference on Networking, Sensing and Control, ICNSC 2008, pp. 1119–1124 (2008) 19. Whitby, A., Josang, A., Indulska, J.: Filtering out unfair ratings in bayesian reputation systems. In: Proceedings of 7th International Workshop on Trust in Agent Societies (2004) 20. Zhang, J., Cohen, R.: Evaluating the trustworthiness of advice about seller agents in e-marketplaces: A personalized approach. Electronic Commerce Research and Applications (2008) 21. Zimmermann, A., Lorenz, A., Oppermann, R.: An Operational Definition of Context. In: Kokinov, B., Richardson, D.C., Roth-Berghofer, T.R., Vieu, L. (eds.) CONTEXT 2007. LNCS (LNAI), vol. 4635, pp. 558–571. Springer, Heidelberg (2007)
Pazesh: A Graph-Based Approach to Increase Readability of Automatic Text Summaries

Nasrin Mostafazadeh1, Seyed Abolghassem Mirroshandel1, Gholamreza Ghassem-Sani1, and Omid Bakhshandeh Babarsad2

1 Computer Engineering Department, Sharif University of Technology, Tehran, Iran
2 Mechatronics Research Laboratory, Computer Engineering Department, Qazvin Azad University, Qazvin, Iran
{mostafazadeh,mirroshandel}@ce.sharif.edu, [email protected], [email protected]
Abstract. Today, research on automatic text summarization focuses on the readability factor as one of the most important aspects of a summarizer's performance. In this paper, we present Pazesh: a language-independent graph-based approach for increasing the readability of summaries while preserving the most important content. Pazesh accomplishes this task by constructing a special path of salient sentences which passes through topic centroid sentences. The results show that Pazesh compares favorably with previously published results on benchmark datasets.
1 Introduction
Research in the automatic text summarization (ATS) area dates back to the late 1950s [1], though solving the problem in a substantial manner still seems to require a long trail of work. Among the variety of approaches to this problem, graph-based methods have noticeably attracted attention. For the first time in ATS history, Salton [2] proposed a graph as a model for the input text. Recent graph-based approaches compute sentence importance based on the concept of eigenvector centrality and apply ranking algorithms (e.g., TextRank [3], LexRank [4]). Today, most summarization research is focused on the extractive genre, which selects a number of sentences out of the initial text. Normally there are many topic shifts in a text, and highly scored sentences can come from diverse important topics, which requires careful selection of the output sentences. Some methods have already been devised to optimize the search problem of finding the best-scoring summary [5,6] and to order text entities based on the chronological order of events [7]. Such methods might construct a sentence-to-sentence coherent body of information, but they neglect preserving the most important content. In this paper, we introduce Pazesh: a new extractive, graph-based, and language-independent approach to address both the readability and informativeness criteria of single-document summaries. At first, Pazesh segments the text
in order to find topic centroid sentences. Then it ranks text entities using its specially constructed graphs, and finally it finds the most valuable path passing through the centroid sentences. The evaluation results show that our algorithm performs well on both the readability and informativeness aspects. The rest of this paper is organized as follows: Section 2 introduces the centroid-finding and graph-construction phases. In Section 3, the main phase of the algorithm, "addressing readability", is presented. Section 4 evaluates Pazesh, and finally Section 5 concludes with the paper's overall idea and focus.
2 Finding Centroid Sentences
Pazesh follows three steps for finding topic centroid sentences: segmenting the text into coherent partitions, scoring the sentences of each segment, and scoring the segments individually. A topic is what a discourse segment or a sentence is about. A text can be segmented by its different topics, as denoted by its sentences. In Pazesh, we utilize a segmentation algorithm to find the topic sentences to be used as landmarks of the final coherent path. Here we have used a simple partitioning approach, the TextTiling [8] algorithm, which is a method for partitioning a text into a number of coherent multi-paragraph partitions representing the text's subtopics. This algorithm does not rely on semantic relationships between concepts for discovering subtopic structures, so it is language-independent [8] and highly suitable for Pazesh. After the segmentation phase, we construct a weighted undirected graph for each segment. The sentences within a segment are the graph's nodes, and the similarities between sentences are the graph's weighted edges. We weight word w of segment S, denoted W(w,S), by freq(w,S)*IDF(w), where freq(w,S) is the number of sentences in segment S containing w, and IDF, the inverse document frequency of word w, equals log(#all sentences / #sentences containing w). Then the similarity between sentences of each segment can be measured using the cosine similarity formula as follows:

edge(S_1, S_2) = \frac{\sum_{w \in S_1 \cap S_2} W(w,S)^2}{\sqrt{\sum_{w \in S_1} W(w,S)^2} \times \sqrt{\sum_{w \in S_2} W(w,S)^2}}    (1)

where S is the segment containing S1 and S2. Now that we have the segment graphs, we use the PageRank [9] scoring function for scoring the nodes of these graphs. The use of ranking algorithms in ATS was first introduced in TextRank. PageRank does not require deep linguistic knowledge and is highly suitable for Pazesh. The suggested PR(V) score of vertex V_i is computed as follows:

PR(V_i) = 0.25 + 0.25 \times \sum_{V_j \in Ln(V_i)} \frac{w_{ij}}{\sum_{V_k \in Ln(V_j)} w_{jk}} \times PR(V_j)    (2)
Where Ln(V) denotes the links of node V. We call the above score the segment-global score of each sentence. We define the centroid sentence of each segment as its representative and most salient entity, having the highest segment-global score.
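A compact sketch of this segment-level scoring is shown below. It approximates Equation (1) by computing IDF over the segment's own sentences and omitting the frequency term, and it uses networkx's PageRank, whose damping constant differs from the constants in Equation (2); both are simplifications for illustration.

```python
# Sketch: IDF-weighted cosine similarity between sentences and a weighted
# PageRank over the resulting segment graph.
import math
from collections import Counter
import networkx as nx

def idf_weights(sentences):
    n = len(sentences)
    df = Counter(w for s in sentences for w in set(s.lower().split()))
    return {w: math.log(n / df[w]) for w in df}

def cosine(s1, s2, w):
    a, b = set(s1.lower().split()), set(s2.lower().split())
    num = sum(w[t] ** 2 for t in a & b)
    den = (math.sqrt(sum(w[t] ** 2 for t in a))
           * math.sqrt(sum(w[t] ** 2 for t in b)))
    return num / den if den else 0.0

def score_segment(sentences):
    w = idf_weights(sentences)
    g = nx.Graph()
    g.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            sim = cosine(sentences[i], sentences[j], w)
            if sim > 0:
                g.add_edge(i, j, weight=sim)
    return nx.pagerank(g, weight="weight")   # segment-global scores per sentence
```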
In Pazesh we achieve both non-redundancy and noise removal by constructing and ranking another graph, the 'segment graph', which has segments as its nodes and segment similarities as its edges. The final graph of the whole text is a directed weighted graph with the sentences as its nodes and sentence similarities as its edges. This graph is constructed to compute the document-global importance of each sentence. The same cosine similarity function is calculated here for all sentences of the text. We apply the PageRank algorithm to this final graph and generate each sentence's document-global score. To tune this final graph for the path-finding phase, we make it a directed acyclic graph (DAG). The direction of each edge is from a predecessor sentence in the initial text to any successor sentence. Also, we tag all edges weighted below a certain threshold γ as shallow edges, and the remaining edges are tagged as deep edges. The final graph looks like Fig. 1.
Fig. 1. A sample final graph. Nodes in gray are centroids of the corresponding segment.
3 Addressing Readability
Text readability is a measure of how well and easily a text conveys its intended meaning. In the Document Understanding Conference (DUC), the linguistic quality markers defined to evaluate the readability aspects of summaries are: Grammaticality, Non-Redundancy, Referential Clarity, Focus, and Structure and Coherence. In Pazesh, we address these readability criteria as follows. By being committed to the chronological order of the initial document and its structure (as a grammatically accepted text), we can avoid fundamental grammatical errors, though being extractive undesirably leads the output to still contain some errors. We address non-redundancy and focus by filtering centroid sentences and also by retaining the informativeness of the output summary. However, since in Pazesh no alterations are made to the input text's sentences, repeated use of nouns or noun phrases is probable. Also, implementing Pazesh's idea for addressing referential clarity has been left as future work. Finally, cohesion can be defined as the links that hold the sentences of a text together and give the whole text a clear meaning. Here, we use this term to denote the lexical relationship between sentences.
In the case of our final graph, in order to have a cohesive output, each sentence should be followed by another sentence that is already linked to it. Therefore, the final cohesive text can be regarded as a path. Pazesh's path is built using deep edges (introduced earlier) to guarantee the readability of the path. Since a topic shift from one centroid to another is sensible, we accept shallow edges only as connectors between different segments. The path obtained from the final graph should pass through all centroid sentences (the landmarks) and have the highest accumulative sentence score. Passing through centroids has two outcomes: firstly, the centroids, as the most salient sentences, are guaranteed to be included in the summary; secondly, the remaining sentences of the summary connect centroids, so they come from prominent sub-topic segments and are important themselves. In Fig. 2 the path-finding method is applied to the graph of Fig. 1 and the output paths are depicted in gray.
Fig. 2. Path-Finding phase of Pazesh: Compression ratio= 0.4. Connecting two centroids is possible through 3 different Paths. Each path has a different accumulative sentences score/edge weight. Note that the path can go beyond the last centroid in order to meet the summary length constraint. In scenario Paz1, Path1 would be the output. In scenario Paz2, path1 and path3 have identical scores and one of them would be selected.
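The sketch below captures the Paz1 selection criterion in its simplest form: among chronologically ordered sentence subsets that contain every centroid, choose the one with the highest accumulated score. The deep/shallow edge constraints and the brute-force search are simplifications of the actual path-finding phase.

```python
# Sketch of centroid-constrained path selection (scenario Paz1).
from itertools import combinations

def best_path(scores, centroids, length):
    """scores: per-sentence document-global scores (in text order);
    centroids: indices that must appear; length: number of sentences to keep."""
    centroids = sorted(centroids)
    others = [i for i in range(len(scores)) if i not in centroids]
    best, best_score = None, float("-inf")
    for extra in combinations(others, max(0, length - len(centroids))):
        path = sorted(set(centroids) | set(extra))   # chronological order
        total = sum(scores[i] for i in path)
        if total > best_score:
            best, best_score = path, total
    return best

print(best_path([0.2, 0.9, 0.4, 0.7, 0.3, 0.8], centroids=[1, 4], length=4))
```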
4 Evaluation
To evaluate the system, we used two distinct evaluation methods: an automatic evaluation with the ROUGE toolkit1 and a manual readability evaluation based on the DUC readability assessment guidelines. The informativeness of Pazesh was evaluated in the context of an extractive single-document summarization task, using 567 news articles of DUC 2002. For each article, the evaluation task is to generate a 100-word summary. For the evaluation, we used the recall scores of three different
1 ROUGE is available at http://berouge.com/
metrics of ROUGE: ROUGE-1 (unigram-based), ROUGE-2 (bigram-based), and ROUGE-W. Pazesh is evaluated in two scenarios: constructing the final path based on accumulative sentence scores (Paz1) vs. edge weights (Paz2). These settings of Pazesh are compared with baselines in two categories: 1) strong informative systems (not focused on readability), including SentenceRank and another single-document extractive summarizer called Method1 [10]; 2) a readable system, A* search [5]. Due to the lack of a universal assessment methodology and the small number of systems using identical measures, a large-scale, high-quality comparison of the readability of our system with previous work is not feasible. Table 1 presents the ROUGE scores.

Table 1. ROUGE scores of different systems. '-': not reported.

System        ROUGE-1  ROUGE-2  ROUGE-W
Paz1          0.38     0.24     0.08
Paz2          0.23     0.19     0.05
SentenceRank  0.45     0.19     0.15
Method1       0.47     0.2      0.16
A*            0.37     0.08     -
Comparing Paz1 with Paz2 reveals that taking into account the sentence scores results in indisputably better informativeness than considering edge weights. The ROUGE-2 score is competitive with informative systems, since n-gram lengths greater than 1 in ROUGE estimate the fluency of summaries. However, the overall results do not outperform systems focusing only on the informativeness factor, although they are competitive with them to some degree. This is due to the fact that preserving both informativeness and readability, especially for cases with a summary length constraint, is a trade-off. Since 2006, a separate set of quality questions has been introduced by DUC to evaluate the readability aspects of summaries. However, there are still no automatic evaluations to assess such aspects. For the manual evaluation, we used 20 documents from the same dataset and the same scenarios used in the automatic evaluation. The evaluation was carried out by ten human judges who had read the DUC guidelines for readability assessment beforehand. For each scenario, three different variations of the summary length constraint were applied. Table 2 shows the results of our manual evaluation on a five-point scale. The results show that Pazesh performs very well on the criteria it intended to address: coherence and focus. The overall results at ratio 0.6 outperform ratio 0.3 and the no-ratio (minimum possible length) setting. This was expected, since the readability of a text is inherently more meaningful for a long text than for a short one. Comparing the results reveals that Paz2 outperforms Paz1 on the coherence aspect, which could be anticipated, though Paz1 performs more strongly than Paz2 on the focus aspect. Putting it all together, the obtained results show that Pazesh has accomplished its intended mission: meeting both readability and informativeness.
Table 2. Readability assessment. Left to right: ratio=0.6, ratio=0.3, ratio=not given.

Criterion            Paz1  Paz2  Paz1  Paz2  Paz1  Paz2
Grammaticality        5     5     5     5     5     5
Non-redundancy        4     5     5     5     5     5
Referential clarity   2     3     2     2     1     2
Focus                 5     4     3     3     4     3
Struct. & Coherence   5     5     3     4     3     4
5 Conclusion
In this paper, we introduced Pazesh: a graph-based approach to address both the readability and informativeness of automatic text summaries. This is accomplished by constructing a path of highly ranked sentences, as a readable sequence, passing through centroid sentences. As shown in the experimental results, Pazesh is a powerful and yet simple summarization system.

Acknowledgments. Special thanks go to Dr. Yahya Tabesh and Ali Moini for igniting the initial motivation for this research. This research is honoured to be supported by a grant from Iran's National Foundation of Elites.
References
1. Luhn, H.P.: The automatic creation of literature abstracts. IBM Journal of Research Development 2(2), 159–165 (1958)
2. Salton, G., Singhal, A., Mitra, M., Buckley, C.: Automatic Text Structuring and Summarization. Information Processing and Management Journal 33(2), 193–207 (1997)
3. Mihalcea, R.: Graph-based ranking algorithms for sentence extraction, applied to text summarization. In: Proceedings of ACL, Spain (2004)
4. Erkan, G., Radev, D.R.: LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. JAIR 22, 457–479 (2004)
5. Aker, A., Cohn, T., Gaizauskas, R.: Multi-document summarization using A* search and discriminative training. In: Proceedings of EMNLP, USA, pp. 482–491 (2010)
6. Riedhammer, K., Gillick, D., Favre, B., Hakkani-Tür, D.: Packing the meeting summarization knapsack. In: Proceedings of the Interspeech Conference, Australia, pp. 2434–2437 (2008)
7. Barzilay, R., Elhadad, N., McKeown, K.R.: Inferring strategies for sentence ordering in multidocument news summarization. JAIR 17, 35–55 (2002)
8. Hearst, M.A.: Texttiling: segmenting text into multi-paragraph subtopic passages. CL 23(1), 33–64 (1997)
9. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: Bringing order to the web. Technical report, Stanford University, USA (1998)
10. Xiaojun, W., Jianwu, Y., Jianguo, X.: Towards an iterative reinforcement approach for simultaneous document summarization and keyword extraction. In: Proceedings of the 45th ACL, Czech Republic, pp. 552–559 (2007)
Textual and Graphical Presentation of Environmental Information

Mohamed Mouine

RALI-DIRO, Université de Montréal
CP 6128, Succ. Centre-ville, Montréal (Québec) H3C 3J7
[email protected]
http://www-etud.iro.umontreal.ca/~mouinemo/
Abstract. The evolution of artificial intelligence has followed our needs. At first, there was a need for the production of information, followed by the need to store digital data. Following the explosion in the amount of generated and stored data, we needed to find the information we require. The problem is now how to present this information to the user. In this paper we present ideas and research directions that we want to explore in order to develop new approaches and methods for the synthetic presentation of objective information.

Keywords: information visualization, artificial intelligence, text, graph.
1 Introduction
In recent years there has been an explosion in the volume of generated data in all fields of knowledge. The most difficult task has become the analysis and exploration of this data. Data mining allows us to locate the information we need. Information visualization and visual data mining can help cope with the information flow when combined with some textual description. In this thesis we develop methods to automate the task of exploring the information and presenting it to the user in the easiest way. We will build a climate bulletin generator. The content of the resulting bulletins will be a combination of text and graphics. This bulletin generation must take into account the type of output device.
2 State of the Art
My thesis is in the line of the work of [5], who developed a model to solve the problem of generating integrated text and graphics in statistical reports. That model considered several criteria such as the intention of the writer, the types of variables, the relations between these variables and the data values. Since then, many researchers have worked on the same subject. [6] presents some techniques to visualize and interact with various data sets. The process of visualization
depends on the input type and the choice of output (user profiles, types of output device, etc.). The input of a visualization process can be data, information and/or knowledge [3]. The reader may also refer to [2], where the author gives an overview of the field of information visualization. Some authors have even studied every last detail (the choice of colors, shapes, location, ...) to produce a presentation that meets the user's expectations [8]. Good visualization is a concept that can be judged on several criteria related to the user, the system and the type of output device [1].
3 Problem Statement
In order to summarize and analyze large amounts of information, we intend to develop a method that automatically generates a visual report (graph, image, text, ...). Through this approach, we want to allow the user to easily retrieve all the information used in generating this report without having to go through the whole mass of information.
3.1 MétéoCode
The RALI1, to which I belong, has started a project2 in collaboration with Environment Canada (EC), which already publishes a large amount of meteorological information in XML form; this type of information is called MeteoCode. The selective display of personalized information would allow EC to provide the public with forecasts better targeted in time and space than those currently produced. The current forecasts also exhibit gaps (breakdowns, problems, ...) that we will try to improve on. This is reflected by the fact that these forecasts are limited to a few tens of words found in regional weather forecasts. Already, more than 1000 weather reports are issued twice a day. Given the size of Canada, they have to stay general and cannot show all the details that are available in the MeteoCode. Based on the information in the MeteoCode, we want to develop a climate bulletin generator for an address or postal code given by the user, most often her own. In addition, regional weather information must also be made available in different modes: graphics, web, weather radio and autoresponders. An important goal of our project is to study the development of innovative approaches for communicating user-relevant meteorological information while taking into account some temporal and geographic aggregation. Since the MeteoCode is already in XML and validated by an XML schema, we are convinced that the input is easily parsed. Thus, we will focus on determining the most appropriate way to present the data meaningfully for each type of output device. Given the size of the data, we will develop special-purpose techniques for aggregating data in space and time.
1 RALI comprises computer scientists and linguists experienced in automatic language processing. It is the largest NLP-oriented university laboratory in Canada.
2 http://rali.iro.umontreal.ca/EnvironmentalInfo/index.en.html
Experiments were conducted over the last two years by members of the RALI to illustrate the type of information available from EC. Web prototypes have been developed to display weather information graphically using Protovis, which is based on Scalable Vector Graphics (SVG), and another displaying alphanumeric information placed geographically using Google Maps. A third experiment was performed using jqPlot, a jQuery plugin for creating graphics. These experiments improve interactivity with the user. This allowed us to experiment with different ways of combining information published daily by Environment Canada with other Web-based approaches. Although these prototypes were not put into production, they showed the potential of integrating environmental information with Web applications so that it becomes more accessible and useful.
3.2 Graphic
To use graphics effectively in the automatic generation of reports, I will first draw some ideas from PostGraphe [4], which generated statistical reports containing text and graphics using an annotated description of the data. The user specifies his intentions to the system, the types of data to be presented and the relationships between data. Given the variety of available devices (Web, text, TV, PDA, etc.), it is not possible to adapt the MeteoCode for each output device. On the other hand, the same information should not be presented in exactly the same way on all devices. Each type of device imposes its own constraints and offers new opportunities. In doing so, information must be accessible for all types of devices while ensuring that the meaning of the information remains intact. We will also need to develop good techniques for producing natural language summaries, and for this we will build on the results of the SumTime project [9]. This project developed an architecture for generating short summaries of large time-series data. We want to build on the model used to choose the right words to be used in the summary.
3.3 Summarize: Good Forecast and Location Precision
In our project, two types of data3 can be used. The first type is SCRIBE4 and contains raw predictions in the form of matrices. This information is generated automatically by a numerical weather prediction model. This output is then fed to another system that allows meteorologists to comment on and change the predictions somewhat. The result of this change is the file called MeteoCode, in XML format. The second type of data is the GRIB file, a flat file containing raw model forecasts at very high resolution. To further summarize the data, we will need to perform a spatio-temporal clustering of these data based on the similarity of meteorological conditions and the relationship between the clusters (spectral clustering algorithms [7] are known to
3 The difference between the two types of data lies in the location accuracy and the quality of the forecast.
4 Good short-term forecasts at medium resolution (weather stations).
aggregate data matrices according to their similarities) and the conditions found. To meet this need we plan to use spectral clustering algorithms. These algorithms will be applied to the GRIB file and to the MeteoCode file after transforming its contents into matrix form. In a second step, we will try to do the same work using the SCRIBE file directly instead of the MeteoCode file, which is the result of a manual change by a meteorologist. The purpose of this clustering is to reduce the number of possible descriptions of weather conditions. Each condition could then be described by the closest local kernels. Finally, a good report should draw the attention of the user to unusual phenomena and conditions, the detection of which could be based on a simple technique for estimating the density of the local kernel.
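As a rough illustration of this plan (not the project's actual implementation), the following sketch applies an off-the-shelf spectral clustering algorithm to a matrix of weather conditions; the feature matrix, the number of clusters and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Hypothetical input: one row per (location, time) pair, columns are weather
# features (temperature, wind speed, precipitation, ...) extracted beforehand
# from the MeteoCode or GRIB data.
conditions = np.random.rand(500, 6)   # placeholder for the real feature matrix

# Group similar weather conditions; the number of clusters bounds the number
# of distinct condition descriptions used in the generated bulletins.
clustering = SpectralClustering(n_clusters=10, affinity='rbf', random_state=0)
labels = clustering.fit_predict(conditions)

# Each cluster can then be described once, and unusual conditions flagged as
# points far from every cluster (e.g. via a simple local density estimate).
```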
4 Conclusion
I hope that, by the end of this project, we will have advanced the science of synthetic presentation of objective information. The new approaches and methods used in this work could also find applications in other fields in which information changes over time in large quantities. The visual presentation aspect and automatic report generation can also be applied in finance, education, medicine, etc.
References
1. Bonnel, N., Chevalier, M.: Critères d'évaluation pour les interfaces des systèmes de recherche d'information. In: CORIA, pp. 109–115 (2006)
2. Chen, C.: Information visualization. Wiley Interdisciplinary Reviews: Computational Statistics 2(4), 387–403 (2010)
3. Chen, M., Ebert, D., Hagen, H., Laramee, R.S., Van Liere, R., Ma, K.L., Ribarsky, W., Scheuermann, G., Silver, D.: Data, information, and knowledge in visualization. Computer Graphics and Applications 29(1), 12–19 (2008)
4. Fasciano, M., Lapalme, G.: Intentions in the coordinated generation of graphics and text from tabular data. Knowledge and Information Systems 2(3), 310–339 (2000)
5. Fasciano, M.: Génération intégrée de textes et des graphiques statistiques (1996)
6. Heer, J., Bostock, M., Ogievetsky, V.: A tour through the visualization zoo. Communications of the ACM 53(6), 59–67 (2010)
7. Luxburg, U.: A tutorial on spectral clustering. Statistics and Computing 17(4), 395–416 (2007)
8. Ware, C.: Information visualization: perception for design. Morgan Kaufmann, San Francisco (2004)
9. Yu, J., Reiter, E., Hunter, J., Mellish, C.: Choosing the content of textual summaries of large time-series data sets. Natural Language Engineering 13(01), 25–49 (2007)
Comparing Distributional and Mirror Translation Similarities for Extracting Synonyms
Philippe Muller1 and Philippe Langlais2
1 IRIT, Univ. Toulouse & Alpage, INRIA
2 DIRO, Univ. Montréal
Abstract. Automated thesaurus construction by collecting relations between lexical items (synonyms, antonyms, etc.) has a long tradition in natural language processing. This has been done by exploiting dictionary structures or distributional context regularities (cooccurrence, syntactic associations, or translation equivalents), in order to define measures of lexical similarity or relatedness. Dyvik proposed to use aligned multilingual corpora and to define similar terms as terms that often share their translations. We evaluate the usefulness of this similarity for the extraction of synonyms, compared to the more widespread distributional approach.
1 Introduction
Automated thesaurus construction by collecting relations between lexical items has a long tradition in natural language processing. Most effort has been directed at finding synonyms, or rather "quasi-synonyms" [1], lexical items that have similar meanings in some contexts. Other lexical relations such as antonymy, hypernymy, hyponymy, meronymy, holonymy are also considered, and some thesauri also consider semantically associated items with less easily definable properties (e.g. the Moby thesaurus). From the beginning, a lot of different resources have been used towards that goal. Machine readable dictionaries appeared first and generated a lot of effort aiming at the extraction of semantic information, including lexical relations [2], or were used to define a semantic similarity between lexical items [3]. Also popular was distributional analysis, comparing words via their common contexts of use, or syntactic dependencies [4,5], in order to define another kind of semantic similarity. These approaches went on using more readily available resources in more languages [6]. More recently, a similar approach has gained popularity using bitexts in parallel corpora. Lexical items are considered similar when they are often aligned with the same translations in another language, instead of being associated to the same context words in one language [7,8]. A variation on this principle, proposed by [9], is to consider translation "mirrors": words that are translations of the same words in a parallel corpus, as they are supposed to be semantically related. Although this idea has not been evaluated for synonym extraction, it is the basis of some paraphrase extraction work, i.e. finding equivalent phrases of varying lengths in one language, see for instance [10].
Evaluations of this line of work vary but are often disappointing. Lexical similarities usually bring together heterogeneous lexical associations and semantically related terms that are not easy to sort out. Synonymy is probably the easiest function to check as references are available in many languages, even though they may be incomplete (e.g. WordNet for English) and synonym extraction is supposed to complement the existing thesauri. If these approaches have the semantic potential most authors assume, there is still a lot to be done to harness that potential. One path is to select the most relevant associations output by the aforementioned approaches (dictionary-based, distribution-based, or translation-based), as in the work of [11], hopefully making possible a classification of lexical pairs into the various targeted lexical relations. Another is to combine these resources and possibly other sources of information; see for instance [8]. We make a step in this latter direction here, by testing Dyvik's idea on lexical relation extraction. Translation mirrors have not been precisely evaluated in such a framework, and the way they can be combined with distributional information has not been investigated yet. We also pay particular attention to the frequency of the words under consideration, as polysemy and frequency variations of semantic variants seem to play an important role in some existing evaluations. Indeed, we show that mirror translations fare better overall than a reference distributional approach in the preselection of synonym candidate pairs, both on nouns and verbs, according to the different evaluations we performed. The remainder of this paper is organised as follows: we present the resources we considered in section 2 and our experimental protocol in section 3. We analyze our results in section 4. We relate our results to comparable approaches addressing the same issue in section 5 and finally conclude our work in section 6.
2 Resources and Input
We considered two reference databases in this work:
– the WordNet lexical database,1 provided through the NLTK package API.2 WordNet provides a reference for the following lexical relations: synonyms, antonyms, hypernyms, hyponyms, holonyms, meronyms. Each lemma present in WordNet has on average 5-6 synonyms, or 8-10 related terms if all lexical relations are taken together.
– the Moby thesaurus,3 which provides not only synonyms but more loosely related terms. This resource is much richer and less strict than WordNet, as each target has an average of about 80 related terms.
To estimate the frequencies of the words considered, we used data provided by the freely available Wacky corpus.4
1 http://wordnet.princeton.edu
2 http://www.nltk.org
3 http://www.gutenberg.org/dirs/etext02/mthes10.zip
4 http://wacky.sslmit.unibo.it
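For illustration, the WordNet synonym reference can be queried through the NLTK API roughly as follows; this is only a sketch (it assumes the WordNet data has been downloaded via nltk.download('wordnet'), and the target and part of speech are examples):

```python
from nltk.corpus import wordnet as wn

def wordnet_synonyms(target, pos=wn.NOUN):
    """Collect all WordNet synonyms of `target` across its synsets."""
    synonyms = set()
    for synset in wn.synsets(target, pos=pos):
        for lemma in synset.lemmas():
            name = lemma.name().replace('_', ' ')
            if name != target:
                synonyms.add(name)
    return synonyms

# e.g. the reference entry for the target 'groundwork' of Table 2
print(wordnet_synonyms('groundwork'))
```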
In order to compare similarities induced by distributional and mirror approaches, we have selected at random two sets of 1000 lexical items, a set of nouns and a set of verbs, that we will call "targets". We imposed an arbitrary minimal frequency threshold on the targets (> 1000). The statistics of the two references, with respect to the test sets of targets considered, are shown in table 1.

Table 1. Reference characteristics with the two target sets considered: median frequency in the Wacky corpus; mean, median, minimum and maximum number of associated terms. (NB: Moby mixes verbs and nouns, so we considered terms having a noun form or a verb form in each case.)

Pos     Median frequency   Reference       Number of associations
                                           mean    med    min   max
Nouns   3,538              WordNet syns     3.6    2.0     1     36
Nouns   3,538              Moby            73.8   57.0     3    509
Verbs   11,136             WordNet syns     5.6    4.0     1     47
Verbs   11,136             Moby           113.2   90.0     6    499

3 Protocol
We consider similar terms derived either by a translation mirror approach (section 3.1) or a syntactic distributional approach (section 3.2). Each approach provides a set of associated terms, or "candidates", ranked according to the similarity considered. These ranked candidates are then evaluated with respect to a reference for different lexical relations, either keeping n-best candidates or candidates above a given threshold. Details of the evaluation are presented below. As an example, table 2 shows candidates proposed by the translation mirrors for the randomly chosen target term groundwork. Note the huge difference in coverage of WordNet and Moby.

3.1 Translation Mirrors
The translation mirror approach is based on the assumption that words in a language E that are often aligned in a parallel corpus with the same word in another language F are semantically related. For instance, the French words manger and consommer are often both aligned with, and probable translations of, the English word eat. For the translation mirror based approach, we used a French-English bitext of 8.3 million pairs of phrases in translation relation coming from the Canadian Hansards (transcripts of parliamentary debates). This bitext is used by the bilingual concordancer TSRali5 and was kindly made available to us by the maintainers of the application. We lemmatized both French and English sentences using TreeTagger.6
5 http://www.tsrali.com/
Table 2. First ten candidate associations proposed by our translation mirror approach for the target term groundwork, and synonyms according to WordNet as well as a sample of related terms according to Moby. Underlined candidates belong to the WordNet reference, while those in bold are present in Moby; both are also reported in the reference they belong to. Words marked with ∗ are absent from the Hansards.

Candidates: base, basis, foundation, land, ground, job, field, plan, force, development
WordNet: base, basis, cornerstone, foot, fundament∗, foundation, substructure∗, understructure∗
Moby: arrangement, base, basement, basis, bed, bedding, bedrock, bottom, briefing, cornerstone, ... [47 more]
We then trained statistical translation models in both directions7 (English-to-French and French-to-English), running the Giza++ toolkit in its standard setting.8 Our translation mirror approach makes use of the lexical distributions of the two models,9 p_{e2f} and p_{f2e}, we obtained this way (see table 3 for an example). More specifically, we compute the likelihood that the word s is related to the target word w as:

p(s|w) ≈ Σ_{f ∈ τ_{e2f}(w)} p^{δ1}_{e2f}(f|w) × p^{δ2}_{f2e}(s|f),   where τ_{e2f}(w) = {f : p_{e2f}(f|w) > 0}
where τ_{e2f}(w) stands for the set of French words associated to w by the model p_{e2f}. In practice, two thresholds, δ1 and δ2, control the noise of the lexical distributions:

p^{δ}_•(t|s) = p_•(t|s) if p_•(t|s) ≥ δ, and 0 otherwise.

In the evaluations below we considered only the first 200 lemmas for each target, in order to compare it with the available distributional candidates presented in the following section.
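The computation can be sketched as follows, assuming the two lexical distributions estimated by Giza++ are available as nested dictionaries of the form {word: {translation: probability}}; the data structures and function names are illustrative, not the authors' actual code:

```python
from collections import defaultdict

def mirror_candidates(w, p_e2f, p_f2e, delta1=0.0, delta2=0.0, n_best=200):
    """Rank candidate synonyms s of the English target w through French pivot
    words f, scoring p(s|w) ~ sum_f p_e2f(f|w) * p_f2e(s|f) and ignoring
    lexical-distribution entries below the thresholds delta1 and delta2."""
    scores = defaultdict(float)
    for f, p_fw in p_e2f.get(w, {}).items():
        if p_fw < delta1:
            continue
        for s, p_sf in p_f2e.get(f, {}).items():
            if p_sf < delta2 or s == w:
                continue
            scores[s] += p_fw * p_sf
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n_best]
```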
3.2 Distributional Similarity
The distributional similarity we used is taken straight from the work of [5], as we believe it represents well this kind of approach. Also, a thesaurus computed by Lin is freely available,10 which eases reproducibility.
6 www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
7 IBM models are not symmetrical.
8 http://code.google.com/p/giza-pp/
9 We used IBM model 4.
Table 3. The 10 most likely associations to the words consommer and eat according to the lexical distributions p_{f2e}(•|consommer) and p_{e2f}(•|eat) respectively

p_{f2e}(•|consommer): consume (0.22), use (0.18), be (0.1), eat (0.092), consumption (0.048), consuming (0.037), take (0.023), drink (0.019), burn (0.012), consumer (0.011), ...
p_{e2f}(•|eat): manger (0.39), consommer (0.08), se (0.036), de (0.031), nourrir (0.028), avoir (0.027), du (0.023), alimentation (0.017), gruger (0.016), qui (0.014), ...
Lin used a dependency-based syntactic parser to count occurrences of triples (head lemma, relation, dependent lemma), where relation is a syntactic dependency relation. Each lemma is thus associated with counts for a set F of features (rel, other lemma), either as head of a relation with another lemma or as dependent. For instance, the verb eat has the features (has-subj, man), (has-obj, fries), (has-obj, pie), etc. Let c be the function giving the number of occurrences of a triple (w, rel, w′) and let V be the vocabulary:

c(∗, rel, w) = Σ_{w′∈V} c(w′, rel, w)
c(w, rel, ∗) = Σ_{w′∈V} c(w, rel, w′)
c(∗, rel, ∗) = Σ_{w′∈V} c(∗, rel, w′)

I(w, rel, w′) = log [ c(w, rel, w′) × c(∗, rel, ∗) / ( c(w, rel, ∗) × c(∗, rel, w′) ) ]

||w|| = Σ_{(r,w′)∈F(w)} I(w, r, w′)

I is the specificity of a relation (w, rel, w′), defined as the mutual information between the triple elements [5], and ||w|| is the total information quantity associated to w. Finally, the similarity between two lemmas w1 and w2 measures the extent to which they share specific syntactic contexts, using the information quantity of their shared contexts, normalised by the sum of their total information quantities:

sim(w1, w2) = Σ_{(r,w)∈F(w1)∩F(w2)} [ I(w1, r, w) + I(w2, r, w) ] / ( ||w1|| + ||w2|| )

The available thesaurus lists the closest 200 lemmas for each word in a given vocabulary.
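A compact re-implementation sketch of these formulas, assuming the dependency triples have already been extracted and counted (this is not Lin's original code; it is only meant to make the definitions concrete):

```python
import math
from collections import defaultdict

def lin_similarity(triple_counts):
    """Build Lin's similarity from dependency triple counts.
    `triple_counts` maps (w, rel, w2) to its corpus count."""
    c_w_rel = defaultdict(float)    # c(w, rel, *)
    c_rel_w2 = defaultdict(float)   # c(*, rel, w2)
    c_rel = defaultdict(float)      # c(*, rel, *)
    features = defaultdict(set)     # F(w): the (rel, w2) features of w
    for (w, rel, w2), n in triple_counts.items():
        c_w_rel[(w, rel)] += n
        c_rel_w2[(rel, w2)] += n
        c_rel[rel] += n
        features[w].add((rel, w2))

    def info(w, rel, w2):           # I(w, rel, w2): specificity of a triple
        num = triple_counts.get((w, rel, w2), 0.0) * c_rel[rel]
        den = c_w_rel[(w, rel)] * c_rel_w2[(rel, w2)]
        return math.log(num / den) if num > 0 and den > 0 else 0.0

    def norm(w):                    # ||w||: total information quantity of w
        return sum(info(w, r, w2) for (r, w2) in features[w])

    def sim(w1, w2):
        shared = features[w1] & features[w2]
        num = sum(info(w1, r, w) + info(w2, r, w) for (r, w) in shared)
        den = norm(w1) + norm(w2)
        return num / den if den > 0 else 0.0

    return sim
```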
4 Experiments and Results
Following the protocol introduced above, we evaluated the outputs of lexical similarities based on the n-best candidates, varying n, or based on varying similarity thresholds, both for the distribution-based approach and the mirror approach.
10 http://webdocs.cs.ualberta.ca/~lindek/Downloads/sim.tgz
We have two different test sets to evaluate differences between nouns and verbs. As shown in table 1, syntactic categories differ in the number of synonyms or other lexically related items they possess, and it is likely that they impact as well the approaches we investigated; see [12] on the role of frequency in that perspective. We considered for evaluation only the items that were common to the reference and the lexicon covered by the resources used. For instance, some synonyms from WordNet have no occurrences in the Hansard or in Lin's database, and this can be seen as a preprocessing filter of rare items. Moreover, both approaches we compare are sensitive to the typical frequencies of the targets considered. In both cases, all senses of a word are conflated in the computation and it is likely that more frequent usages dominate less frequent ones. We wanted to evaluate the role played by this factor and we took this into account in our evaluations by adding a varying frequency threshold on the candidates considered. For a set of values ci, we filtered out candidates with frequencies less than ci in a reference corpus (the Wacky corpus, mentioned above).11 Additionally, we took out a list of the most common items in the candidates of the target sets. We arbitrarily removed those terms that appear in more than 25% of the candidate lists (this threshold could be tuned on a development set in further experimentations). This includes very common nouns (e.g. thing, way, etc.) and verbs (e.g. have, be, come), as well as terms that are over-represented in the Hansard corpus (e.g. house), since alignment errors induce some noise for very frequent items. Finally, we combined the candidate lists produced by the two approaches by filtering out candidates for one approach that are not present in the other's candidate list. We are interested in two aspects of the evaluation: how much of the reference is covered by our approaches, and how reliable they are, that is, we want the top of the candidate list to be as precise as possible with respect to an ideal reference. In order to do so, we evaluate our approaches according to precision and recall12 at different points in the n-best list or at different threshold values. We also compute typical information retrieval measures to estimate the relevance of the ranking: mean average precision (MAP) and mean reciprocal rank (MRR). MAP computes the precision at each point where a relevant term appears in a list of candidates; MRR is the average of the inverses of the ranks of the first relevant term in a list. Last, we looked at the precision of each method assuming an "oracle" gives them the right number of candidates to consider for each target, a measure called R-precision in the information retrieval literature. So for instance, the 10 candidates of table 2 evaluated against the WordNet reference would receive a precision of 3/10 and a recall of 3/5 (and not 3/8, because understructure, substructure and fundament are absent from the Hansard). R-precision would also be 3/5, since all correct candidates are found at ranks less than the reference size (5 synonyms).
11 The thresholds were chosen to correspond to different ranges of lexical items.
12 For the sake of readability, we report precision and recall as percentages.
Precision at rank 1 would be 1, while precision at rank 5 would be 3/5. The MAP would be 0.63 = 6.29/10 = (1/1 + 2/2 + 3/3 + 3/4 + ... + 3/10) / 10, and the MRR would be 1 in this case because the first candidate is correct; it would be 1/2 if only the second were correct, and so on. Our experiments led to the observation that it is better to cut the n-best list at a given rank than to try to find a good similarity threshold, and we thus only detail results for the first method.
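The worked example for groundwork can be reproduced with a few lines of code. This is only a sketch of the measures as they are described here (in particular, MAP is averaged over all ranks of the candidate list, as in the example above), with the candidate and reference lists taken from Table 2:

```python
def evaluate_ranking(candidates, reference):
    """Precision@k, recall, MAP, MRR and R-precision for one target."""
    hits = [1 if c in reference else 0 for c in candidates]
    p_at = lambda k: sum(hits[:k]) / float(k)
    return {
        'P1': p_at(1),
        'P5': p_at(5),
        'recall': sum(hits) / float(len(reference)),
        'MAP': sum(p_at(k) for k in range(1, len(candidates) + 1)) / len(candidates),
        'MRR': next((1.0 / (i + 1) for i, h in enumerate(hits) if h), 0.0),
        'R-prec': p_at(len(reference)),   # oracle cut at the reference size
    }

candidates = ['base', 'basis', 'foundation', 'land', 'ground',
              'job', 'field', 'plan', 'force', 'development']
reference = {'base', 'basis', 'cornerstone', 'foot', 'foundation'}  # WordNet synonyms present in the Hansard
print(evaluate_ranking(candidates, reference))
# -> P1 = 1.0, P5 = 0.6, recall = 0.6, MAP ~ 0.63, MRR = 1.0, R-prec = 0.6
```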
4.1 WordNet
Table 4 shows the results for nouns with respect to synonyms in WordNet. For each approach we report precision at ranks n = 1, ..., 100 in the candidate list, MAP, MRR, the R-precision, the number of considered synonym pairs from the reference (ref), with respect to which the overall recall is computed. We also report the influence of different frequency filters. A line with f>5000 means we consider only candidates and reference items with a frequency above 5000 in Wacky. As the WordNet reference has few synonyms, one should focus on precisions at low ranks (1 and 5) as well as the oracle, R-precision: all others are bound to be quite low. The other cutoffs make more sense for the evaluation with respect to Moby, and are here for comparison. This being noted, table 4 calls for several comments. First of all, we observe that the precision of the mirror approach at rank 1 culminates at 22% while overall recall tops at 60%, a much better score than the distributional approach we tested. Second, it is noticeable that filtering out less frequent candidates benefits the mirror approach much more than the distributional one. It might be a consequence of having a smaller corpus to start with, in which rarer words have less reliable alignment probabilities. Third, we observe that combining the candidates of both approaches yields a significant boost in precision at the cost of recall. This is encouraging since we tested very simple combination scenarios: the one reported here consists in intersecting both lists of candidates.

Table 4. Results (percentages) for nouns, micro-averaged, with respect to synonyms in WordNet

n-best            P1    P5   P10   P20  P100   MAP   MRR  R-prec   ref  recall
Mirror  f>1      16.4   5.1   3.8   2.7   1.3  11.9  15.1   16.6  2342    50.0
        f>5000   19.1   5.4   3.8   2.6   1.2  11.3  13.2   17.5  1570    54.8
        f>20000  22.1   5.7   3.9   2.5   1.1   9.8  11.4   22.7  1052    60.6
Lin     f>1      17.4   5.2   3.5   2.5   1.5  11.7  14.3   14.7  2342    35.9
        f>5000   16.5   5.0   3.5   2.5   1.6   9.2  10.8   16.7  1570    36.6
        f>20000  17.5   4.5   3.3   2.5   1.6   7.3   8.4   20.1  1052    36.9
M/L     f>1      25.8   7.5   5.7   4.4   3.8  15.9  17.6   22.0  2342    29.3
        f>5000   27.4   7.4   5.5   4.3   3.8  12.7  13.6   24.6  1570    31.1
        f>20000  26.1   6.4   4.7   3.5   2.6   9.7  10.4   28.9  1052    32.7
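The frequency filter f>c and the simple combination scenario (intersecting the two candidate lists) mentioned above could be sketched as follows; the data structures are illustrative assumptions:

```python
def frequency_filter(candidates, freq, threshold, stop_terms=frozenset()):
    """Keep candidates frequent enough in the reference corpus (Wacky counts in
    `freq`) and not among the over-represented terms removed beforehand."""
    return [c for c in candidates if freq.get(c, 0) > threshold and c not in stop_terms]

def combine_by_intersection(mirror_list, lin_list):
    """Keep mirror candidates that also appear among Lin's candidates,
    preserving the mirror ranking."""
    allowed = set(lin_list)
    return [c for c in mirror_list if c in allowed]
```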
Last, the results on verbs are quite similar to those for nouns, with a better precision at low ranks and at higher frequency cutoffs, even though the oracle evaluation is roughly the same for all configurations. Again, filtering one method with the other yields better results, with oracle precision between 20% and 27%, similarly to what is observed on nouns. On verbs we observe a much higher recall for mirrors and a better precision on frequent candidates (best P1 is 25 against 23), but Lin is higher at P1 without frequency filter (30 against 23), and precisions after P1 are roughly the same or better for mirrors.
4.2 Moby
Table 5 shows the results for nouns with respect to the related terms in the Moby thesaurus. We expected that this reference would be closer to what is recovered by a distributional similarity, and that is indeed the case for nouns: Lin's precision is superior across the board, even by 10 points at rank 1. However, both methods are comparable on verbs. One notable fact is that both similarities capture a whole lot more than just synonymy, so the scores are much higher than on WordNet, and this can be considered somewhat of a surprise for the mirror translations, since this method should capitalise on translation relations only. Also, in almost all cases, the overall recall is higher with the translation mirror method, an observation consistent with our experiments on WordNet. Filtering out low frequency words has mixed effects: precision is slightly less for f>20000 than for f>1, but the corresponding recall of high frequency related terms is higher. The combination of the two methods consistently improves precision (again to the detriment of recall). As a conclusion, related terms do appear in mirror translations, even if they seem to do so with lower similarity scores than synonyms, and we have to investigate more precisely what is happening (translation approximations or errors, or a better coverage of synonymy in the Moby thesaurus than in WordNet).

Table 5. Results (percentages) for nouns, micro-averaged, with respect to related terms in Moby

n-best            P1    P5   P10   P20  P100   MAP   MRR  R-prec    ref  recall
Mirror  f>1      33.7  15.8  13.3  11.0   7.0  18.5  40.1   11.0  60774    18.1
        f>5000   32.7  14.5  12.1   9.8   6.1  18.7  38.1   11.8  43294    21.6
        f>20000  30.3  13.2  10.7   8.6   5.3  18.1  34.9   12.8  28488    26.7
Lin     f>1      44.8  19.9  16.4  13.4   9.5  26.6  46.8   14.7  60774    15.4
        f>5000   40.7  18.5  15.0  12.5   9.3  25.6  41.6   15.0  43294    16.3
        f>20000  39.4  16.1  13.5  11.2   8.4  23.3  35.2   16.8  28488    16.8
M/L     f>1      53.1  25.1  21.4  18.1  15.2  46.6  22.9   25.0  60774     9.4
        f>5000   52.4  23.0  19.3  16.6  13.7  30.7  41.2   23.4  43294    10.9
        f>20000  45.9  19.4  16.5  14.0  11.2  24.6  32.6   21.6  28488    12.5
4.3 Error Analysis
The kind of evaluation we presented above has a few shortcomings. The main reference we used for synonymy does not have a large number of synonyms per entry, and if one of our objectives is to extend existing resources, we cannot estimate the interest of the items we find that are absent from that reference. Using a larger thesaurus such as Moby only partially solves the problem, since there is no distinction between lexical relations, and some related terms do not correspond to any classical lexical function. In order to evaluate our output more precisely, but on a much smaller scale, we have looked at a sample of items that are absent from the reference, to measure the amount of actual errors. To do this, we took a number of terms which are the first candidates proposed by the mirror approach for a target, but are absent from WordNet. We found a number of different phenomena, on a sample of 100 cases:
– 25% of words that are part of a multi-word expression which were probably aligned to the same target translation, such as sea/urchin;
– 18% of words that are actually synonyms, according to other thesauri we could check manually,13 such as torso/chest;
– 13% hypernyms, listed in WordNet or in www.thesaurus.com, e.g. twitch/movement;
– 6% morphologically related items such as accountant/accounting, probably because of a pos-tag ambiguity in the pivot language, here the French word comptable, which can be a noun or an adjective.
Among the remaining errors that are probably common, some are due to the polysemy of a pivot translation (e.g. the English word aplomb translated into French as assurance, which can also mean insurance in English). This is hard to quantify exactly in the sample without looking in detail at all related aligned word pairs. Among the remaining various errors, some bear on rare occurrences in the input corpus, which we should have filtered out beforehand. All in all, we can see there is room for easy improvement. Only polysemy is a hard problem to address, and this is so for any kind of approach relying on distributional data. In addition to that, we are currently looking at items that were not considered in the evaluation because there was no synonym for them in WordNet, but for which there are mirror translations (such as whopper/lie). Although we cannot yet quantify the presence of truly related lexical items, the few examples we looked at seem to reflect the analysis above.
5 Related Work
There are several lines of work that are comparable to what we presented here, with a variety of objectives, evaluation methodologies and input data. Paraphrase extraction shares some of our objectives and some of the resources we considered.
13 Such as http://www.thesaurus.com.
Synonym extraction and thesaurus building also overlap our goals and evaluation methods. Also, work striving to design and compare semantic similarities is the closest in nature, if not in the objectives. Paraphrase acquisition is usually evaluated on the acceptability of substitutions in context, and only small-scale human judgments of the results give an indication of the lexical functions captured: [13] reports that 90% of their pattern-based extracted paraphrases are valid, mixing synonyms, hypernyms and coordinate terms, but with no indication of coverage. Similarly, [14] or [15] precisely evaluate the presence of synonyms on similarity lists on small subsets of synonym/antonym pairs, which makes it hard to extrapolate on the kind of data we used, where we aim at a much larger coverage. Closer to our methodology, several studies evaluate the classification of a set of word pairs as synonyms or not; either directly on the candidates selected for each target, as we do here, or on resampled word pair test sets that make the task accessible to common learning techniques. The former method (non-resampled, which is also ours) is more realistic and of course gives rather low scores: [7] use alignment vectors on a set of language pairs, and syntactic argument vectors, and similarity is defined in a comparable way between the vectors. The study in [8] also uses a syntactic distributional similarity and a distance in a dictionary-based lexical network. The first study only looks at the first three candidates in Dutch, with respect to a synonym reference (EuroWordNet) and considers only nouns. Scores P1 range from 17.7 to 22.5% on alignment candidates, with distributional similarity at 8%, and the combination at 19.9%. The authors have an updated experiment in [16], still on Dutch nouns, and reach 30% for P1, but do not explain the differences in their setup. The second study applies linear regressions on all similarity scores, with different target frequencies and similarity thresholds, and reaches a maximum f-score of 23% on nouns and 30% on verbs on one of its many settings. The reference here was the union of WordNet and the Roget's, which places it somewhere between WordNet and Moby with respect to coverage. A different setting is resampled evaluation, where a classifier is trained and tested on a set of word pairs with an a priori ratio of synonyms and non-synonyms. It is only relevant if a good preselection method allows one to reach the assumed proportions of synonyms in the training and test sets [17]. Our results could actually be considered as an input to such methods. Taken alone, distributional similarities in [18] show results that are comparable to ours or better on Moby, but slightly lower on WordNet. His test set is larger, and split differently with respect to word frequencies. His results are lower than what we obtain here with Lin's own data (as we noticed also about [7]), so we can assume that our comparison is representative with respect to the distributional approach and is a fair comparison. Mirror translations thus reach comparable or better results than distributional similarity and alignment similarities for synonyms in English, and we have shown that the different methods can be usefully combined in a simple way. Besides,
mirror translations are simpler to compute than the best similarities between n × n alignment or cooccurrence vectors, where n is the size of the vocabulary. As a secondary evaluation, authors often use TOEFL synonymy tests [19,6], where the task is to distinguish the synonym of a given word in a given context among four candidate items. This is a sort of easier word disambiguation test where the task is to separate a word from unrelated distractors, instead of distinguishing between word senses. We are planning to test the mirror translations against such available benchmarks in the near future. Another way to evaluate the relevance of similarity measures between words is derived from the data collected by [20], where humans are asked to judge the similarity or relatedness of items on a scale. This is an interesting way of providing an intrinsic evaluation of these associations, but it covers only a very limited part of the vocabulary (about 300 words, with only a couple of associations for each).
6 Conclusion
Our different experiments confirm the variety of lexical associations one can find for words paired by so-called semantic similarity measures. While the mirror and the distributional approaches we considered in this work both seem correlated to the references considered, our objective is to be able to pinpoint more precise lexical functions, as they are needed for different tasks (paraphrase substitution, lexical choice in translation, etc.). With respect to synonyms, our experiments indicate that mirror translations provide a better filter than syntactic distributional similarity. While alignment data have been less studied as a source of similarity than syntactic distributions, we hope we succeeded in showing that they are worth the investigation. We also note that finding mirrors is computationally simpler than finding the best similarities between alignment or distributional vectors, the latter method being the closest in spirit to our approach. Our longer-term objective is to reproduce synonymy word pair supervised classification; any similarity alone scores quite low as a synonymy descriptor, but experiments such as [17] show it is doable to reliably label word pairs with lexical functions if the proportion of candidates is more balanced than the very low natural proportion, and this means designing a filter as we do here. The complementarity of the resources considered here is still an open question, although we show that intersecting similarities as simply as we did here already provides some gain in precision. A more interesting path is probably to combine this with pattern-based approaches, either as another filter or to help selecting productive patterns to start with. The main problem for word similarity measures based on any kind of distributional regularity remains to deal with polysemy, especially when different senses have very different frequencies of use. Lastly, we plan to investigate the use of multiple language pairs to improve the precision of the predictions of the mirror approach.
References
1. Edmonds, P., Hirst, G.: Near-Synonymy and lexical choice. Computational Linguistics 28(2), 105–144 (2002)
2. Michiels, A., Noel, J.: Approaches to thesaurus production. In: Proceedings of Coling 1982 (1982)
3. Kozima, H., Furugori, T.: Similarity between words computed by spreading activation on an English dictionary. In: Proceedings of the Conference of the European Chapter of the ACL, pp. 232–239 (1993)
4. Niwa, Y., Nitta, Y.: Co-occurrence vectors from corpora vs. distance vectors from dictionaries. In: Proceedings of Coling 1994 (1994)
5. Lin, D.: Automatic retrieval and clustering of similar words. In: Proceedings of Coling 1998, Montreal, vol. 2, pp. 768–774 (1998)
6. Freitag, D., Blume, M., Byrnes, J., Chow, E., Kapadia, S., Rohwer, R., Wang, Z.: New experiments in distributional representations of synonymy. In: Proceedings of CoNLL, pp. 25–32 (2005)
7. van der Plas, L., Tiedemann, J.: Finding synonyms using automatic word alignment and measures of distributional similarity. In: Proceedings of the COLING/ACL Poster Sessions, pp. 866–873 (2006)
8. Wu, H., Zhou, M.: Optimizing synonyms extraction with mono and bilingual resources. In: Proceedings of the Second International Workshop on Paraphrasing. Association for Computational Linguistics, Sapporo (2003)
9. Dyvik, H.: Translations as semantic mirrors: From parallel corpus to wordnet. In: The Theory and Use of English Language Corpora, ICAME 2002 (2002)
10. Bannard, C., Callison-Burch, C.: Paraphrasing with bilingual parallel corpora. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 597–604 (2005)
11. Zhitomirsky-Geffet, M., Dagan, I.: Bootstrapping distributional feature vector quality. Computational Linguistics 35(3), 435–461 (2009)
12. Weeds, J.E.: Measures and Applications of Lexical Distributional Similarity. PhD thesis, University of Sussex (2003)
13. Barzilay, R., McKeown, K.R.: Extracting paraphrases from a parallel corpus. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (2001)
14. Lin, D., Zhao, S., Qin, L., Zhou, M.: Identifying synonyms among distributionally similar words. In: Proceedings of IJCAI 2003, pp. 1492–1493 (2003)
15. Curran, J.R., Moens, M.: Improvements in automatic thesaurus extraction. In: Proceedings of the ACL 2002 Workshop on Unsupervised Lexical Acquisition, pp. 59–66 (2002)
16. van der Plas, L., Tiedemann, J., Manguin, J.-L.: Automatic acquisition of synonyms for French using parallel corpora. In: Proceedings of the 4th International Workshop on Distributed Agent-Based Retrieval Tools (2010)
17. Hagiwara, M., Ogawa, Y., Toyama, K.: Supervised synonym acquisition using distributional features and syntactic patterns. Journal of Natural Language Processing 16(2), 59–83 (2009)
18. Ferret, O.: Testing semantic similarity measures for extracting synonyms from a corpus. In: Proceedings of LREC (2010)
19. Turney, P.: A uniform approach to analogies, synonyms, antonyms, and associations. In: Proceedings of Coling 2008, pp. 905–912 (2008)
20. Miller, G., Charles, W.: Contextual correlates of semantic similarity. Language and Cognitive Processes 6(1), 1–28 (1991)
Generic Solution Construction in Valuation-Based Systems
Marc Pouly
Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg
Abstract. Valuation algebras abstract a large number of formalisms for automated reasoning and enable the definition of generic inference procedures. Many of these formalisms provide some notions of solutions. Typical examples are satisfying assignments in constraint systems, models in logics or solutions to linear equation systems. Contrary to inference, there is no general algorithm to compute solutions in arbitrary valuation algebras. This paper states formal requirements for the presence of solutions and proposes a generic algorithm for solution construction based on the results of a previously executed inference scheme. We study the application of generic solution construction to semiring constraint systems, sparse linear systems and algebraic path problems and show that the proposed method generalizes various existing approaches for specific formalisms in the literature. Keywords: solution construction in valuation algebras, local computation, semiring constraint systems, sparse matrix techniques.
1 Introduction
In recent years, various formalisms for automated inference have been proposed. Important examples are probability potentials from Bayesian networks, belief functions from Dempster-Shafer theory, different constraint systems and logics, Gaussian potentials and density functions, relational algebra, possibilistic formalisms, systems of equations and inequalities over fields and semirings and many more. Inference based on these formalisms is often a computationally hard task which is successfully addressed by methods exploiting tree-decompositions. Moreover, since all the above formalisms satisfy some common algebraic properties pooled in the valuation algebra framework [7, 18], it is possible to provide generic tree-decomposition algorithms for the computation of inference with arbitrary valuation algebras. Thus, instead of re-inventing such methods for each different formalism, it is sufficient to verify a small axiomatic system to gain access to efficient generic procedures and implementations [12]. This is known as the local computation framework. Although not limited to them, many valuation algebras are defined over variable systems and express in some sense which assignments of values to variables are "valid" or "preferred" over others. Typical examples are models in logics, satisfying assignments in crisp constraint systems,
solutions to linear equations or assignments that optimize the cost function of a soft constraint system. Subsequently, we refer to such assignments as solutions. In the general case, solutions cannot be obtained by applying generic tree-decomposition procedures. Given a constraint system for example, these algorithms only tell us whether the system is satisfiable, but they generally do not find solutions. However, a common way to describe local computation procedures is by message-passing on a graphical structure called join tree. The nodes of a join tree exchange messages and combine incoming messages to their content until the result to the inference problem is found. For specific valuation algebras, it has been shown that solutions can be computed efficiently from the results of the inference process, i.e. from the node contents at the end of the message-passing. This process has been described for crisp constraint systems, linear inequalities and most probable assignments in Bayesian networks in [3] and for specific soft constraints in [17]. A generalization of the latter to a larger class of semiring valuation algebras [8] can be found in [11]. Also, it is known that Gaussian variable elimination in sparse regular systems corresponds to local computation in the valuation algebra of linear systems [7], where solution construction is achieved by the usual back-substitution process. All these approaches construct solutions based on the results of a previous local computation process, but they are always limited to specific valuation algebras. In this paper, we aim for a generic analysis of solution construction. We state sufficient requirements for the existence of solutions in valuation algebras and derive a generic algorithm for the identification of solutions. In the second part, we show how existing approaches are generalized by this framework and also apply solution construction to other valuation algebras which have not yet been considered under this perspective, i.e. Gaussian potentials and quasi-regular valuations. The family of quasi-regular valuation algebras is used to model path problems with varying semantics. They are for the first time shown to instantiate the valuation algebra framework which, besides generic solution construction, is the second key contribution of this paper. The outline of this paper is as follows: We first give a short introduction to the valuation algebras, state the inference problem as the main computational task and present the fusion algorithm for the solution of inference problems with arbitrary valuation algebras. Section 3 gives general requirements for the presence of solutions in a valuation algebra and lists some simple properties which are used in Section 3.1 to define a generic solution construction scheme. Finally, we study in Section 4 several instantiations of this framework and also show how existing approaches in the literature are generalized by this scheme.
2 Valuation Algebras and Local Computation
The basic elements of a valuation algebra are so-called valuations. Intuitively, a valuation can be regarded as a representation of knowledge about the possible values of a set of variables. If r denotes the universe of variables, then each valuation φ refers to a finite set of variables d(φ) ⊆ r called its domain. Let P(r) be the power set of r and Φ a set of valuations with domains in P(r). We assume three operations defined in (Φ, P(r)):
– Labeling: Φ → P(r); φ ↦ d(φ),
– Combination: Φ × Φ → Φ; (φ, ψ) ↦ φ ⊗ ψ,
– Variable Elimination: Φ × r → Φ; (φ, X) ↦ φ−X.

The following axioms are imposed on (Φ, P(r)):
1. Commutative Semigroup: Φ is associative and commutative under ⊗.
2. Labeling: For φ, ψ ∈ Φ, d(φ ⊗ ψ) = d(φ) ∪ d(ψ).
3. Variable Elimination: For φ ∈ Φ and X ∈ d(φ), d(φ−X) = d(φ) − {X}.
4. Commutativity of Elimination: For φ ∈ Φ with d(φ) = s and X, Y ∈ s, (φ−X)−Y = (φ−Y)−X.
5. Combination: For φ, ψ ∈ Φ with X ∉ d(φ) and X ∈ d(ψ), (φ ⊗ ψ)−X = φ ⊗ ψ−X.

These axioms require natural properties regarding knowledge modeling. The first axiom indicates that if knowledge comes in pieces, the sequence does not influence their combination. The labeling axiom tells us that the combination of valuations gives knowledge over the union of the involved variables. The third axiom ensures that eliminated variables disappear from the domain of a valuation. The fourth axiom says that the order of variable elimination does not matter, and the combination axiom states that we may either combine a new piece to the already given knowledge and focus afterwards to the desired domain, or we first eliminate the uninteresting parts of the new knowledge and combine it afterwards. A system (Φ, P(r)) satisfying the above axioms is called a valuation algebra. More general definitions of valuation algebras exist to cover formalisms based on general lattices instead of variable systems [7]. But since solutions are variable assignments, the above definition is appropriate. Due to axiom 4 the elimination order of variables is not significant. We may therefore write φ↓s = φ−{X1,...,Xk} if a non-empty set of variables {X1, ..., Xk} = d(φ) − s is eliminated. This is called the projection of φ to s ⊂ d(φ). A listing of formalisms that adopt the structure of a valuation algebra is given in the introduction. We refer to Section 4 and [7, 11, 13] for further examples and next focus on the main computational interest in valuation algebras.

2.1 The Inference Problem
Given a set of valuations {φ1 , . . . , φn } ⊆ Φ and a query x ⊂ d(φ1 ) ∪ . . . ∪ d(φn ), the inference problem consists in computing (φ1 ⊗ · · · ⊗ φn )↓x = (φ1 ⊗ · · · ⊗ φn )−{X1 ,...,Xk }
(1)
for {X1 , . . . , Xk } = (d(φ1 ) ∪ . . . ∪ d(φn )) − x. The complexity of combination and variable elimination generally depends on the size of the factor domains and
often shows an exponential behaviour. According to axioms 2 and 3, the domains of valuations grow under combination and shrink under variable elimination. Efficient inference algorithms therefore confine the size of intermediate results, which can be achieved by alternating the two operations. This strategy is called local computation, and the valuation algebra axioms proved sufficient for the definition of general local computation schemes which solve inference problems independently of the underlying formalism. Local computation algorithms include fusion [16], bucket elimination [3] and collect [18] for single queries and more specialized architectures for the computation of multiple queries [5–7, 9].

2.2 The Fusion Algorithm
Let us first consider the elimination of a variable Y ∈ d(φ1) ∪ ... ∪ d(φn) from a set {φ1, ..., φn} ⊆ Φ of valuations. This operation can be performed as follows:

Fus_Y({φ1, ..., φn}) = {ψ−Y} ∪ {φi : Y ∉ d(φi)},   where ψ = ⊗_{i: Y ∈ d(φi)} φi.   (2)
The fusion algorithm then follows by a repeated application of this operation: (φ1 ⊗ ··· ⊗ φn)−{X1,...,Xk} = ⊗ Fus_{Xk}(... (Fus_{X1}({φ1, ..., φn}))...). Proofs are given in [7]. In every step i = 1, ..., k of the fusion algorithm, the combination in (2) creates an intermediate factor ψi with domain d(ψi). Then, the variable Xi is eliminated only from ψi in (2). We define λ(i) = d(ψi) − {Xi}, called the label, and observe that λ(k) = x. The domains of all intermediate results of the fusion algorithm are therefore bounded by the largest label plus one. In other words, the smaller the labels are, the more efficient local computation is. We further remark that the labels depend on the chosen elimination sequence for the variables {X1, ..., Xk}. Regrettably, finding the elimination sequence that leads to the smallest labels is NP-complete [1], but we have good heuristics that achieve reasonable execution time [4]. The fusion algorithm can be represented graphically: we create a node for each step i = 1, ..., k carrying label λ(i). Then, if ψj with j < i occurs as a factor in the combination (2) of ψi, a directed edge is drawn from node j to node i. Node i is then called the child i = ch(j) of node j. The resulting graph is a tree directed towards the root node k that satisfies the running intersection property [7], i.e. if i, j are two nodes and X ∈ λ(i) ∩ λ(j), then X ∈ λ(m) for all nodes m on the unique path between i and j. Labeled trees satisfying this property are called join trees.
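As an illustration, the fusion scheme of equation (2) can be written generically, with the valuation-algebra operations supplied as callbacks. This is a sketch whose function signatures are assumptions, not an existing library interface:

```python
def fusion(valuations, elimination_sequence, domain, combine, eliminate):
    """Generic fusion: for each variable X in the elimination sequence, combine
    all factors whose domain contains X into the intermediate factor psi and
    replace them by eliminate(psi, X), as in equation (2). `domain(phi)` returns
    d(phi) as a set, `combine` implements the combination operator and
    `eliminate(phi, X)` implements variable elimination. The returned factors
    jointly represent the projection of the combined knowledge base to the
    un-eliminated (query) variables."""
    factors = list(valuations)
    for X in elimination_sequence:
        touched = [f for f in factors if X in domain(f)]
        if not touched:
            continue
        psi = touched[0]
        for f in touched[1:]:
            psi = combine(psi, f)
        factors = [f for f in factors if X not in domain(f)] + [eliminate(psi, X)]
    return factors
```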
3 Solutions in Valuation Algebras
Let ΩX be the set of possible values of a variable X ∈ r. Then, the set of possible assignments to a non-empty set of variables s ⊆ r is given by the Cartesian product Ωs = ×_{X∈s} ΩX. We refer to the elements in Ωs as configurations of s and define by convention for the empty variable set Ω∅ = {⋄}, where ⋄ denotes the empty configuration. Further, we
write x↓t for the projection of x ∈ Ωs to t ⊆ s. The definition of solutions in general valuation algebras must be independent of the actual semantics. Instead, we define solutions in terms of a structural property. Assume a valuation φ ∈ Φ with domain d(φ) = s, t ⊆ s and a configuration x ∈ Ωt. We write W_φ^t(x) for the set of all configurations y ∈ Ω_{s−t} such that (x, y) leads to a "preferred" value of φ among all configurations z ∈ Ωs with z↓t = x. It is required that the extension y can either be computed directly by extending x to the domain s, or step-wise by first extending x to u and then to s for t ⊆ u ⊆ s.

Definition 1. For φ ∈ Φ with t ⊆ s = d(φ) and x ∈ Ωt, a set W_φ^t(x) is called a configuration extension set of φ from t to s, given x, if for all u with t ⊆ u ⊆ s,

W_φ^t(x) = { z ∈ Ω_{s−t} : z↓u−t ∈ W_{φ↓u}^t(x) and z↓s−u ∈ W_φ^u(x, z↓u−t) }.

Solutions lead to the "preferred" value of φ among all configurations in Ωs. Hence, if such a system of configuration extension sets is present in a valuation algebra, we may characterize a solution to φ ∈ Φ as an extension from the empty configuration to the domain of φ.

Definition 2. The solution set c_φ of φ ∈ Φ is defined as c_φ = W_φ^∅(⋄).

Examples of such systems of configuration extension sets and their induced solution sets for concrete valuation algebras are given in Section 4. The following lemma is an immediate consequence of the definition of solutions. It says that every solution to a projection of φ is also a projection of some solution to φ.

Lemma 1. For φ ∈ Φ and t ⊆ d(φ) it holds that c_{φ↓t} = c_φ^{↓t}.

We therefore refer to a projection x↓t of a solution x ∈ c_φ as a partial solution to φ with respect to t ⊆ d(φ). If s, t ⊆ d(φ) are two subsets of the domain of φ ∈ Φ, the partial solutions of φ with respect to s ∪ t may be obtained by extending the partial solutions of φ with respect to s from s ∩ t to t. This is the statement of the following theorem, which follows from Definition 1 and Lemma 1.

Theorem 1. For s, t ⊆ d(φ) we have

c_φ^{↓s∪t} = { z ∈ Ω_{s∪t} : z↓s ∈ c_φ^{↓s} and z↓t−s ∈ W_{φ↓t}^{s∩t}(z↓s∩t) }.

3.1 Generic Solution Construction
We now focus on the efficient computation of solutions for a valuation φ ∈ Φ that is given as factorization φ = φ1 ⊗ . . . ⊗ φn . Since computing φ is in general intractable, the proposed method assembles solutions to φ using partial solution extension sets obtained from the results of a previous run of the fusion algorithm (or any other local computation scheme). At the end of the fusion algorithm φ↓λ(k) is the only known projection of φ. But if we alternatively execute a
multi-query local computation architecture, then φ↓λ(i) is obtained for all nodes i = 1, . . . , k. Knowing these projections would allow us to build the complete solution set c_φ. Due to Lemma 1 and Definition 2, we have for the root node:

c_φ^{↓λ(k)} = W^∅_{φ↓λ(k)}( ).    (3)

It follows from Theorem 1 and the running intersection property that this partial solution set can be extended step-wise to the complete solution set c_φ.

Lemma 2. For i = k − 1, . . . , 1 and s = λ(k) ∪ . . . ∪ λ(i + 1) we have

c_φ^{↓(s∪λ(i))} = { z ∈ Ω_{s∪λ(i)} : z↓s ∈ c_φ^{↓s} and z↓(λ(i)−s) ∈ W^{λ(i)∩λ(ch(i))}_{φ↓λ(i)}(z↓(λ(i)∩λ(ch(i)))) }.

Note that the domains of the configuration extension sets computed in Lemma 2 are always bounded by the label λ(i) of the corresponding join tree node. Hence, this algorithm adopts the complexity of a local computation scheme. However, an important disadvantage is that we require a multi-query architecture to obtain the projections of φ to all labels. We therefore aim at a procedure which is based on the results of the fusion algorithm only. Lemma 1 displays how solution sets behave under the operation of projection. The following lemma supposes a similar property for combination and shows that the projections to λ(i) in (3) can then be replaced by the factors ψi obtained from the fusion algorithm.

Lemma 3. If configuration extension sets satisfy the property that for all φ1, φ2 ∈ Φ with d(φ1) = s, d(φ2) = t, s ⊆ u ⊆ s ∪ t and x ∈ Ω_u we have

W^{u∩t}_{φ2}(x↓(u∩t)) ⊆ W^u_{φ1⊗φ2}(x),    (4)

then it holds for i = k − 1, . . . , 1 and s = λ(k) ∪ . . . ∪ λ(i + 1) that

c_φ^{↓(s∪λ(i))} ⊇ { z ∈ Ω_{s∪λ(i)} : z↓s ∈ c_φ^{↓s} and z↓(λ(i)−s) ∈ W^{λ(i)∩λ(ch(i))}_{ψi}(z↓(λ(i)∩λ(ch(i)))) }.

The proof, given in [13], is based on the correctness of the Shenoy-Shafer architecture [7]. Hence, if we assume that configuration extension sets are always non-empty, then at least one solution to φ can be computed as follows: We execute a complete run of the fusion algorithm, build the configuration extension set for the root node using (3) and apply Lemma 3 to step-wise extend c_φ^{↓λ(k)} to the domain d(φ). The result of this process is a non-empty subset of the solution set c_φ. Alternatively, the proof of Lemma 3 also shows that if equality holds in (4) then it also does in the statement below. In this case all solutions can be found based on the results of the fusion algorithm.

Theorem 2. If inclusion (4) holds and configuration extension sets are non-empty, then a solution can be found based on the results of the fusion algorithm. If equality holds in (4), then all solutions can be found with the same procedure.
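To make Definitions 1 and 2 concrete, the brute-force sketch below computes configuration extension sets and the induced solution set for a small factorized valuation, taking "preferred" to mean "of maximal value" (anticipating the semiring instantiation of Section 4). The variables, frames, and factor tables are illustrative assumptions, and no local computation is performed; the sketch only illustrates the definitions and the step-wise extension property.

```python
# Brute-force illustration of Definitions 1 and 2 on a toy valuation
# phi = phi1 (x) phi2 over Boolean variables X, Y, Z.  "Preferred" is read as
# "maximal value" here, an assumption made for concreteness.
from itertools import product

DOMAIN = ("X", "Y", "Z")                       # d(phi)
FRAME = (0, 1)                                 # frame of every variable

PHI1 = dict(zip(product(FRAME, FRAME), (0.9, 0.2, 0.4, 0.7)))   # factor on {X, Y}
PHI2 = dict(zip(product(FRAME, FRAME), (0.1, 0.8, 0.6, 0.3)))   # factor on {Y, Z}

def phi(cfg):                                  # cfg: dict variable -> value
    return PHI1[(cfg["X"], cfg["Y"])] * PHI2[(cfg["Y"], cfg["Z"])]

def configs(variables):                        # all configurations over a variable set
    names = sorted(variables)
    for combo in product(FRAME, repeat=len(names)):
        yield dict(zip(names, combo))

def proj(t, x):                                # value of the projection of phi to t, at x
    return max(phi({**x, **y}) for y in configs(set(DOMAIN) - set(t)))

def W(t, x):                                   # configuration extension set of phi from t
    return [y for y in configs(set(DOMAIN) - set(t))
            if phi({**x, **y}) == proj(t, x)]

def W_proj(u, t, x):                           # extension set of the projection of phi to u
    return [y for y in configs(set(u) - set(t))
            if proj(u, {**x, **y}) == proj(t, x)]

c_phi = W((), {})                              # Definition 2: solution set of phi
# Definition 1, checked for u = {X}: extend the empty configuration to {X},
# then extend the result to the full domain -- this recovers c_phi.
stepwise = [{**x, **y} for x in W_proj(("X",), (), {}) for y in W(("X",), x)]
print(c_phi == stepwise)                       # True on this toy example
```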
4 Instantiations
We now survey some examples of valuation algebras and show that they provide a suitable notion of solutions to apply generic solution construction.

Semiring Constraint Systems. Semirings are algebraic structures with two binary operations + and × over a set A of values. A tuple ⟨A, +, ×, 0, 1⟩ is called a commutative semiring if both operations are associative and commutative and if × distributes over +. 0 and 1 are the neutral elements with respect to + and ×. A semiring valuation φ with domain d(φ) = s ⊆ r is a function φ : Ω_s → A that associates a value from a commutative semiring with each configuration x ∈ Ω_s. Combination and variable elimination are defined as follows:

– Combination: For φ, ψ ∈ Φ with d(φ) = s, d(ψ) = t and x ∈ Ω_{s∪t}

φ ⊗ ψ(x) = φ(x↓s) × ψ(x↓t).    (5)

– Variable Elimination: For φ ∈ Φ with X ∈ d(φ) = s and x ∈ Ω_{s−{X}}

φ^{−X}(x) = Σ_{z ∈ Ω_X} φ(x, z).    (6)

It was shown by [8] that every commutative semiring induces a valuation algebra by the above mapping and operations. Among them, semiring constraint systems are of particular interest. These are the valuation algebras induced by so-called c-semirings [2]. Here, we only require the weaker property of idempotency, i.e. for all a ∈ A we have a + a = a. It can easily be shown that idempotent semirings provide a partial order [11] satisfying a + b = sup{a, b} for all a, b ∈ A. Moreover, if this order is total, we have a + b = max{a, b}. Rewriting the inference problem (1) with empty query for this family of valuation algebras gives

φ↓∅( ) = max{φ(x) : x ∈ Ω_s}.    (7)

Hence, the inference problem for valuations induced by totally ordered, idempotent semirings turns into an optimization task. This covers crisp constraints induced by the Boolean semiring ⟨{0, 1}, max, min, 0, 1⟩, weighted constraints induced by the tropical semiring ⟨N ∪ {0, +∞}, min, +, ∞, 0⟩, probabilistic constraints induced by the t-norm semiring ⟨[0, 1], max, ×, 0, 1⟩ or bottleneck constraints induced by the semiring ⟨R ∪ {−∞, ∞}, max, min, −∞, ∞⟩. Equation (7) motivates the following definition of configuration extension sets for valuation algebras induced by totally ordered, idempotent semirings: For φ ∈ Φ with d(φ) = s, t ⊆ s and x ∈ Ω_t we define

W^t_φ(x) = { y ∈ Ω_{s−t} : φ(x, y) = φ↓t(x) }.    (8)

Theorem 3. Configuration extension sets in valuation algebras induced by totally ordered, idempotent semirings satisfy the property of Definition 1.
This follows directly from (8) and shows that configuration extension sets in semiring constraint systems instantiate the framework of Section 3. We next specialize the general solution sets in Definition 2:

c_φ = W^∅_φ( ) = { x ∈ Ω_s : φ(x) = φ↓∅( ) }.    (9)

Note that this indeed corresponds to the notion of solutions in constraint systems derived in equation (8). Furthermore, we also see that several possibilities to define configuration extension sets may exist in a valuation algebra. Instead of the configurations that map to the maximum value, we could also consider all other configurations that do not satisfy this property and modify (8) accordingly, which then equals the search for counter-models. This liberty comes from the fact that Definition 1 does not impose semantic restrictions on configuration extension sets, for example by giving a definition of "preferred" values.

Lemma 4. In a valuation algebra induced by a totally ordered, idempotent semiring we have for φ1, φ2 ∈ Φ with d(φ1) = s, d(φ2) = t, s ⊆ u ⊆ s ∪ t and x ∈ Ω_u

W^{u∩t}_{φ2}(x↓(u∩t)) ⊆ W^u_{φ1⊗φ2}(x).

Proof. Assume x ∈ Ω_u and y ∈ W^{u∩t}_{φ2}(x↓(t∩u)). It follows from equation (8) that

φ1(x↓s) × φ2(x↓(u∩t), y) = φ1(x↓s) × φ2↓(u∩t)(x↓(u∩t)).

We conclude y ∈ W^u_{φ1⊗φ2}(x) from applying axiom 5 to the above expression, i.e.

(φ1 ⊗ φ2)(x, y) = φ1(x↓s) × φ2(x↓(u∩t), y) = φ1(x↓s) × φ2↓(u∩t)(x↓(u∩t)) = (φ1 ⊗ φ2↓(u∩t))(x) = (φ1 ⊗ φ2)↓u(x).

Finally, we also conclude from (8) that these configuration extension sets are always non-empty. Altogether, this meets the requirements of Theorem 2. Therefore, applying the generic solution construction algorithm to valuation algebras induced by totally ordered, idempotent semirings always identifies at least one solution based on the results of a previously executed run of the fusion algorithm. For crisp constraints and probabilistic constraints this reproduces the approach presented in [3]. In addition, the class of valuation algebras induced by totally ordered, idempotent semirings also generalizes the formalisms studied in [17] and thus the corresponding scheme for computing solutions by local computation. However, it is shown in [11] that generally only inclusion holds in (4). This is for example the case for bottleneck constraints. The following theorem states a sufficient semiring property to guarantee equality, which then means that all solutions are found by the above procedure.

Lemma 5. If a totally ordered, idempotent semiring satisfies that for a, b, c ∈ A and c ≠ 0, a < b implies a × c < b × c, then equality holds in (4).

The proof is similar to Lemma 4 but requires excluding the case φ↓∅( ) = 0. For φ ∈ Φ with d(φ) = s, φ↓∅( ) = 0 implies that c_φ = Ω_s. Hence, all configurations are solutions and there is no need to determine c_φ algorithmically. Excluding this case is therefore not limiting. The complete proof is given in [11].
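The sketch below spells out how these semiring operations act on explicit valuation tables: combination follows Equation (5), variable elimination Equation (6), and the configurations selected by Equation (8) (with t = ∅) recover the optima of Equation (7). The weighted-constraint instance (tropical semiring, where the semiring order makes the smallest cost the "preferred" value) and the concrete cost tables are illustrative assumptions.

```python
# Semiring valuations as explicit tables over Boolean variables, parameterized by
# the two semiring operations.  The tropical (weighted-constraint) instance and
# the cost tables below are assumptions chosen for illustration only.
from itertools import product

class Valuation:
    def __init__(self, variables, table, add, mul):
        self.vars = tuple(variables)           # d(phi)
        self.table = dict(table)               # configuration tuple -> semiring value
        self.add, self.mul = add, mul

    def combine(self, other):                  # Equation (5)
        union = tuple(dict.fromkeys(self.vars + other.vars))
        out = {}
        for cfg in product((0, 1), repeat=len(union)):
            c = dict(zip(union, cfg))
            out[cfg] = self.mul(self.table[tuple(c[v] for v in self.vars)],
                                other.table[tuple(c[v] for v in other.vars)])
        return Valuation(union, out, self.add, self.mul)

    def eliminate(self, var):                  # Equation (6)
        keep = tuple(v for v in self.vars if v != var)
        idx = self.vars.index(var)
        out = {}
        for cfg, value in self.table.items():
            key = cfg[:idx] + cfg[idx + 1:]
            out[key] = value if key not in out else self.add(out[key], value)
        return Valuation(keep, out, self.add, self.mul)

# Tropical semiring <N u {inf}, min, +, inf, 0>: semiring addition is min, so the
# "maximal" element in the induced order is the numerically smallest cost.
add, mul = min, lambda a, b: a + b
phi1 = Valuation(("X", "Y"), {(0, 0): 3, (0, 1): 1, (1, 0): 0, (1, 1): 4}, add, mul)
phi2 = Valuation(("Y",), {(0,): 2, (1,): 5}, add, mul)

phi = phi1.combine(phi2)
best = phi.eliminate("X").eliminate("Y").table[()]                 # Equation (7)
solutions = [cfg for cfg, v in phi.table.items() if v == best]     # Equation (8), t empty
print(best, solutions)                                             # 2 [(1, 0)]
```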
Linear Equation Systems. We discuss solution construction in linear systems by focussing on symmetric, positive-definite systems. Some comments on general systems will be given below. In this context, it is more convenient to work with index sets instead of variables directly. Hence, we slightly change our notation and consider variables Xi taking indices from a set r = {1, . . . , m}. A system AX = b is said to be an s-system, if X is the variable vector whose components Xi have indices in s ⊆ r, A is a real-valued, symmetric, positive-definite s × s matrix and b is a real s-vector. Such systems are fully determined by the pair (A, b), and we write d(A, b) = s for the domain of this system. By convention we write ( , ) for the only possible system with empty domain. Now, suppose that we want to eliminate the variable Xi from the s-system AX = b with i ∈ s. We decompose the system with respect to {i} and s − {i} and obtain

[ A_{{i},{i}}    A_{{i},s−{i}}   ] [ X_{i}     ]   [ b_{i}     ]
[ A_{s−{i},{i}}  A_{s−{i},s−{i}} ] [ X_{s−{i}} ] = [ b_{s−{i}} ].

Then, the operation of variable elimination is defined as

(A, b)^{−i} = ( A_{s−{i},s−{i}} − A_{s−{i},{i}} (A_{{i},{i}})^{−1} A_{{i},s−{i}},  b_{s−{i}} − A_{s−{i},{i}} (A_{{i},{i}})^{−1} b_{i} ).

This corresponds to standard Gaussian variable elimination. We remark that the matrix component of the right-hand system is still symmetric, positive-definite. Next, consider an s-system A1 X1 = b1 and a t-system A2 X2 = b2 with s, t ⊆ r. The combination of the two systems is defined by component-wise addition

(A1, b1) ⊗ (A2, b2) = ( A1^{↑s∪t} + A2^{↑s∪t}, b1^{↑s∪t} + b2^{↑s∪t} ).    (10)

The notation A^{↑s∪t} and b^{↑s∪t} means vacuous extension to the union domain s ∪ t by adding a corresponding number of zeros. If we write Φ for the set of all possible systems with domains in r, then the algebra (Φ, P(r)) is isomorphic to the valuation algebra of Gaussian potentials studied in [7, 13]. In other words, (Φ, P(r)) is itself a valuation algebra. Factorizations (A, b) = (A1, b1) ⊗ . . . ⊗ (An, bn) of symmetric, positive-definite systems reflect the sparsity pattern contained in the matrix A. In contrast to semiring valuation algebras, equation systems are of polynomial size. Factorizations may therefore be produced by decomposing an existing system, but there are also applications where such factorizations occur naturally. An important example is given by the normal equations in the least squares method [13]. We next define configuration extension sets for symmetric, positive-definite systems: For a real t-vector x, φ = (A, b) and t ⊆ d(φ),

W^t_φ(x) = { (A_{s−t,s−t})^{−1} b_{s−t} − (A_{s−t,s−t})^{−1} A_{s−t,t} x }.    (11)

This satisfies the property of Definition 1. Hence, we obtain from Definition 2

c_φ = W^∅_φ( ) = { A^{−1} b }.

Note that c_φ indeed corresponds to the singleton set of the unique solution x = A^{−1} b to the symmetric, positive-definite system AX = b.
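As a quick numerical check of Equation (11), the NumPy sketch below extends a partial solution on a sub-domain t to the remaining variables of a small symmetric, positive-definite system; the 3 × 3 system itself is an illustrative assumption.

```python
# Numerical check of Equation (11) on an assumed 3x3 symmetric, positive-definite
# system: extending a partial solution on t by block back-substitution recovers
# the unique solution A^{-1} b.
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])       # symmetric, positive definite
b = np.array([1.0, 2.0, 3.0])

s = [0, 1, 2]                         # d(phi), written as an index set
t = [0]                               # sub-domain carrying the partial solution
rest = [i for i in s if i not in t]

x_full = np.linalg.solve(A, b)        # c_phi = { A^{-1} b }
x_t = x_full[t]                       # a partial solution with respect to t

A_rr = A[np.ix_(rest, rest)]
A_rt = A[np.ix_(rest, t)]
extension = np.linalg.solve(A_rr, b[rest]) - np.linalg.solve(A_rr, A_rt @ x_t)

assert np.allclose(extension, x_full[rest])   # extending x_t recovers the full solution
print(extension)
```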
Lemma 6. Symmetric, positive-definite systems satisfy the property that for all φ1, φ2 ∈ Φ with d(φ1) = s, d(φ2) = t, s ⊆ u ⊆ s ∪ t and u-vector x we have

W^{u∩t}_{φ2}(x↓(u∩t)) = W^u_{φ1⊗φ2}(x).

The straightforward proof can be found in [13]. Again, this property allows the application of the generic solution construction algorithm of Section 3.1 to solve factorized or decomposed, symmetric, positive-definite systems. In fact, this corresponds to the standard back-substitution process that follows Gaussian variable elimination. In combination with local computation for the variable elimination process, this is generally referred to as sparse matrix technique [14]. The above theory can also be applied to the valuation algebra of arbitrary linear systems developed in [7]. Since general systems may have no or infinitely many solutions, solution construction identifies the affine solution space rather than a single solution [13]. Finally, it is also shown in [7] that the valuation algebra of linear systems belongs to a larger class of valuation algebras called context valuation algebras that further includes systems of inequalities and various logics. All these formalisms provide a suitable notion of configuration extension sets and qualify for generic solution construction.

Quasi-Regular Valuation Algebras. Many important applications in computer science can be reduced to path problems in labeled graphs with values from a semiring. This is known as the algebraic path problem and typically includes the computation of shortest paths, connectivities, maximum capacities or reliabilities, but also other applications that are not directly related to graphs such as partial differentiation, matrix multiplication or tasks related to Markov chains. Examples are given in [15]. If r denotes a finite set of indices, the algebraic path problem requires solving a fixpoint equation system X = MX + b, where X is an s-vector of variables with indices from s ⊆ r, M is an s × s matrix with values from a semiring and b is an s-vector of semiring values. Such a system provides a solution M*b if the underlying semiring ⟨A, +, ×, 0, 1⟩ is quasi-regular, i.e. if for each element a ∈ A there exists a* ∈ A such that aa* + 1 = a*a + 1 = a*. The solution M* to the fixpoint equation system can then be computed by the well-known Floyd-Warshall-Kleene algorithm [10]. For example, the Boolean semiring is quasi-regular with 0* = 1* = 1. The tropical semiring of non-negative integers is quasi-regular with a* = 0 for all a ∈ N ∪ {0, ∞}, the probabilistic semiring is quasi-regular with a* = 1 for a ∈ [0, 1], and the bottleneck semiring is quasi-regular with a* = ∞ for all a ∈ R ∪ {−∞, ∞}. If M represents an adjacency matrix of a weighted graph with edge weights from the Boolean semiring, then M* gives the connectivity matrix. Alternatively, we may choose the tropical semiring to obtain shortest distances or the probabilistic semiring for maximum reliabilities. Instead of computing M* directly, we focus on computing a single row of this matrix. In terms of shortest distances, this corresponds to the single-source shortest distance problem. To determine the i-th row with i ∈ s we specify an s-vector b with b(i) = 1 and b(j) = 0 for all j ∈ s − {i}. Then, the solution to the system X = MX + b is M*b, which clearly corresponds to the i-th row of M*. Similar to the above valuation algebra of symmetric, positive-definite
systems, we represent such systems as pairs (M, b) and consider Φ to be the set of all possible s-systems with s ⊆ r. Then, (Φ, P(r)) again forms a valuation algebra for every quasi-regular semiring [13]. The combination rule is equal to (10) with semiring addition replacing addition of reals. Also, the operations of variable elimination are closely related: For φ = (M, b) with d(φ) = s and i ∈ s,

(A, b)^{−i} = ( A_{s−{i},s−{i}} + A_{s−{i},{i}} (A_{{i},{i}})^{*} A_{{i},s−{i}},  b_{s−{i}} + A_{s−{i},{i}} (A_{{i},{i}})^{*} b_{i} ).

Configuration extension sets are defined for φ = (M, b) with d(φ) = s, t ⊆ s and an arbitrary t-vector x of values from a quasi-regular semiring as

W^t_φ(x) = { (M_{s−t,s−t})^{*} (M_{s−t,t} x + b_{s−t}) }.

This again fulfills the requirements of Definition 1. Specializing Definition 2 to quasi-regular valuation algebras then gives

c_φ = W^∅_φ( ) = {M*b},    (12)

which indeed corresponds to the solution to the fixpoint system X = MX + b.

Lemma 7. Quasi-regular valuation algebras satisfy the property that for all φ1, φ2 ∈ Φ with d(φ1) = s, d(φ2) = t, s ⊆ u ⊆ s ∪ t and u-vector x we have

W^{u∩t}_{φ2}(x↓(u∩t)) = W^u_{φ1⊗φ2}(x).

Factorized fixpoint equation systems, which represent the sparsity pattern in the total matrix M, can thus be solved by generic solution construction. In case of the single-source shortest path problem, the fusion algorithm computes the shortest distances between a selected node and all other nodes and solution construction then identifies the actual paths for these distances.
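The sketch below implements the Floyd-Warshall-Kleene closure for an arbitrary quasi-regular semiring, following Lehmann's formulation, and instantiates it with the tropical semiring to obtain single-source shortest distances; the 4-node graph and its weights are illustrative assumptions.

```python
# Floyd-Warshall-Kleene closure over a quasi-regular semiring (Lehmann's
# formulation), instantiated with the tropical semiring (min, +, a* = 0) to
# compute shortest distances in an assumed 4-node weighted graph.
INF = float("inf")

def kleene_closure(M, add, mul, star, one):
    """Return M* = I (+) M (+) M^2 (+) ..."""
    n = len(M)
    A = [row[:] for row in M]
    for k in range(n):
        skk = star(A[k][k])
        B = [row[:] for row in A]
        for i in range(n):
            for j in range(n):
                B[i][j] = add(A[i][j], mul(A[i][k], mul(skk, A[k][j])))
        A = B
    for i in range(n):                 # finally add the identity matrix
        A[i][i] = add(A[i][i], one)
    return A

# Tropical semiring: add = min, mul = +, zero = inf, one = 0, a* = 0 for all a.
add, mul, star = min, (lambda a, b: a + b), (lambda a: 0.0)
M = [[INF, 2.0, 5.0, INF],            # M[i][j] = weight of edge i -> j
     [INF, INF, 1.0, 6.0],
     [INF, INF, INF, 1.0],
     [INF, INF, INF, INF]]

M_star = kleene_closure(M, add, mul, star, 0.0)
source = 0
# Choosing b as the unit vector at `source` singles out this row of M*, i.e. the
# single-source shortest distances mentioned above.
print(M_star[source])                  # [0.0, 2.0, 3.0, 4.0]
```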
5 Conclusion
The valuation algebra framework abstracts inference formalisms and enables the definition of generic inference procedures based on tree-decomposition techniques. Many important instances of this framework are defined over variable systems and determine the assignments of values to variables that are "preferred" over others. In contrast to inference, there is no generic procedure to determine such variable assignments, called solutions, although many specialized approaches for particular formalisms exist. This paper states formal requirements for the presence of solutions in valuation algebras and derives a generic algorithm to compute a single solution or all solutions to a factorized valuation. These computations are based on the intermediate results of the previously executed inference algorithm and therefore adopt the same complexity. In the second part of this paper, we instantiated the generic solution construction scheme to semiring constraint systems and linear systems over fields and observed that both instantiations
correspond to the well-known specialized approaches in these application domains. Finally, we presented for the first time a new family of instances called quasi-regular valuation algebras used to represent and solve sparse path problems in semiring-weighted graphs. Here, the generic inference algorithm for valuation algebras is used to determine the optimum path weight, and solution construction delivers the corresponding sequence of graph nodes that describes the path.
References 1. Arnborg, S., Corneil, D., Proskurowski, A.: Complexity of finding embeddings in a k-tree. SIAM J. of Algebraic and Discrete Methods 8, 277–284 (1987) 2. Bistarelli, S., Montanari, U., Rossi, F., Verfaillie, G., Fargier, H.: Semiring-based csps and valued csps: Frameworks, properties and comparison. Constraints 4(3) (1999) 3. Dechter, R.: Bucket elimination: a unifying framework for reasoning. Artif. Intell. 113, 41–85 (1999) 4. Dechter, R.: Constraint Processing. Morgan Kaufmann Publishers, San Francisco (2003) 5. Jensen, F., Lauritzen, S., Olesen, K.: Bayesian updating in causal probabilistic networks by local computation. Computational Statistics Quarterly 4, 269–282 (1990) 6. Kask, K., Dechter, R., Larrosa, J., Fabio, G.: Bucket-tree elimination for automated reasoning. Artif. Intell. 125, 91–131 (2001) 7. Kohlas, J.: Information Algebras: Generic Structures for Inference. Springer, Heidelberg (2003) 8. Kohlas, J., Wilson, N.: Semiring induced valuation algebras: Exact and approximate local computation algorithms. Artif. Intell. 172(11), 1360–1399 (2008) 9. Lauritzen, S., Spiegelhalter, D.: Local computations with probabilities on graphical structures and their application to expert systems. J. Royal Stat. Soc. B 50, 157– 224 (1988) 10. Lehmann, D.: Algebraic structures for transitive closure. Technical report, Department of Computer Science, University of Warwick (1976) 11. Pouly, M.: A Generic Framework for Local Computation. PhD thesis, Department of Informatics, University of Fribourg (2008) 12. Pouly, M.: Nenok - a software architecture for generic inference. Int. J. on Artif. Intel. Tools 19, 65–99 (2010) 13. Pouly, M., Kohlas, J.: Generic Inference - A unifying Theory for Automated Reasoning. Wiley & Sons, Chichester (2011) 14. Rose, D.: A graph-theoretic study of the numerical solution of sparse positive definite systems of linear equations. In: Read, R. (ed.) Graph Theory and Computing. Academic Press, London (1972) 15. Rote, G.: Path problems in graphs. Computing Suppl. 7, 155–198 (1990) 16. Shenoy, P.: Valuation-based systems: A framework for managing uncertainty in expert systems. In: Zadeh, L., Kacprzyk, J. (eds.) Fuzzy Logic for the Management of Uncertainty, pp. 83–104. Wiley & Sons, Chichester (1992) 17. Shenoy, P.: Axioms for dynamic programming. In: Gammerman, A. (ed.) Computational Learning and Probabilistic Reasoning, pp. 259–275. Wiley & Sons, Chichester (1996) 18. Shenoy, P., Shafer, G.: Axioms for probability and belief-function propagation. In: Shafer, G., Pearl, J. (eds.) Readings in Uncertain Reasoning, pp. 575–610. Morgan Kaufmann Publishers, San Francisco (1990)
Cross-Lingual Word Sense Disambiguation for Languages with Scarce Resources Bahareh Sarrafzadeh, Nikolay Yakovets, Nick Cercone, and Aijun An Department of Computer Science and Engineering, York University, Canada {bahar,hush,nick,aan}@cse.yorku.ca
Abstract. Word Sense Disambiguation has long been a central problem in computational linguistics. Word Sense Disambiguation is the ability to identify the meaning of words in context in a computational manner. Statistical and supervised approaches require a large amount of labeled resources as training datasets. In contradistinction to English, the Persian language has neither any semantically tagged corpus to aid machine learning approaches for Persian texts, nor any suitable parallel corpora. Yet due to the ever-increasing development of Persian pages in Wikipedia, this resource can act as a comparable corpus for English-Persian texts. In this paper, we propose a cross-lingual approach to tagging the word senses in Persian texts. The new approach makes use of English sense disambiguators, the Wikipedia articles in both English and Persian, and a newly developed lexical ontology, FarsNet. It overcomes the lack of knowledge resources and NLP tools for the Persian language. We demonstrate the effectiveness of the proposed approach by comparing it to a direct sense disambiguation approach for Persian. The evaluation results indicate a comparable performance to the utilized English sense tagger. Keywords: Word Sense Disambiguation, WordNet, Languages with Scarce Resources, Cross-Lingual, Extended Lesk, FarsNet, Persian.
1 Introduction
Human language is ambiguous, so that many words can be interpreted in multiple ways depending on the context in which they occur. While humans rarely think about the ambiguities of language, machines need to process unstructured textual information which must be analyzed in order to determine the underlying meaning. Word Sense Disambiguation (WSD) heavily relies on knowledge. Without knowledge, it would be impossible for both humans and machines to identify the words’ meaning. Unfortunately, the manual creation of knowledge resources is an expensive and time consuming effort, which must be repeated every time the disambiguation scenario changes (e.g., in the presence of new domains, different
languages, and even sense inventories) [1]. This is a fundamental problem which pervades approaches to WSD, and is called the knowledge acquisition bottleneck. With the huge amounts of information on the Internet and the fact that this information is continuously growing in different languages, we are encouraged to investigate cross-lingual scenarios where WSD systems are also needed. Despite the large number of WSD systems for languages such as English, to date no large scale and highly accurate WSD system has been built for the Farsi language due to the lack of labeled corpora and monolingual and bilingual knowledge resources. In this paper we propose a novel cross-lingual approach to WSD that takes advantage of available sense disambiguation systems and linguistic resources for the English language. Our approach demonstrates the capability to overcome the knowledge acquisition bottleneck for languages with scarce resources. This method also provides sense-tagged corpora to aid supervised and semi-supervised WSD systems. The rest of this paper is organized as follows: After reviewing related works in Section 2, we describe the proposed cross-lingual approach in Section 3, and a direct approach to WSD in Section 4; which is followed by evaluation results and a discussion in Section 5. In Section 6 our concluding remarks are presented and future extensions are proposed.
2 Related Work
We can distinguish different approaches to WSD based on the amount of supervision and knowledge they demand. Hence we can classify different methods into 4 groups [1]: Supervised, Unsupervised, Semi-supervised and Knowledge-based. Generally, supervised approaches to WSD have obtained better results than unsupervised methods. However, obtaining labeled data is not usually easy for many languages, including Persian as there is no sense tagged corpus for this language. The objective of Knowledge-based WSD is to exploit knowledge resources such as WordNet[2] to infer the senses of words in context. These methods usually have lower performance than their supervised alternatives, but they have the advantage of wider coverage, thanks to the use of large-scale knowledge resources. The recent advancements in corpus linguistics technologies, as well as the availability of more and more textual data encourage many researchers to take advantage of comparable and parallel corpora to address different NLP tasks. The following subsection reviews some of the related works which address WSD using a cross-lingual approach. 2.1
Cross-Lingual Approaches
Parallel corpora present a new opportunity for combining the advantages of supervised and unsupervised approaches, as well as an opportunity for exploiting translation correspondences in the text. Cross-lingual approaches to WSD
disambiguate target words by labelling them with the appropriate translation. The main idea behind this approach is that the plausible translations of a word in context restrict its possible senses to a subset [3]. In recent studies [4–7], it has been found that approaches that use cross-lingual evidence for WSD attain state-of-the-art performance in all-words disambiguation. However, the main problem of these approaches lies in the knowledge acquisition bottleneck: there is a lack of parallel and comparable corpora for several languages, including Persian, which can potentially be relieved by collecting corpora on the Web. To overcome this problem, we utilized Wikipedia pages in both Persian and English. Before introducing our WSD system, a brief survey of WSD systems for the Persian language follows.

2.2 Related Work for Persian
The lack of efficient, reliable linguistic resources and fundamental text processing modules for the Persian language make it difficult for computer processing. In recent years there have been two branches of efforts to eliminate this shortage [8]. Some researchers are working to provide linguistic resources and fundamental processing units. FarsNet [9] is an ongoing project to develop a lexical ontology to cover Persian words and phrases. It is designed to contain a Persian WordNet in its first phase and grow to cover verbs’ argument structures in its second phase. The included words and phrases are selected according to BalkaNet[10] base concepts and the most frequent Persian words and phrases in utilized corpora. Therefore, Persian WordNet goes closely in the lines and principles of Princeton WordNet, EuroWordNet and BalkaNet to maximize its compatibility to these WordNets and to be connected to the other WordNets in the world to enable cross-lingual tasks such as Machine Translation, multilingual Information Retrieval and developing multilingual dictionaries and thesauri. FarsNet 1.0 relates synsets in each POS category by the set of WordNet 2.1 relations. FarsNet also contains inter-lingual relations connecting Persian synsets to English synsets (in Princeton WordNet 3.0). [11] exploits an English-Persian parallel corpus which was manually aligned at the word level and sense-tagged a set of observations as a training dataset from which a decision tree classifier is learned. [8] devised a novel approach based on WordNet, eXtended WordNet[12] and verb parts of FarsNet to extend the Lesk algorithm[13] and find the appropriate sense of a word in an English sentence. Since FarsNet was not released at the time of publishing this paper, they manually translated a portion of WordNet to perform WSD for the Persian side. [14] defined heuristic rules based on the grammatical role, POS tags and co-occurrence words of both the target word and its neighbours to find the best sense. Others work on developing algorithms with less reliance on linguistic resources. We refer to statistical approaches [15–17] using monolingual corpora for solving the WSD problem in Farsi texts. Also conceptual categories in a Farsi thesaurus have been utilized to discriminate senses of Farsi homographs in [18]. Our proposed approach is unique, when compared to most cross-lingual approaches, in the sense that we utilize a comparable corpus, automatically
extracted from Wikipedia articles, which can be available for many language pairs even the languages with scarce resources, and our approach is not limited to sense tagged parallel corpora only. Second, thanks to the availability of FarsNet, our method tags Persian words using sense tags in the same language instead of using either a sense inventory of another language or translations provided by a parallel corpus. Therefore the results of our work can be applied to many monolingual NLP tasks such as Information Retrieval, Text Classification as well as bilingual ones including Machine Translation and Cross-Lingual tasks. Moreover, the extended version of the Lesk algorithm has never been exploited to address WSD for Persian texts. Finally, taking advantage of available mappings between synsets in WordNet and FarsNet, we were able to utilize an English sense tagger which uses WordNet as a sense inventory to sense tag Persian words.
3 Introducing the Cross-Lingual Approach: Persian WSD Using Tagged English Words
This approach consists of two separate phases. In the first phase we utilize an English WSD system to assign sense tags to words appearing in English sentences. In the second phase we transfer these senses to corresponding Persian words. Since by design these two phases are distinct, the first phase can be considered as a black box and different English WSD systems can be employed. What is more, the corresponding Persian words can be Persian pages in Wikipedia or Persian sentences in the aligned corpus. We created a comparable corpus by collecting Wikipedia pages which are available for both English and Persian languages and Persian articles are not shorter than 250 words. This corpus contains about 35000 words for the Persian side and 74000 words for English. Therefore, the Cross-Lingual system contains three main building blocks: English Sense Disambiguation (first phase), English to Persian Transfer (transition to the second phase) and Persian Sense Disambiguation (second phase). These components are described in the following sections. Figure 1 indicates the system’s architecture for the Cross-Lingual section. 3.1
English Sense Disambiguation
As mentioned, different English Sense Disambiguation systems can be employed in this phase. In this system we utilized the Perl-based application SenseRelate [19] for the English WSD phase. SenseRelate uses WordNet to perform knowledge-based WSD. This system allows a user to specify a range of settings to control the desired disambiguation. We selected the Extended Lesk algorithm which leads to the most accurate disambiguation [19]. As an input to SenseRelate we provided plain untagged text of English Wikipedia pages that was preprocessed according to application’s preconditions.
Fig. 1. Cross-Lingual System Architecture
We also provided a tweaked stopword list¹ that is more extensive than the one which came bundled with the application. SenseRelate will tag all ambiguous words in the input English texts using WordNet as a sense repository.

3.2 English to Persian Transfer
Running SenseRelate for input English sentences, we have English words tagged with sense labels. Each of these sense labels corresponds to a synset in WordNet containing that word in a particular sense. Most of these synsets have been mapped to their counterparts in FarsNet. In order to take advantage of these English tags for assigning appropriate senses to Persian words, first we transfer these synsets from English to Persian using interlingual relations provided by FarsNet. As FarsNet is mapped to WordNet 3.0 there are two inter-lingual relations; equal-to and near-equal-to between FarsNet and WordNet synsets. Due to the relatively small size of FarsNet we used both relations and did not distinguish between them. Exploiting these mappings, we match each WordNet synset which is assigned to a word in an English sentence to its corresponding synset in FarsNet. For this part, we developed a Perl-based XML-Parser and integrated the results into the output provided by SenseRelate. Along with transferring senses, we also need to transfer Wikipedia pages from English to Persian. Here, we choose the pages which are available in both languages. Hence we can work with the pages describing the same title in Persian. 1
The initial list is available at http://members.unine.ch/jacques.savoy/clef/persianST.txt, which was modified and extended according to the application requirements.
3.3 Persian Sense Disambiguation
There are two different heuristics for disambiguating senses [1]: – one sense per collocation: nearby words strongly and consistently contribute to determine the sense of a word, based on their relative distance, order, and syntactic relationship; – one sense per discourse: a word is consistently referred with the same sense within any given discourse or document; The first heuristic is applicable to any available parallel corpus for English Persian texts, and we can assign the same sense as the English word to its translation appearing in the aligned Persian sentence. In this case, we obtain a very high accuracy, although our system would be limited to this specific type of corpus. Alternatively, since parallel corpora are not easy to obtain for many language pairs, we utilize Wikipedia pages which are available in both English and Farsi as a comparable corpus. We used these pages in order to investigate the performance of our system on such corpus which is easier to collect for languages with scarce resources. Note that although Farsi pages are not the direct translation of English pages, the context is the same for all corresponding pages, which implies many common words appear in both pages. Consequently, we can assume domain-specific words appear with similar senses in both languages. Based on the second hypothesis as the context of both texts is the same, for each matched synset in FarsNet which contains a set of Persian synonym words, we find all these words in the Persian text and we assign the same sense as the English label to them. Since there may be English words which occurred multiple times in the text and they could receive different sense tags from SenseRelate, we transfer the most common sense to Persian equivalences. Here we can use either the “most frequent” sense provided by WordNet as the “most common” sense or choose the most local frequent sense (i.e., in that particular context). Since the second heuristic is more plausible we opted to apply the most frequent sense of each English word in that text to its Persian translations. As an example consider SenseRelate assigned the second sense of the noun “bank ” to this word in the following sentence: “a bank is a financial institution licensed by a government.” and this sense is the most frequent sense in this English article. The Persian equivalent noun (i.e., “bank ”) has six different senses. Among them we select the sense which is mapped to the second sense of word bank in WordNet and we assign this sense from FarsNet to “bank ”. We consider 3 possible scenarios: 1. An English word has more than one sense, while the equivalent Persian word only has one sense. So, SenseRelate disambiguates the senses for this English word, and the equivalent Persian word does not need disambiguation. For example “free” in English is a polysemic word which can mean both “able to act at will” and “costing nothing”, while we have different words for these
senses in Persian (“azad” and “majani” respectively). In this case we are confident that the transferred sense must be the correct sense for the Persian word. 2. Both the English and the Persian words are polysemous, so as their contexts are the same, the senses should be the same. In this case we use mappings between synsets in WordNet and FarsNet. For example, the word “branch” in English and its Persian equivalent “shakheh” are both polysemous with a similar set of senses. For example, if SenseRelate assigned the 5th sense (i.e., “a stream or river connected to a larger one”) of this word to its occurrence in an English sentence, the mapped synset in FarsNet would also correspond to this sense of the Persian “shakheh”. 3. The third scenario happens when an English word only has one sense, while the Persian equivalent has more than one. In this case, as the context of both texts is the same, the Persian word is more likely to occur with the same sense as the English word. For example, the noun “Milk” in English has only one meaning, while its translation in Farsi (i.e., “shir”) has three distinct meanings: milk, lion and (water) tap. However, since SenseRelate assigns a synset with this gloss “a white nutritious liquid secreted by mammals and used as food by human beings” to this word, the first sense will be selected for “shir”. In summary, for all 3 possible scenarios we utilize the mappings from WordNet synsets to FarsNet ones. However, according to our evaluation results, the first case usually leads to more accurate results and the third case results in the lowest accuracy. Nonetheless, when it comes to domain-specific words, all three cases result in a high precision rate.
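A minimal sketch of the transfer step, under the "one sense per discourse" heuristic: each English word keeps its most common SenseRelate tag for the article, the tag is mapped through the FarsNet inter-lingual relations, and every Persian word belonging to the matched synset inherits that sense. All synset identifiers, the WordNet-to-FarsNet mapping, and the transliterated tokens are invented placeholders; a real run would read them from the SenseRelate output and the FarsNet XML.

```python
# Hypothetical illustration of the English-to-Persian sense transfer.  Synset
# identifiers, the inter-lingual mapping, and the documents are invented
# placeholders, not actual WordNet or FarsNet data.
from collections import Counter

# SenseRelate-style output for one English article: (token, synset id) pairs.
english_tags = [("bank", "wn:bank#n#2"), ("bank", "wn:bank#n#2"), ("bank", "wn:bank#n#1")]

# Flattened equal-to / near-equal-to relations between WordNet and FarsNet synsets.
wordnet_to_farsnet = {"wn:bank#n#2": "fn:bank_fin", "wn:bank#n#1": "fn:bank_river"}

# Persian synonym words contained in each FarsNet synset.
farsnet_members = {"fn:bank_fin": ["bank"], "fn:bank_river": ["karaneh"]}

persian_tokens = ["bank", "vam", "bank", "karaneh"]   # tokenized comparable article

# 1. One sense per discourse: keep the most common tag of each English word.
senses_by_word = {}
for word, synset in english_tags:
    senses_by_word.setdefault(word, Counter())[synset] += 1
most_common = {w: c.most_common(1)[0][0] for w, c in senses_by_word.items()}

# 2. Transfer through the synset mapping and tag the Persian tokens.
persian_sense = {}
for synset in set(most_common.values()):
    farsnet_id = wordnet_to_farsnet.get(synset)
    for persian_word in farsnet_members.get(farsnet_id, []):
        persian_sense[persian_word] = farsnet_id

print([(tok, persian_sense.get(tok)) for tok in persian_tokens])
```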
4 Direct Approach: Applying Extended Lesk for Persian WSD
Thanks to the newly developed FarsNet, the Lesk method (gloss overlap) is applicable to Persian texts as well. Since it is worthwhile to investigate the performance of this knowledge-based method - which has not as yet been employed for disambiguating Persian words - and compare the results of both Cross-Lingual and Direct approaches, in the second part of this experiment, the Extended Lesk algorithm has been applied directly to Persian.

4.1 WSD Using the Lesk Algorithm
The Lesk algorithm uses dictionary definitions (gloss) to disambiguate a polysemous word in a sentence context. The original algorithm counts the number of words that are shared between two glosses. The more overlapping the glosses are, the more related the senses are. To disambiguate a word, the gloss of each of its senses is compared to the glosses of every other word in a phrase. A word is assigned to the sense whose gloss shares the largest number of words in common with the glosses of the other words.
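The toy sketch below illustrates the gloss-overlap scoring just described; the sense inventory and glosses are invented for illustration, whereas a real system would draw them from WordNet (or, for Persian, FarsNet).

```python
# A toy sketch of the original Lesk gloss-overlap idea.  The inventory of senses
# and glosses is invented for illustration only.
SENSES = {   # word -> {sense_id: gloss}
    "bank": {"bank#1": "sloping land beside a body of water such as a river",
             "bank#2": "a financial institution that accepts deposits and lends money"},
    "loan": {"loan#1": "money lent by a financial institution at interest"},
}

def overlap(gloss_a, gloss_b):
    return len(set(gloss_a.split()) & set(gloss_b.split()))

def lesk(target, context_words):
    """Pick the sense of `target` whose gloss overlaps most with the glosses of
    the senses of the other words in the phrase."""
    best_sense, best_score = None, -1
    for sense, gloss in SENSES[target].items():
        score = sum(overlap(gloss, other_gloss)
                    for word in context_words if word != target and word in SENSES
                    for other_gloss in SENSES[word].values())
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

print(lesk("bank", ["bank", "loan"]))   # -> "bank#2" for this toy inventory
```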
The major limitation to this algorithm is that dictionary glosses are often quite brief, and may not include sufficient vocabulary to identify related senses. An improved version of the Lesk Algorithm - Extended Lesk [20] - has been employed to overcome this limitation.

4.2 Extended Gloss Overlap
The Extended Lesk algorithm extends the glosses of the concepts to include the glosses of other concepts to which they are related according to a given concept hierarchy. Synsets are connected to each other through explicit semantic relations that are defined in WordNet. These relations only connect word senses that are used in the same part of speech. Noun synsets are connected to each other through hypernym, hyponym, meronym, and holonym relations. There are other types of relations between different parts of speech in WordNet, but we focused on these four types in this paper. These relations are also available for Persian synsets in FarsNet. Thus, the extended gloss overlap measure combines the advantages of gloss overlaps with the structure of a concept hierarchy to create an extended view of relatedness between synsets.

4.3 Applying Extended Lesk to Persian WSD
In order to compare the results of Direct and Cross-Lingual approaches, the output from the Cross-Lingual phase is used as an input to the knowledge based (direct) phase. Each tagged word from the input is considered as a target word to receive the second sense tag based on the extended Lesk algorithm. We adopted the method described in [20] to perform WSD for the Persian language. Persian glosses were collected using the semantic relations implemented for FarsNet. STeP-1 [21] was used for tokenizing glosses and stemming the content words.
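A correspondingly small sketch of the extended variant used here: before scoring, the gloss of each candidate sense is concatenated with the glosses of synsets reachable through hypernym/hyponym/meronym/holonym links, and the result is scored against a bag of words from the surrounding context (a simplification). The tiny relation graph, glosses, and transliterated sense names are invented; in the actual system they come from FarsNet and are tokenized and stemmed with STeP-1.

```python
# Extended gloss overlap on invented data: each sense's gloss is enlarged with
# the glosses of its related synsets before the overlap is counted.
GLOSS = {"shir#milk": "white liquid produced by mammals",
         "shir#lion": "large wild cat living in africa",
         "dairy": "food produced from the milk of mammals",
         "feline": "any member of the cat family"}

RELATED = {"shir#milk": ["dairy"],        # e.g. hypernym/holonym links
           "shir#lion": ["feline"]}

def extended_gloss(sense):
    parts = [GLOSS[sense]] + [GLOSS[r] for r in RELATED.get(sense, [])]
    return " ".join(parts)

def score(sense, context_text):
    return len(set(extended_gloss(sense).split()) & set(context_text.split()))

context = "cheese and other food made from the milk of cows"
best = max(["shir#milk", "shir#lion"], key=lambda s: score(s, context))
print(best)   # -> "shir#milk" on this toy data
```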
5 Evaluation

5.1 Cross-Lingual Approach
The results of this method have been evaluated on comparable English and Persian Wikipedia pages. Seven human experts - who are all native Persian speakers - were involved in the evaluation process; they evaluated each tagged word as “the best sense assigned”, “almost accurate” and “wrong sense assigned”. The second option considers cases in which the assigned sense is not the best available sense for a word in a particular context, but it is very close to the correct meaning (not a wrong sense) which is influenced by the evaluation metric proposed by Resnik and Yarowsky in [22]. Currently the tagged words from each Wikipedia article were evaluated by one evaluator only. Evaluation results indicate an error rate of 25% for these pages. Table 1 summarizes these results.
Table 1. Evaluation Results

                      Cross-Lingual   Direct   Baseline
P: Best Sense         68%             51%      39%
P: Almost Accurate    7%              9%       8%
P: Wrong Sense        25%             40%      53%
R (Recall)            0.35            0.35     0.35
F-Score               0.48            0.44     0.40
Our results indicate that the domain-specific words which usually occur frequently in both English and Persian texts are highly probable to receive the correct sense tag. Due to the relatively smaller size of Persian texts, this system suffers from a low recall of 35%. However, as Wikipedia covers more and more Persian pages every day, soon we will be able to overcome this bottleneck. According to the evaluation results, our Cross-Lingual method gained an Fscore2 of 0.48 which is comparable to 0.54 F-score of SenseRelate using Extended Lesk [19]. This indicates the performance of our approach can reach the F-score of the utilized English tagger. Employing a more accurate English sense tagger thus improves the WSD results for Persian words by far. This system can be further evaluated by comparing its output to the results of assigning either random senses or the first sense to words. Since the senses in FarsNet are not sorted based on their frequency of usage (as compared to WordNet), we decided to use the first sense appearing in FarsNet (for each POS). Assigning the first sense to all tagged Persian words, the performance decreased significantly in terms of accuracy. The results in Table 1 indicate that, applying our novel approach results in a 28% improvement in accuracy in comparison with this selected baseline. However, assigning the most frequent sense to Persian words would be a more realistic baseline which yields a better estimation for our system’s performance. Thus by the time the frequency of usage is provided for FarsNet senses, we anticipate that this problem will be minimized. 5.2
Direct Knowledge-Based Approach
As mentioned, the output of the Cross-lingual method was tagged again using the Direct approach. Overall, 53% of the words received a different tag using the Direct approach. Table 1 indicates the evaluation results for this approach.

5.3 Comparison: Knowledge Based vs. Cross-Lingual
Both systems employ the Extended Lesk algorithm. While the Cross-Lingual method applies Extended Lesk on the English side and transfers senses to Persian words, the Direct approach works with Persian text directly. In other words, 2
F-Score is calculated as 2 · (1 − ErrorRate) · Recall / (1 − ErrorRate + Recall), where ErrorRate is the percentage of words that have been assigned the wrong sense.
the former considers the whole text as the context and assigns one sense per discourse and the latter considers surrounding words and assigns one sense per collocation. Furthermore, the Cross-Lingual method exploits WordNet for extending the glosses which covers more words, senses and semantic relations than FarsNet which is employed by the Direct method. The main advantage of the Cross-Lingual method is that we can utilize any highly accurate English sense disambiguator for the first phase while the Persian side remains intact. On the other hand, this approach assigns the same tag (the most common sense) to all occurrences of a word which sacrifices accuracy. Moreover, if there is no English text with the same context available for a Persian corpus, this method cannot be applied. However collecting comparable texts over the web is not difficult. Finally, when the bilingual texts are not the direct translation of one another the system coverage will be limited to common words in both English and Persian texts. So, Cross-Lingual method mainly works well for domain words and not for all the words appearing in the Persian texts. Although Persian WSD while working with Persian texts directly seems to be more promising the evaluation results indicate a better performance for the Cross-Lingual system. The reasons for this observation have been investigated and are as follows: 1. Lack of reliable NLP tools for the Persian language. While STeP-1 has just been made available as a tokenizer and a stemmer, there is no POS tagger for Persian which complicated the disambiguation process. 2. Lack of comprehensive linguistics resources for the Persian language. FarsNet is a very valuable resource for the Persian language. However it is still at a preliminary stage of development and does not cover all words and senses in Persian. In terms of size it is significantly smaller (10000 synsets) than WordNet (more than 117000 synsets) and it covers roughly 9000 relations between both senses and synsets. 3. More ambiguity for Farsi words. Disambiguating a Farsi word is a big challenge. Due to the fact that the short vowels are not written in the Farsi prescription, one needs to consider all types of homographs including heteronyms and homonyms. Moreover, there is no POS tagger to disambiguate Farsi words which dramatically increases the ambiguity for many Farsi words.
6 Conclusion and Future Work
A large number of WSD systems for widespread languages such as English is available. However, to date no large scale and highly accurate WSD system has been built for the Farsi language due to the lack of labeled corpora and monolingual and bilingual knowledge resources. In this paper we overcame this problem by taking advantage of English sense disambiguators, availability of articles in both languages in Wikipedia and the newly developed lexical ontology, FarsNet, in order to address WSD for Persian. The evaluation results of the Cross-lingual approach show a 28% improvement in
accuracy in comparison with the first-sense baseline. The Cross-Lingual approach performed better than the knowledge based approach which is directly applied to Persian sentences. However, one of the main reasons for this performance is that the lack of NLP tools and comprehensive knowledge resources for Persian introduces many challenges for systems investigating this language. This paper in the first step examined a novel idea for cross-lingual WSD in terms of plausibility, feasibility and performance. The ultimate results of our approach demonstrate a comparable performance to the utilized English sense tagger. Therefore, in the next step we will replace SenseRelate with another English sense tagger with a higher F-score. Gaining higher accuracy and recall for the Persian WSD system we can exploit it as a part of a bootstrapping system to create the first sense tagged corpus to aid supervised WSD approaches for the Persian language. Finally, as the available tools and resources improve for the Persian language, the Direct approach can be employed to address WSD for Persian texts directly when there is no comparable English text is available. Acknowledgements. We would like to thank Prof. Shamsfard from the Natural Language Processing Research laboratory of Shahid Beheshti University (SBU) for providing us with the FarsNet 1.0 package.
References 1. Navigli, R.: Word sense disambiguation: A survey. ACM Computing Surveys (2009) 2. Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.J.: Introduction to WordNet: An On-line Lexical Database. International Journal of Lexicography (1990) 3. Brown, P.F., Pietra, S.A.D., Pietra, V.J.D., Mercer, R.L.: A statistical approach to sense disambiguation in machine translation. In: Proceedings of the Workshop on Speech and Natural Language (1991) 4. Diab, M., Resnik, P.: An unsupervised method for word sense tagging using parallel corpora. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (2002) 5. Mihltz, M., Pohl, G.: Exploiting Parallel Corpora for Supervised Word-Sense Disambiguation in English-Hungarian Machine Translation. In: Proceedings of the 5th Conference on Language Resources and Evaluation (2006) 6. TufiS ¸ , D., Ion, R., Ide, N.: Fine-grained word sense disambiguation based on parallel corpora, word alignment, word clustering and aligned wordnets. In: Proceedings of the 20th International Conference on Computational Linguistics (2004) 7. Tufi¸s, D., Koeva, S.: Ontology-Supported Text Classification Based on CrossLingual Word Sense Disambiguation. In: Proceedings of the 7th International Workshop on Fuzzy Logic and Applications: Applications of Fuzzy Sets Theory (2007) 8. Motazedi, Y., Shamsfard, M.: English to persian machine translation exploiting semantic word sense disambiguation. In: 14th International CSI Computer Conference, CSICC 2009 (2009) 9. Shamsfard, M., Hesabi, A., Fadaei, H., Mansoory, N., Famian, A., Bagherbeigi, S., Fekri, E., Monshizadeh, M., Assi, S.M.: Semi Automatic Development of FarsNet; The Persian WordNet. In: Proceedings of 5th Global WordNet Conference (2010)
10. Stamou, S., Oflazer, K., Pala, K., Christoudoulakis, D., Cristea, D., Tufis, D., Koeva, S., Totkov, G., Dutoit, D., Grigoriadou, M.: BALKANET: A Multilingual Semantic Network for the Balkan Languages. In: Proceedings of the 1st Global WordNet Association Conference (2002) 11. Faili, H.: An experiment of word sense disambiguation in a machine translation system. In: International Conference on Natural Language Processing and Knowledge Engineering, NLP-KE 2008 (2008) 12. Harabagiu, S.M., Miller, G.A., Moldovan, D.I.: Wordnet 2 - a morphologically and semantically enhanced resource (1999) 13. Lesk, M.: Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In: Proceedings of the 5th Annual International Conference on Systems Documentation (1986) 14. Saedi, C., Shamsfard, M., Motazedi, Y.: Automatic Translation between English and Persian Texts. In: In Proceedings of the 3rd Workshop on Computational Approaches to Arabic-script Based Languages (2009) 15. Mosavi Miangah, T., Delavar Khalafi, A.: Word Sense Disambiguation Using Target Language Corpus in a Machine Translation System (June 2005) 16. Soltani, M., Faili, H.: A statistical approach on persian word sense disambiguation. In: 2010 The 7th International Conference on Informatics and Systems, INFOS (2010) 17. Mosavi Miangah, T.: Solving the Polysemy Problem of Persian Words Using Mutual Information Statistics. In: Proceedings of the Corpus Linguistics Conference (CL 2007) (2007) 18. Makki, R., Homayounpour, M.: Word Sense Disambiguation of Farsi Homographs Using Thesaurus and Corpus. In: Advances in Natural Language Processing (2008) 19. Pedersen, T., Kolhatkar, V.: WordNet:SenseRelate:AllWords: a broad coverage word sense tagger that maximizes semantic relatedness. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Demonstration Session (2009) 20. Banerjee, S.: Extended gloss overlaps as a measure of semantic relatedness. In: Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, pp. 805–810 (2003) 21. Shamsfard, M., Sadat Jafari, H., Ilbeygi, M.: STeP-1: A Set of Fundamental Tools for Persian Text Processing. In: Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC 2010) (2010) 22. Resnik, P., Yarowsky, D.: Distinguishing systems and distinguishing senses: new evaluation methods for Word Sense Disambiguation. Nat. Lang. Eng. (1999)
COSINE: A Vertical Group Difference Approach to Contrast Set Mining Mondelle Simeon and Robert Hilderman Department of Computer Science University of Regina Regina, Saskatchewan, Canada S4S 0A2 {simeon2m,hilder}@cs.uregina.ca
Abstract. Contrast sets have been shown to be a useful mechanism for describing differences between groups. A contrast set is a conjunction of attribute-value pairs that differ significantly in their distribution across groups. These groups are defined by a selected property that distinguishes one from the other (e.g customers who default on their mortgage versus those that don’t). In this paper, we propose a new search algorithm which uses a vertical approach for mining maximal contrast sets on categorical and quantitative data. We utilize a novel yet simple discretization technique, akin to simple binning, for continuous-valued attributes. Our experiments on real datasets demonstrate that our approach is more efficient than two previously proposed algorithms, and more effective in filtering interesting contrast sets.
1 Introduction
Discovering the differences between groups is a fundamental problem in many disciplines. Groups are defined by a selected property that distinguishes one group from the other. For example, gender (male and female students) or year of admission (students admitted from 2001 to 2010). The group differences sought are novel, implying that they are not obvious or intuitive, potentially useful, implying that they can aid in decision-making, and understandable, implying that they are presented in a format easily understood by human beings. For example, financial institutions may be interested in analyzing historical mortgage data to understand the differences between individuals who default and those who don’t. Analysis may reveal that individuals who have married have lower default rates. Contrast set mining [1] [2] [3] [4] has been developed as a data mining task which aims to efficiently identify differences between groups from observational multivariate data. The contrast set mining techniques previously proposed have all been based on a horizontal mining approach that has been restricted to categorical attributes or a limited number of quantitative attributes. In this paper, we propose a new vertical mining approach for generating contrast sets, which can be applied to any number of categorical and quantitative attributes. This technique allows simultaneous candidate generation and support counting unlike horizontal approaches,
and it allows for efficient pruning of the search space. A novel yet simple discretization method that is based on the statistical properties of the data values, is utilized in order to produce intervals for continuous-valued attributes. The remainder of this paper is organized as follows. In Section 2, we briefly review related work. In Section 3, we describe the contrast set mining problem. In Section 4, we provide an overview of the vertical data format and the search framework for contrast set mining. In Section 5, we introduce our algorithm for mining maximal contrast sets. In Section 6, we present a summary of experimental results from a series of mining tasks. In Section 7, we conclude and suggest areas for future work that are being considered.
2 Related Work
The STUCCO (Search and Testing for Understandable Consistent Contrasts) algorithm [1] [2] which is based on the Max-Miner rule discovery algorithm [5], was introduced as a technique for mining contrast sets. The objective of STUCCO is to find statistically significant contrast sets from grouped categorical data. It employed a breadth-first search to enumerate the search space and used the chi-squared (χ2 ) test to measure independence and employed a modified Bonferroni statistic to limit type-1 errors resulting from multiple hypothesis tests. This algorithm formed the basis for a method proposed to discover negative contrast sets [6] that can include negation of terms in the contrast set. The main difference was their use of Holm’s sequential rejective method [7] for the independence test. The CIGAR (Contrasting Grouped Association Rules) algorithm [3] was proposed as a contrast set mining technique that not only considers whether the difference in support between groups is significant, but it also specifically identifies which pairs of groups are significantly different and whether the attributes in a contrast set are correlated. CIGAR utilizes the same general approach as STUCCO, however it focuses on controlling Type II error through increasing the significance level for the significance tests, and by not correcting for multiple corrections. Contrast set mining has also been attempted on continuous data. One of the earliest attempts focussed on the formal notion of a time series contrast set [8] and proposed an efficient algorithm to discover timeseries contrast sets on timeseries and multimedia data. The algorithm utilizes a SAX alphabet [9] to convert continuous data to discrete data (discretization). Another approach utilized a modified equal-width binning interval where the approximate width of the intervals is provided as a parameter to the model [4]. The methodology used is similar to STUCCO, with the discretization process added so that it takes place before enumerating the search space.
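For the significance component that STUCCO-style methods rely on, a chi-squared test of independence between contrast-set membership and group membership can be computed directly, as in the sketch below. It uses SciPy's chi2_contingency on made-up counts; the modified Bonferroni correction and the other cutoff adjustments described above are not reproduced here.

```python
# STUCCO-style significance check for one candidate contrast set: a chi-squared
# test of independence on the 2 x k contingency table of (contrast set present /
# absent) versus group membership.  The counts are made up for illustration.
from scipy.stats import chi2_contingency

#                 group 1   group 2
table = [[120,       45],      # transactions containing the contrast set
         [380,      455]]      # transactions not containing it

chi2, p_value, dof, expected = chi2_contingency(table)
alpha = 0.05                    # STUCCO adjusts this cutoff across the search levels
print(chi2, p_value, p_value < alpha)
```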
3 Problem Definition
Let A = {a1 , a2 , · · · , an } be a set of distinct attributes. We use Q and C to denote the set of quantitative attributes and the set of categorical attributes
respectively. Let V(ak) be the set of possible values that each ak can take on. An attribute-interval pair, denoted ak : [vkl, vkr], is an attribute ak associated with an interval [vkl, vkr], where ak ∈ A and vkl, vkr ∈ V(ak). Further, if ak ∈ C then vkl = vkr, and if ak ∈ Q then vkl ≤ vkr. A transaction T is a set of values {x1, x2, x3, · · · , xn}, where xj ∈ V(aj) for 1 ≤ j ≤ n. A database D is a set of transactions. A database has a class F, which is a set F = {a1, a2, · · · , ak}, where each ai ∈ A and 1 ≤ |F| < |A|. A group, G, is a conjunction of distinct class attribute-interval pairs. Formally,

G = {a1 : [v1l, v1r] ∩ · · · ∩ an : [vnl, vnr]}, ai, aj ∈ F, ai ≠ aj, ∀i, j.

A quantitative contrast set, X, is a conjunction of attribute-interval pairs having distinct attributes defined on groups G1, G2, · · · , Gn. Formally,

X = {a1 : [v1l, v1r] ∩ · · · ∩ an : [vnl, vnr]}, ai, aj ∈ A − F, ai ≠ aj, ∀i, j,
∃ X ∩ G1, X ∩ G2, · · · , X ∩ Gn : Gi ∩ Gj = ∅, ∀i ≠ j.

Henceforth, a contrast set refers to a quantitative contrast set. Given a contrast set, X, we define its attribute-interval set, denoted AI(X), as the set {ai : [vil, vir] | ai : [vil, vir] ∈ X}. A contrast set X is called k-specific if the cardinality of its attribute-interval set, |AI(X)|, is equal to k. Given two contrast sets, X and Y, we say that X is a subset of Y, denoted X ⊂ Y, if AI(X) ⊂ AI(Y). The frequency of a quantitative contrast set X in D, denoted freq(X), is the number of transactions in D where X occurs. The tidset of a contrast set, X, is the set t(X) ⊆ T consisting of all the transactions which contain X. The diffset of a contrast set, X, is the set d(X) ⊆ T consisting of all the transactions which do not contain X. The support of X for a group Gi, denoted supp(X, Gi), is the percentage of transactions in the database that belong to Gi where X occurs. A contrast set is called maximal if it is not a subset of any other contrast set. A contrast set, X, is called a group difference if and only if the following four criteria are satisfied:

∃ i, j : supp(X, Gi) ≠ supp(X, Gj)    (1)

max_{i,j} |supp(X, Gi) − supp(X, Gj)| ≥ δ    (2)

freq(X) ≥ σ    (3)

max_{i} supp(Y, Gi) / supp(X, Gi) ≤ κ    (4)

where δ is a threshold called the minimum support difference, σ is a minimum frequency threshold, κ is a threshold called the maximum subset support ratio,
Table 1. Dataset

TID  A  B  C  D  E
 1   1  0  1  1  1
 2   0  1  1  0  1
 3   1  1  0  0  1
 4   1  0  1  1  1
 5   0  0  0  1  1
and Y ⊃ X with |AI(Y)| = |AI(X)| + 1. The first criterion ensures that the contrast set represents a true difference between the groups. Contrast sets that meet this criterion are called significant. The second criterion ensures a sufficiently large effect size. Contrast sets that meet this criterion are called large. The third criterion ensures that the contrast set occurs in a large enough number of transactions. Contrast sets that meet this criterion are called frequent. The fourth criterion ensures that the support of the contrast set in each group is different from that of its superset. Contrast sets that meet this criterion are called specific. The task of finding all group differences from the set of all contrast sets becomes prohibitively expensive because of a possibly exponentially sized search space. However, a more manageable task is to find the set of maximal group differences. Our goal then is to find all the maximal group differences in a given dataset (i.e., all the maximal contrast sets that satisfy Equations 1, 2, 3, and 4).
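To make the criteria concrete, the following minimal Python sketch (ours, not the authors' implementation) checks the frequency and support-difference criteria for a candidate contrast set on the toy data of Table 1. The group assignment and thresholds are illustrative assumptions, and the statistical test behind criterion (1) is omitted here.

```python
def tidset(column):
    """Transactions (by TID) in which an item is present."""
    return {tid for tid, v in column.items() if v == 1}

# Toy data from Table 1: item -> {TID: 0/1}
data = {
    'A': {1: 1, 2: 0, 3: 1, 4: 1, 5: 0},
    'B': {1: 0, 2: 1, 3: 1, 4: 0, 5: 0},
    'C': {1: 1, 2: 1, 3: 0, 4: 1, 5: 0},
    'D': {1: 1, 2: 0, 3: 0, 4: 1, 5: 1},
    'E': {1: 1, 2: 1, 3: 1, 4: 1, 5: 1},
}
groups = {'G1': {1, 2, 3}, 'G2': {4, 5}}      # assumed group membership

def support(tids, group):
    return len(tids & group) / len(group)

def is_group_difference(items, min_sup_diff=0.2, min_freq=1):
    tids = set.intersection(*(tidset(data[i]) for i in items))
    sups = [support(tids, g) for g in groups.values()]
    significant = len(set(sups)) > 1                   # criterion (1), without the chi-squared test
    large = max(sups) - min(sups) >= min_sup_diff      # criterion (2)
    frequent = len(tids) >= min_freq                   # criterion (3)
    return significant and large and frequent

print(is_group_difference(['A', 'D']))
```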
4 Background

4.1 Data Format
Our algorithm uses a vertical data format, since we manipulate tidsets to determine the frequency of contrast sets. Mining algorithms using the vertical format have been shown to be very effective and usually outperform horizontal approaches [10] [11]. We specifically utilize diffsets, which have been shown to substantially improve the running time of algorithms that use them instead of the traditional tidsets [11] [12].
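As an illustration of why diffsets are attractive here, the short sketch below (ours, not the paper's code) contrasts tidsets and diffsets on the Table 1 data; the propagation identity d(PXY) = d(PY) − d(PX) follows the diffset literature [11] [12].

```python
all_tids = {1, 2, 3, 4, 5}
tid = {'A': {1, 3, 4}, 'B': {2, 3}, 'C': {1, 2, 4}, 'D': {1, 4, 5}, 'E': set(all_tids)}
diff = {item: all_tids - t for item, t in tid.items()}       # diffsets of the 1-specific sets

# frequency from a diffset: freq(X) = |D| - |d(X)|
assert len(all_tids) - len(diff['A']) == len(tid['A'])

# extending prefix E: diffsets of extensions are computed against the prefix's tidset
d_EA = tid['E'] - tid['A']                                   # d(EA) = t(E) - t(A)
d_ED = tid['E'] - tid['D']
d_EAD = d_ED - d_EA                                          # d(EAD) = d(ED) - d(EA)
freq_EA = len(tid['E']) - len(d_EA)                          # freq(EA) = freq(E) - |d(EA)|
print(freq_EA - len(d_EAD))                                  # freq(EAD) = 2 (TIDs 1 and 4)
```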
4.2 Search for Quantitative Contrast Sets
Our algorithm uses a backtracking search paradigm in order to enumerate all maximal group differences. Backtracking algorithms are useful because they allow us to iterate through all the possible configurations of the search space. Consider a sample dataset shown in Table 1 with five attributes, A, B, C, D, and E, each taking on values of 0 and 1 indicating absence and presence, respectively, in a transaction. Each transaction is identified by a TID. The full search space tree is shown in Figure 1. The root of the tree corresponds to the combine set {A, B, C, D, E}, which is composed of the 1-specific contrast sets from the items shown in Table 1.
Fig. 1. Search Tree: Square indicates maximal contrast sets
All these contrast sets share the empty prefix in common. The leftmost child of the root consists of all the subsets containing A as the prefix, i.e., the set {AB, AC, AD, AE}, and so on. A combine set lists the contrast sets that the prefix can be extended with to obtain a new contrast set. Clearly, no subtree of a node that fails to satisfy Equations 1, 2, 3, and 4 has to be examined. The main advantage of this approach is that it allows us to break up the original search space into independent sub-problems. The subtree rooted at A can be treated as a completely new problem such that the contrast sets under it can be enumerated, prefixed with the contrast set A, and so on. Formally, for a set of contrast sets with prefix P, [P] = {X1, X2, · · · , Xn}, the intersection of PXi with all of the PXj with j > i is performed to obtain a new combine set [PXi] whose members PXiXj meet Equations 1, 2, 3, and 4. For example, from [A] = {B, C, D, E}, we obtain [AB] = {C, D, E}, [AC] = {D, E}, [AD] = {E}, and [AE] = {} for the next level of the search tree. A node with an empty combine set, such as [AE], need not be explored further.
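The prefix-extension scheme just described can be sketched as a small recursive routine; the listing below is an illustrative simplification in which the pruning predicate of Equations 1-4 is replaced by a placeholder that accepts everything.

```python
# Simplified sketch of the prefix/combine-set enumeration of Section 4.2.
def expand(prefix, combine, passes):
    """Enumerate candidate contrast sets by extending `prefix` with its combine set."""
    for i, x in enumerate(combine):
        new_prefix = prefix + [x]
        # candidates that appear after x in the combine set form the new combine set
        new_combine = [y for y in combine[i + 1:] if passes(new_prefix + [y])]
        print(new_prefix, '->', new_combine)
        if new_combine:
            expand(new_prefix, new_combine, passes)

expand([], ['A', 'B', 'C', 'D', 'E'], passes=lambda cs: True)
```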
4.3 Distribution Difference
We utilize an interestingness measure, referred to in this paper as the distribution difference, which measures how different the group support of the contrast set is from that of the entire dataset [4]. Formally, the distribution difference of a contrast set, X, is

Distribution Difference(X) = Σ_{i=1}^{n} | (n(X, Gi) / n(X)) × (N / n(Gi)) − 1 |

where n is the number of groups, N is the total number of transactions, n(Gi) is the number of transactions that belong to Gi, n(X) is the number of transactions where X occurs, and n(X, Gi) is the number of transactions in group Gi where X occurs.
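A hedged rendering of this measure in Python is shown below; it assumes N is the total number of transactions and sums absolute per-group deviations, and the counts in the usage line are invented for illustration.

```python
def distribution_difference(n_x_g, n_g):
    """n_x_g[i]: transactions of group i containing X; n_g[i]: size of group i."""
    N = sum(n_g)
    n_x = sum(n_x_g)
    return sum(abs((xi / n_x) * (N / gi) - 1) for xi, gi in zip(n_x_g, n_g))

# X occurs in 8 of 10 transactions of G1 but only 2 of 10 of G2 (illustrative numbers)
print(distribution_difference([8, 2], [10, 10]))   # 1.2
```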
5 Our Proposed Approach
In this section, we introduce our vertical approach to contrast set mining and describe it using the dataset in Table 1.
5.1 Tests for Significance
Like STUCCO, in order to determine if a contrast set is significant we use a 2 × G contingency table where the row represents the truth of the contrast set and the column indicates group membership. We use the standard test for independence of variables in contingency tables, the χ2 statistic. To correct for small sample sizes (i.e., less than 1000), we use Fisher's exact test when the number of groups is two, and the Yates correction otherwise. Also like STUCCO, we use a Bonferroni-like adjustment to reduce the number of false discoveries.
5.2 Comparison of Contrasting Groups
In determining statistical significance, when we reject the null hypothesis, we can conclude that a significant difference exists amongst the groups. When there are only two groups, we know that the difference lies between "Group 1" and "not Group 1" (i.e., Group 2). However, when there are more than two groups, we do not have enough information to determine specifically amongst which groups the differences lie. We use a set of 2 × 2 contingency tables representing the absence and presence of each group and determine with which pairs there is a significant difference. This is referred to as the one-versus-all approach. Formally, with the one-versus-all approach, for a contrast set X, we test, for each group Gi, the null hypothesis

P(X | Gi) = P(X | ¬Gi), ∀i.    (5)
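The sketch below illustrates one way to run this one-versus-all comparison with SciPy; it simplifies the rule of Section 5.1 by switching between Fisher's exact test and the Yates-corrected chi-squared test purely on sample size, and the counts are invented for illustration.

```python
from scipy.stats import fisher_exact, chi2_contingency

def one_versus_all(counts_with_x, group_sizes, alpha=0.05):
    """counts_with_x[i]: transactions of group i containing X."""
    N, n_x = sum(group_sizes), sum(counts_with_x)
    flagged = []
    for i, (a, n_i) in enumerate(zip(counts_with_x, group_sizes)):
        # 2x2 table: rows = X present / absent, columns = G_i / not G_i
        table = [[a, n_x - a], [n_i - a, (N - n_i) - (n_x - a)]]
        if N < 1000:
            _, p = fisher_exact(table)
        else:
            _, p, _, _ = chi2_contingency(table, correction=True)
        if p < alpha:
            flagged.append(i)
    return flagged

print(one_versus_all([40, 5, 6], [50, 50, 50]))
```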
5.3 Discretization
In order to determine intervals for quantitative attributes, we use a discretization approach to determine the endpoints of each interval. Our algorithm uses statistical properties of the values (i.e., the mean and standard deviation) to determine where an interval begins and ends. This makes our approach simple, akin to simple binning methods, which use a fixed number of intervals, yet more responsive to the distribution of the values in determining the number of intervals. Our Discretize algorithm, shown in Algorithm 1, takes a set of values for a quantitative attribute and returns a list of cut-points. The algorithm starts by sorting the values in ascending order. The minimum, maximum, mean and standard deviation, Vmin, Vmax, Vmean, Vsd, respectively, are determined. Vmean is the first cut-point. The algorithm then generates cut-points that remain at least half a standard deviation away from the minimum and maximum values. For example, assume that the minimum and maximum values for an attribute in a set of transactions are 19.4 and 45.8, respectively, with a mean of 28.5 and a standard deviation of 3.5. Lcp would be (28.5 − 3.5 = 25.0), and Rcp would be
Algorithm 1. Discretize Algorithm
Input: A set of values V
Output: A list of cut-points C
1: Discretize(V)
2:   C = ∅
3:   Sort V
4:   Calculate Vmin, Vmax, Vmean, Vsd
5:   Lcp = Vmean − Vsd
6:   Rcp = Vmean + Vsd
7:   while Lcp ≥ Vmin + 0.5 × Vsd do
8:     C = C ∪ {Lcp}
9:     Lcp = Lcp − Vsd
10:  end while
11:  while Rcp ≤ Vmax − 0.5 × Vsd do
12:    C = C ∪ {Rcp}
13:    Rcp = Rcp + Vsd
14:  end while
(28.5 + 3.5 = 32.0) initially. Since both values are at least half a standard deviation away from the minimum and maximum values, respectively, they are added to C. The process is repeated, generating additional cut-points of 21.5, 35.5, 39, and 42.5.
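A compact Python rendering of Algorithm 1 is given below; following the text, the mean is included as the first cut-point (the pseudocode leaves this implicit), and the population standard deviation is assumed.

```python
import statistics

def discretize(values):
    values = sorted(values)
    v_min, v_max = values[0], values[-1]
    mean, sd = statistics.mean(values), statistics.pstdev(values)
    cuts = [mean]                      # the mean is taken as the first cut-point
    lcp, rcp = mean - sd, mean + sd
    while lcp >= v_min + 0.5 * sd:     # cut-points towards the minimum
        cuts.append(lcp)
        lcp -= sd
    while rcp <= v_max - 0.5 * sd:     # cut-points towards the maximum
        cuts.append(rcp)
        rcp += sd
    return sorted(cuts)

print(discretize([12.0, 15.5, 18.0, 21.0, 23.5, 30.0]))
```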
5.4 Mining Maximal Group Differences
In order to find all the maximal group differences in a given dataset, i.e., all the quantitative contrast sets that satisfy Equations 1, 2, 3, and 4, we present our algorithm, COSINE (Contrast Set Exploration using Diffsets), in Algorithm 2. It adapts several tenets of the back-tracking search technique first proposed in [11] for contrast set mining. COSINE begins by determining all the 1-specific quantitative contrast sets from the value set V of each attribute in the dataset that is not in the class F, and storing them in B (lines 1-6). Quantitative attributes are discretized using our Discretize algorithm to obtain a value set V from which 1-specific quantitative contrast sets can be generated. For each element x in B, COSINE determines its diffset, Dx, its frequency, Fx, and the cardinality of its potential combine set, |Cx|. It then uses a one-versus-all approach to determine with which specific groups the differences lie, and adds the contrast sets that satisfy Equations 1, 2, and 3 into a combine set C0 (lines 8-14). C0 is then sorted in ascending order of the cardinality |Cx| and then by the frequency Fx (line 15). Using these two criteria to order the combine set has been shown to more likely eliminate many branches in the search tree from consideration and to produce a smaller backtracking tree [11]. COSINE then calls a subroutine, MINE, presented in Algorithm 3, with C0, with M, which will hold all our maximal group differences at the end, and with the prefix P0 (line 16). If we consider the example in Figure 1, COSINE starts at the root of the tree with P0 = ∅ and with {A, B, C, D, E}, sorted as {E, D, C, B, A}, as C0.
Algorithm 2. COSINE(D, F)
Input: Dataset D and class F
Output: The set of all maximal group differences M
1: for each i ∈ A, A ∈ D, i ∉ F do
2:   if i ∈ Q then
3:     V(i) = Discretize(i)
4:   end if
5:   B = B ∪ V(i)
6: end for
7: C0 = {}
8: for each x ∈ B do
9:   Determine Dx, Fx, and |Cx|
10:  if significant(x) & large(x) & frequent(x) then
11:    Determine P(x | Gi) = P(x | ¬Gi), ∀i
12:    C0 = C0 ∪ {x}
13:  end if
14: end for
15: Sort C0 in increasing |Cx|, then in increasing Fx
16: MINE(P0, C0, M)
17: return M
MINE first determines Pl+1, which is simply {x}. Secondly, it determines a new possible set of combine elements for Pl+1, Hl+1, by first stripping the previous prefix Pl from Pl+1, creating P′l+1. It then determines, from the list of elements in Cl, those which are greater than (appear after) Pl+1. For any such element, y, MINE strips it of the prefix Pl, creating y′. It then checks whether the attribute-interval sets of P′l+1 and y′ are different. P′l+1 and y′ are 1-specific contrast sets, and if they have the same attribute-interval set, it means they originate from the same attribute and cannot be part of a new contrast set, as we require contrast sets to have unique attributes. If they are not equal, y is added to Hl+1 (lines 4-12). In our example, P1 = {E}, and since P0 = {}, then H1 = {D, C, B, A}. MINE next determines whether the cardinality of the current set of maximal contrast sets, Ml, is greater than zero. If it is, MINE checks if Pl+1 ∪ Hl+1 is subsumed by an existing maximal set. If yes, the current and subsequent contrast sets in Cl can be pruned away (lines 13-17). If not, an extension is necessary. MINE then creates a new combine set, Cl+1, by combining the prefix Pl+1 with each member y of the possible set of combine elements, Hl+1, to create a new contrast set z. For each z, it calculates its diffset, Dz, and its frequency, Fz, and then determines whether Equations 1, 2, 3, and 4 are satisfied. Each combination, z, that satisfies the criteria is added to a new combine set Cl+1 (lines 20-27). Cl+1 is sorted in increasing order of the frequency of its members. Re-ordering a combine set in increasing order of frequency has been shown to more likely produce small combine sets at the next level [12]. This suggests that contrast sets with a lower
Algorithm 3. MINE(Pl, Cl, Ml)
1: for each x ∈ Cl do
2:   Pl+1 = {x}
3:   Hl+1 = ∅
4:   Let P′l+1 = Pl+1 − Pl
5:   for each y ∈ Cl do
6:     if y > Pl+1 then
7:       Let y′ = y − Pl
8:       if AI(y′) ≠ AI(P′l+1) then
9:         Hl+1 = Hl+1 ∪ {y}
10:      end if
11:    end if
12:  end for
13:  if |Ml| > 0 then
14:    if ∃Z ∈ Ml : Z ⊇ Pl+1 ∪ Hl+1 then
15:      return
16:    end if
17:  end if
18:  LMl+1 = ∅
19:  Cl+1 = ∅
20:  for each y ∈ Hl+1 do
21:    z = Pl+1 ∪ {y}
22:    Determine Dz and Fz
23:    if significant(z) & large(z) & frequent(z) & specific(z) then
24:      Determine P(z | Gi) = P(z | ¬Gi), ∀i
25:      Cl+1 = Cl+1 ∪ {z}
26:    end if
27:  end for
28:  Sort Cl+1 by increasing Fz, ∀z ∈ Cl+1
29:  if Cl+1 = ∅ then
30:    if ∄Z ∈ Ml : Z ⊇ Pl+1 then
31:      Ml = Ml ∪ {Pl+1}
32:    end if
33:  else
34:    Ml+1 = {M ∈ Ml : x ∈ M}
35:  end if
36:  if Cl+1 ≠ ∅ then
37:    MINE(Pl+1, Cl+1, Ml+1)
38:  end if
39:  Ml = Ml ∪ Ml+1
40: end for
frequency at one level are less likely to produce contrast sets that meet our frequency threshold on the next level. In our example, M1 = ∅, and C1 = {ED, EC, EB, EA}. After creating the new combine set, Cl+1 , if it is empty and Pl+1 is not a subset of any maximal contrast set in Ml , Pl+1 is added to Ml (lines 29-32). Otherwise, a new set of local maximal contrast sets, Ml+1 , is created based on
Table 2. Dataset Description

Data Set   Description               # Transactions  # Attributes  # Groups
Census     Census data               32561           14            2
Mushroom   Mushroom characteristics  8124            22            2
Thyroid    Thyroid disease data      7200            21            3
Pendigits  Handwritten digits        10992           16            10
the notion of progressive focusing [11] [12], whereby only the contrast sets in Ml that contain all the contrast sets in Pl are added to Ml+1 (line 34). This allows the number of maximal contrast sets of interest to be narrowed down as recursive calls are made. If Cl+1 is not empty, MINE is called again with Pl+1, Cl+1, and the set of new local maximal contrast sets, Ml+1 (lines 36-38). After the recursion completes, the set of maximal contrast sets, Ml, is updated with the elements from Ml+1 (line 39). From our example, since C1 ≠ ∅, we skip the superset check and create M1 = {}. MINE is then called recursively with prefix E and combine set {ED, EC, EB, EA}. This process continues until all the maximal contrast sets are identified.
6 Experimental Results
In this section, we present the results of an experimental evaluation of the COSINE algorithm, which was implemented in Java and run on an Intel dual-core processor with 4 GB of memory. Discovery tasks were performed on four datasets obtained from the UCI Machine Learning Repository [13]. The characteristics of the four datasets are shown in Table 2.
6.1 Efficiency Evaluation
We ran a series of discovery tasks on the Census, Mushroom, Thyroid, and Pendigits datasets in order to compare the efficiency of COSINE with that of STUCCO and CIGAR. We implemented STUCCO and CIGAR in the same language and ran them on the same platform as COSINE. Although they each have different objectives and thus place different constraints on the search process, STUCCO, CIGAR, and COSINE all use the support difference as a constraint, so we can measure the time taken to complete the discovery task as the support difference varies. We ran STUCCO, CIGAR, and COSINE, using a significance level of 0.95, on the four datasets, of which the Mushroom dataset and a subset of the Census dataset were utilized in [1] and [3]. Figure 2 shows the results comparing the run time to the minimum support difference. We use a minimum frequency threshold of 0 and a maximum subset support ratio of 0 for COSINE. The results have been averaged over 10 consecutive runs. We use the same parameters for CIGAR as outlined in [3] for these datasets. We also ran COSINE without controlling for Type I errors, referred to as COSINE-1 in Figure 2.
Fig. 2. CPU time versus support difference on the (a) Census, (b) Mushroom, (c) Thyroid, and (d) Pendigits datasets. Each panel plots Time (s) against Support Difference (%) for COSINE, STUCCO, CIGAR, and COSINE-1.
On all four datasets, both COSINE and COSINE-1 outperformed STUCCO and CIGAR. This observation was most acute on the Mushroom dataset when the minimum support difference is 0. Above a minimum support difference of 10, there is no difference in runtime amongst STUCCO, COSINE, and COSINE-1. The run time observed for STUCCO on the Mushroom dataset is consistent with that in [1]. On both the Thyroid and Census datasets, the difference in runtime becomes negligible as the minimum support difference increases, while on the Pendigits dataset the runtime difference between STUCCO and COSINE remains substantial even at the largest support difference measured, 30. For all four datasets, CIGAR consistently has the longest runtime.
6.2 Interestingness Evaluation
In this section, we examine the effectiveness of the maximum subset support ratio in terms of the interestingness of the contrast sets that are discovered. Table 3 shows the average distribution difference of the maximal contrast sets discovered for each of the four datasets as the maximum subset support ratio is varied. These results were generated with a minimum frequency threshold of 0, a significance level of 0.95, and a minimum support difference of 0. For each of the four datasets, as the maximum subset support ratio is varied from 0 to 0.5, we can observe an increase in the average distribution difference
Table 3. Effectiveness of the Maximum Subset Support Ratio

           Distribution Difference at Maximum Subset Support Ratio
Data Set   0     0.01  0.05  0.1   0.5
Census     0.35  0.45  1.37  2.17  2.54
Mushroom   0.87  1.23  1.45  2.01  2.76
Thyroid    1.24  1.55  1.87  2.98  3.21
Pendigits  1.98  2.34  2.87  3.41  3.65
of the contrast sets discovered. This indicates that the discovered contrast sets have a distribution amongst the groups that is substantially different from that of the entire dataset, and are therefore interesting. The maximum subset support ratio thus serves as a good filter for producing interesting contrast sets.
7 Conclusion
In this paper, we introduced and demonstrated an approach for mining maximal group differences: COSINE. COSINE mines maximal contrast sets that are significant, large, frequent, and specific from categorical and quantitative data, and utilizes a discretization technique that uses the mean and standard deviation of continuous-valued attributes to determine the number of intervals. We compared our approach with two previous contrast set mining approaches, STUCCO and CIGAR, and found our approach to be more efficient. Finally, we showed that the maximum subset support ratio was effective in filtering interesting contrast sets. Future work will examine further search space reduction techniques.
References 1. Bay, S.D., Pazzani, M.J.: Detecting change in categorical data: Mining contrast sets. In: KDD, pp. 302–306 (1999) 2. Bay, S.D., Pazzani, M.J.: Detecting group differences: Mining contrast sets. Data Min. Knowl. Discov. 5(3), 213–246 (2001) 3. Hilderman, R., Peckham, T.: A statistically sound alternative approach to mining contrast sets. In: AusDM, pp. 157–172 (2005) 4. Simeon, M., Hilderman, R.J.: Exploratory quantitative contrast set mining: A discretization approach. In: ICTAI, vol. (2), pp. 124–131 (2007) 5. Bayardo Jr., R.J.: Efficiently mining long patterns from databases. In: SIGMOD Conference, pp. 85–93 (1998) 6. Wong, T.T., Tseng, K.L.: Mining negative contrast sets from data with discrete attributes. Expert Syst. Appl. 29, 401–407 (2005) 7. Holm, S.: A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6, 65–70 (1979) 8. Lin, J., Keogh, E.J.: Group SAX: Extending the notion of contrast sets to time series and multimedia data. In: F¨ urnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) PKDD 2006. LNCS (LNAI), vol. 4213, pp. 284–296. Springer, Heidelberg (2006)
9. Lin, J., Keogh, E.J., Lonardi, S., chi Chiu, B.Y.: A symbolic representation of time series, with implications for streaming algorithms. In: DMKD, pp. 2–11 (2003) 10. Savasere, A., Omiecinski, E., Navathe, S.B.: An efficient algorithm for mining association rules in large databases. In: VLDB, pp. 432–444 (1995) 11. Zaki, M.J., Gouda, K.: Fast vertical mining using diffsets. In: KDD, pp. 326–335 (2003) 12. Gouda, K., Zaki, M.J.: Genmax: An efficient algorithm for mining maximal frequent itemsets. Data Min. Knowl. Discov. 11(3), 223–242 (2005) 13. Asuncion, A., Newman, D.: UCI machine learning repository (2007)
Hybrid Reasoning for Ontology Classification Weihong Song1,2 , Bruce Spencer1,2 , and Weichang Du1 1
Faculty of Computer Science, University of New Brunswick 2 National Research Council, Canada {song.weihong,bspencer,wdu}@unb.ca
Abstract. Ontology classification is an essential reasoning task for ontology-based systems. Tableau and resolution are the two dominant types of reasoning procedures for ontology reasoning. Complex ontologies are often built on more expressive description logics and are usually highly cyclic. When reasoning over complex ontologies, both approaches may have difficulties in terms of reasoning results and performance, but for different ontology types. In this research, we investigate a hybrid reasoning approach, which employs well-defined strategies to decompose and modify a complex ontology into subsets of ontologies based on the capabilities of different reasoners, processes the subsets with suitable individual reasoners, and combines the individual classification results into the overall classification result. The objective of our approach is to detect more subsumption relationships than individual reasoners for complex ontologies, and to improve overall reasoning performance. Keywords: Hybrid reasoning, Complex ontology, Classification, Tableau, Resolution.
1 Challenge for Ontology Classification
Ontology classification, which means computing the subsumption relation between all pairs of concepts, is the foundation for other ontology reasoning problems. Our task in this paper is classification, i.e., TBox [1] reasoning. We consider an ontology to be complex in one of two cases: it uses an expressive language, or it is highly cyclic. Ontologies use Description Logic (DL) to define concepts, and more complex ontologies require more expressive languages. For example, SROIQ(D) is more expressive than SHIQ. For the second case, definitions are cyclic when concepts are defined in terms of themselves or in terms of other concepts that indirectly refer to them. When numerous concepts in the ontology are cyclic, we say the ontology is highly cyclic. These two situations are independent; an ontology may be cyclic but use a simple DL, or it may use a complex DL but be acyclic. We often encounter ontologies that exhibit both of the complexities mentioned: an expressive DL language and a highly cyclic structure. Due to functional and performance issues, such ontologies often cannot be classified by any available reasoner individually, in terms of functionality and/or machine capacity. If they can, they require powerful computers with large memory and much time.
Table 1. Results of Performance Evaluation (Mem: memory; T: time; C/I: complete/incomplete result; M: Mbytes; S: seconds; -: the reasoner failed to return a result)

                     Hermit                       Pellet                     CB
Ontology     Mem(M)   T(S)       C/I     Mem(M)   T(S)     C/I     Mem(M)  T(S)    C/I
Dolce all    28.77    202.91     C       146.48   132.00   C       12.50   0.10    I
SnomedCT     4403.20  3420.02    C       8704.01  1200.00  C       767.00  62.00   C
Galen-Heart  4408.00  190800.05  C       -        -        -       21.00   0.91    C
Galen-Full   -        -          -       -        -        -       947.00  66.78   C
FMAC         3891.20  3494.00    C       -        -        -       212.00  11.40   I
FMA          -        -          -       -        -        -       580.11  32.66   I

2 Problem: Limitations of the DL Reasoning Procedures
Next, we illustrate the functional and performance issues of the two reasoning procedures. First, we describe the functional problem. Tableau-based procedures are applicable to both simple and expressive DL languages, up to languages as expressive as SROIQ(D). However, tableau procedures are not able to deal with ontologies whose concepts are highly cyclic. Representative reasoners using the tableau procedure are Pellet [6] and Hermit [5]. Resolution-based reasoning procedures can only cope with a smaller and less expressive subset of DL, such as SHIQ. However, they are very effective for the second kind of complexity and can handle highly cyclic ontologies. CB [3] and KAON2 [4] are examples of reasoners applying resolution. If an ontology has both kinds of complexity, it is very hard for a single reasoner to classify it. FMA is such an example; it is one of the largest and most complex medical ontologies. None of the current reasoners can fully process it because of its two types of complexity. Even its subset FMA-constitutionalPartForNS (FMAC) [2] cannot be fully classified by any current resolution reasoner, because its DL language is beyond SHIQ, and none of the current tableau reasoners can handle the biomedical ontology Galen-Full, because it is highly cyclic. The functional problem is one of the dominant problems that prevent reasoning technology from being widely used. Table 1 shows all the results, obtained using a powerful computer (Intel Xeon 8-core CPU, 40 GB memory, 64-bit OS). Another major problem concerns memory and time performance. Resolution reasoners are often significantly faster and also use much less memory than tableau reasoners for the same task. Tableau-based reasoning builds large structures which greatly consume memory space. For the large ontology SNOMED-CT, as well as the medium-sized but complex ontologies FMAC and Galen-Heart, the tableau reasoners need huge amounts of memory, from 3891.20 MB to 8704.01 MB; this much memory is not available on a common PC. As for time efficiency, CB took about 1/20 and 1/50 of the time of Pellet and Hermit, respectively, when processing SNOMED-CT, and on Galen-Heart CB needed only 0.91 seconds compared to Hermit's 190800.05 seconds.
3 Proposed Solution: Hybrid Reasoning
We propose a hybrid reasoning approach to classify complex ontologies. Each kind of reasoning procedure, and its corresponding reasoners, has its advantages and limitations. We propose to assemble the two kinds of reasoning technology together to accomplish complex reasoning tasks which cannot be done with acceptable performance by any individual reasoner. An ontology is composed of many concepts; the definition of a concept consists of many axioms. The basic idea of our approach is to separate the ontology based on the reasoning abilities of different reasoners, i.e., to separate the two sorts of complexity into different pieces of the ontology and assign them to the more capable or efficient reasoner. In more detail: suppose we have a complex ontology T, a resolution-based reasoner Rr, and a tableau-based reasoner Rt. We construct Tr, which is derived from T by removing axioms that contain elements from languages that are more expressive than Rr can handle, with the result that Tr can be completely classified by Rr. This result, however, may be only a partial classification of T. After that, we use Rt to do a second round of classification on selected concepts C' whose classification results may be affected by the axioms removed in T − Tr. The purpose of this step is to reduce the number of concepts the tableau reasoner must work on to compensate for Rr's incomplete result, and thus obtain sound and complete results together with better overall performance. It is possible that C' is still highly cyclic. Then, a simplification similar to the one used for Tr can be employed to obtain Tt, which is a subset of C'. This time we remove a different source of complexity: we remove some axioms in C' that lead to many definition cycles to construct the simplified ontology fragment Tt; we might also use the previous reasoning results and inject some subsumption results into Tt. Then, we let Rt reason on Tt and obtain more classification results. Further iteration rounds may be required; each time we use the previous classification results. The approach is based on the following observation. Proposition 1. Let T′ be a terminology consisting of a subset of the axioms in a consistent terminology T. For any p, if T′ entails p, then T entails p. Based on this proposition, all the subsumption relationships we get from reasoning on Tr hold on the entire terminology T. And if we further inject some of these sound results into the ontology, the future results will also be sound. However, there may be other subsumption relationships that we neglected because of removing axioms. In other words, the subsumptions we get from Rr or Rt are sound but not complete, and we need more rounds of classification using Rt and Rr to make the results complete. The benefits of this hybrid reasoning approach lie in two aspects: 1. Functional aspect: Compared with resolution-based reasoners, our approach is able to do reasoning on a more expressive DL. Compared with tableau-based reasoners, the proposed approach is able to do classification correctly on some highly cyclic concepts.
2. Performance aspect: Our approach outperforms tableau-based reasoners since we allocate some classification tasks to the more efficient resolution-based reasoners.
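The following toy, self-contained Python sketch illustrates the intended two-stage flow described above; the axiom tags, the SHIQ feature list, and the stand-in classify functions are all invented placeholders rather than a real DL reasoner or API.

```python
# Toy illustration only: axioms are tagged with the DL feature they need, the
# "resolution" stage (Rr) handles just the features in SHIQ_SUPPORTED, and the
# "tableau" stage (Rt) is rerun on concepts whose definitions lost axioms.
SHIQ_SUPPORTED = {"existential", "universal", "number-restriction"}

axioms = [
    ("Heart", "existential"), ("Heart", "nominal"),   # "nominal" is beyond SHIQ here
    ("Valve", "universal"), ("Aorta", "number-restriction"),
]

def resolution_classify(axs):                 # stand-in for Rr run on Tr
    return {(concept, "Organ") for concept, _ in axs}

def tableau_classify(concepts):               # stand-in for Rt run on Tt
    return {(concept, "CyclicStructure") for concept in concepts}

T_r = [a for a in axioms if a[1] in SHIQ_SUPPORTED]        # simplified ontology Tr
removed = [a for a in axioms if a not in T_r]
subsumptions = resolution_classify(T_r)                    # sound but partial result
affected = {concept for concept, _ in removed}             # concepts C' for the 2nd pass
subsumptions |= tableau_classify(affected)
print(sorted(subsumptions))
```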
4 Challenge and Potential Impacts for This Proposal
Our challenge is to identify and solve the new problems that arise in applying our approach so that the hybrid reasoner can be effective. 1. How to reduce the size of C'. When we remove an axiom from the definition of a concept A, the basic way to identify the concepts C' affected by this removal is to find all the concepts which depend on A, together with their transitive dependents. Depending on the concrete axiom removed and the kind of dependency relationship with A, the challenge is to find the concepts that will in fact not be affected by the removal, and thereby further reduce the set C'. 2. How to break cycles in Tt. We need to define various properties of the cyclical situation, including the number of concepts contributing to a cycle and the number of cycles a concept contributes to. We also need a strategy to identify these properties and to study their effects on tableau reasoners. The challenge lies in how to choose the axioms to be removed and how to inject previous reasoning results so that the resulting Tt suffers minimal loss of completeness from the removed axioms, while the cycles are reduced enough that Rt can classify Tt and supply additional classification results. 3. How to ensure completeness of the classification results through the combination of different strategies while improving the efficiency of reasoning in most situations. Current reasoners are prevented from classifying ontologies in many cases. With hybrid reasoning, an ontology's classification is not limited to one particular reasoner; instead the task is given to a combination of different reasoners. This strategy is adaptable to different language features, and in many cases the overall performance is enhanced compared to a single reasoner.
References 1. Baader, F.: The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge Univ. Pr., Cambridge (2010) 2. Glimm, B., Horrocks, I., Motic, B.: Optimized Description Logic Reasoning via Core Blocking. Automated Reasoning, 457–471 (2010) 3. Kazakov, Y.: Consequence-Driven Reasoning for Horn SHIQ Ontologies. In: Proc. of IJCAI 2009, pp. 2040–2045 (2009)
4. Motik, B., Studer, R.: KAON2CA Scalable Reasoning Tool for the Semantic Web. In: Proceedings of the 2nd European Semantic Web Conference (ESWC 2005), Heraklion, Greece (2005) 5. Shearer, R., Motik, B., Horrocks, I.: HermiT: a Highly-Efficient OWL Reasoner. In: Proceedings of the 5th International Workshop on OWL: Experiences and Directions (OWLED 2008), pp. 26–27, Citeseer (2008) 6. Sirin, E., Parsia, B., Grau, B.C., Kalyanpur, A., Katz, Y.: Pellet: A Practical OWL-DL Reasoner. Web Semantics: Science, Services and Agents on the World Wide Web 5(2), 51–53 (2007)
Subspace Mapping of Noisy Text Documents Axel J. Soto1 , Marc Strickert2 , Gustavo E. Vazquez3 , and Evangelos Milios1 1
Faculty of Computer Science, Dalhousie University, Canada [email protected] 2 Institute for Vision and Graphics, Siegen University, Germany 3 Dept. Computer Science, Univ. Nacional del Sur, Argentina
Abstract. Subspace mapping methods aim at projecting high-dimensional data into a subspace where a specific objective function is optimized. Such dimension reduction allows the removal of collinear and irrelevant variables for creating informative visualizations and task-related data spaces. These specific and generally de-noised subspaces enable machine learning methods to work more efficiently. We present a new and general subspace mapping method, Correlative Matrix Mapping (CMM), and evaluate its abilities for category-driven text organization by assessing neighborhood preservation, class coherence, and classification. This approach is evaluated for the challenging task of processing short and noisy documents. Keywords: Subspace Mapping, Compressed Document Representation.
1 Introduction
Many data-oriented areas of science drive the need for faithfully representing data containing thousands of variables. Therefore, methods for considerably reducing the number of variables are desired, focusing on subsets that are minimally redundant and maximally task-relevant. Different approaches for subspace mapping, manifold learning, and dimensionality reduction (DR) were proposed earlier [1,2]. A current challenge in information representation is the huge amount of text documents being produced at increasing rates. Using the well-known vector space representation, or "bag of words" model, a corpus of documents is described by the set of words that each document contains. This approach yields a document-term matrix containing thousands of unique terms, which is thus very likely to be sparse. The text mining communities have developed methods for automatic clustering and classification of document topics using specific metrics and kernels. Yet fully developed human-in-the-loop approaches that enable the user to perform visual data exploration and visual data mining are rare. While the automatic learning of data is crucial, visualization is another key aspect for providing an intuitive interface to the contained information and for interactive tuning of the
data/text mining algorithms. This makes DR methods indispensable for interactive text corpus exploration. In this paper, we present an application of a recent DR method, Correlative Matrix Mapping (CMM), which has been successfully applied in other domains [3]1 in the context of regression problems. This method is based on an adaptive matrix metric aiming at a maximum correlation of all pairwise distances in the generated subspace and the associated target distances. Preliminary work [4]1 showed some capabilities of this approach for the expert-guided visualization of labeled text corpora by integrating user feedback on the base of the interpretable low-dimensional mapped document space. Here, we provide a comprehensive comparison of CMM and other competitive DR methods for creating representative low-dimensional subspaces. Since machine learning methods rely on distance calculations, we investigate how such projections with label-driven distance metrics can improve representations of short and noisy text documents. We refer to noisy documents as the ones that are not properly written in terms of spelling and grammatical structure. Such documents are quite common in business environments such as aircraft maintenance records, online help desk or customer survey applications, and their analysis is thus highly relevant. Still, much work in the information extraction literature is focused on well-formed text documents.
2 Correlative Matrix Mapping (CMM)
Given n m-dimensional data vectors x^j ∈ X ⊂ R^m, 1 ≤ j ≤ n, such that each x^j is associated with a q-dimensional vector l^j ∈ L ⊂ R^q. For text corpora, n is the number of labeled documents in the corpus, m is the number of terms in the corpus, and l^j is the vector representation of the label of document x^j. CMM aims at finding a subspace of X where the pairwise distances D^λ_X are in maximum correlation with those on the label space (D_L). Thus, pairwise distances in the document-term space are sought to be in maximum correlation with the corresponding distances in the label space. Here, D_L is the Euclidean distance on the label space, and the λ superscript in D^λ_X indicates the parameters of the adaptive distance

(D^λ_X)_{i,j} = ((x^i − x^j)^T · λ · λ^T · (x^i − x^j))^{1/2},

where λ is an m × u matrix and u is specified by the user. This adaptive distance resembles a Mahalanobis distance, where Λ = λ · λ^T has a rank of u. We obtain the parameter matrix as

λ* = arg max_λ r(D_L, D^λ_X)    (1)
where r is the Pearson correlation. Locally optimal solutions for (1) can be obtained by gradient methods using its derivative with respect to λ [3]. It is worth noting that while the number of rows of the λ matrix is constrained by the number of terms in X, i.e., the document vector dimensionality, the number of columns u, i.e., the dimensionality of the subspace, is defined by the user.
CMM was called differently in previous works but created naming conflicts therein.
Note that λ^T · X defines a u-dimensional subspace that is an informative representation of the input space focused on its label association. If visualization is the ultimate goal, a choice of u ≤ 3 is recommended. New documents with unknown labels can also be projected into the new space by using the optimized λ matrix. An open source package with the implementation of CMM is available at [5].
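A minimal NumPy sketch of the objective in Eq. (1) is given below; the data, labels, and the candidate λ are invented, and the gradient-based optimization of the paper is not reproduced, only the evaluation of the correlation objective and the projection λ^T x.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
X = rng.random((20, 50))                 # 20 toy documents over 50 terms
L = np.eye(4)[rng.integers(0, 4, 20)]    # one-hot label vectors for 4 classes

def cmm_objective(lam):
    # distances in the u-dimensional subspace equal the adaptive distances D^lambda_X
    return pearsonr(pdist(X @ lam), pdist(L))[0]

lam = rng.random((50, 2))                # candidate m x u matrix, u = 2
print("objective r:", cmm_objective(lam))
print("projected shape:", (X @ lam).shape)   # lambda^T x for each document -> (20, 2)
```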
3 Experiments
We selected four alternative DR methods that make use of label information and allow an exact out-of-sample extension, and we compared them to CMM. Linear Discriminant Analysis (LDA) aims at finding optimal discriminant directions by maximizing the ratio of the between-class variance to the within-class variance [7]. Since its solutions require inverses of covariance matrices, it usually has ill-conditioning problems for high-dimensional data. Therefore, we also calculate a simplification of this approach based on the diagonals of the covariance matrices, referred to as LDAd. Canonical Correlation Analysis (CCA) is a well-known technique for finding correlations between two sets of multidimensional variables by projecting them onto two lower-dimensional spaces in which they are maximally correlated [8]. Although this method is strongly related to CMM in the sense that both look for optimal correlations, CMM does not adapt the data and label spaces, but adapts distances in the data space. Neighborhood Component Analysis (NCA) aims at learning a linear transformation of the input space such that the k-Nearest Neighbors method performs well in the transformed space [9]. The method uses a probability function to estimate the probability pi,j that a data point i selects a data point j as its neighbor after data mapping. The method maximizes the expected number of data points correctly classified under the current transformation matrix. Maximally Collapsing Metric Learning (MCML) aims at learning a linear mapping where all points in the same class are mapped to a single location, while all points in other classes are mapped to other locations, i.e., as far as possible from data points of different classes [10]. This algorithm uses a probabilistic selection rule as in NCA. However, unlike NCA, the optimization problem is convex, and thus the MCML transformation can be completely specified from the objective function. The Matlab Toolbox for Dimensionality Reduction [2] was used for all methods except CCA, which was taken from the Statistics Toolbox [6].
3.1 Data
We used the publicly available Aviation Safety Reporting System (ASRS) Database Report Set [11] and extracted the narrative fields of documents belonging to 4 out of 24 topics: Bird or animal strike records, Emergency medical service incidents, Fuel management issues, and Inflight weather encounters. Each topic has 50 documents, thus providing a total of 200 documents. 6048 rare terms were discarded, yielding 1829 unique terms. Two major challenges are faced. First, the
average length of each document is only a few sentences, which makes it difficult to extract statistically significant terms. Second, the texts are riddled with acronyms, ad hoc abbreviations and misspellings. Binary representations are used for the document-term matrix, i.e., the component corresponding to the k-th term of the j-th document x^j is 0 if the term is not present and 1 otherwise. This binary weighting approach is appropriate given the short length of the documents, for which the frequency of a term might inflate its importance. In the case of CCA and CMM, the four label vectors (0,0,0,1), (0,0,1,0), (0,1,0,0), and (1,0,0,0) are used for class representation, thus inducing equidistant classes. In LDA, NCA and MCML, integer values are used for class assignment, because these methods do not quantify label dissimilarities. For each experiment, 80% of the corpus was used for training, while the remaining documents were held out for testing. This process was restarted 10 times, so that a new testing set was obtained in each iteration, implementing a repeated random sub-sampling validation scheme. All the applied DR algorithms showed convergence during the optimization phase, with the exception of MCML which, despite its convex cost function, required a time-limiting stopping criterion because of its excessive run time. Since NCA and CMM use iterative methods for optimization, different early stopping criteria were sought using a portion of the training set. Otherwise, these methods are likely to overfit the training data. Since meaningful visualization is desirable for many tasks, all our experiments were deliberately constrained to 2- and 3-dimensional subspaces.
3.2 Assessing Subspace Mapping Performance
We divide the assessments applied to the methods into three types. The first aims at evaluating the embedding without using label information. Two performance metrics are used: the area under the extrusion/intrusion tendency curve (B) and neighborhood ranking preservation (Q) [12]. B quantifies the tendency to commit systematic neighborhood rank order errors for data pairs in the projection space (B is not bounded; the closer to zero, the better), while Q measures k-ary neighborhood preservation (Q varies between 0 and 1; the closer to one, the better). The second class of assessments considers label information, namely cohesion, which is the ratio of the pairwise Euclidean distances of documents belonging to the same class to the pairwise distances of documents of different classes. The third class of assessments also uses label information. It evaluates the potential of supervised learning methods to exploit the given low-dimensional space for classification. It may be argued that the better the classification accuracy is, the better the projection is. Classifying from a low-dimensional space may produce better results due to the removal of collinear or irrelevant variables. To this end, we used k-nearest neighbors (kNN), Decision Trees (DT), and Support Vector Machines (SVM) using a Radial Basis Function kernel (rbf) and a multi-layer perceptron kernel (mlp).
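The cohesion measure, under one reading of the definition above (ratio of mean within-class to mean between-class pairwise distance, lower being better), can be computed as in the following sketch; the four 2-D points are invented.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def cohesion(points, labels):
    d = squareform(pdist(points))
    same = np.equal.outer(labels, labels)
    off_diag = ~np.eye(len(labels), dtype=bool)
    within = d[same & off_diag].mean()     # distances between same-class documents
    between = d[~same].mean()              # distances between different-class documents
    return within / between

pts = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 0.9]])
print(cohesion(pts, np.array([0, 0, 1, 1])))
```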
4 Results
We will focus on the results obtained on the testing set, while the training set results are still available for the reader. Table 1 shows the average of the computed metrics of the different DR methods when they are projected into a 2D space. It can be observed that most methods have an intrusive embedding, i.e., a tendency to positive rank errors in the subspace. Not surprisingly, NCA has the highest average preservation of the k-ary neighborhoods, since this is what its mapping is trying to capture. Yet the difference with CMM is not statistically significant when a Dunnett test [13] is performed with a 1% familywise probability error. Although CMM does not have the lowest cohesion value, due to the variance of this metric no significant difference can be drawn here. Looking at the performance of the classification methods, CMM significantly outperforms all the other methods, with the exception of NCA with the kNN method. Nevertheless, no significant difference was found between CMM and NCA with kNN. Results for the 3D projection show a behavior very similar to that shown for the projections into the 2D space (Table 2). We can see that LDAd has a good classification accuracy when DT are used, although no significant differences with CMM and NCA were found. We can also see that CMM improved on most of the metrics. In summary, LDA and CCA had poor performance on most metrics. These methods compute their optimal value in closed form (using eigenvectors or inverses of covariance matrices), and thus the computation might get corrupted due to the large number of variables and the relatively small number of documents. LDAd performs better than LDA. However, most of the components of the parameter matrix in LDAd are zero. This yields a cluttered projection of the data points onto a few locations, which is not convenient in most cases. The remarkably poor performance of MCML might be due to an underfitting situation. It is worth noting that MCML is by far the most compute-intensive method and its calculations last more than 50 times the amount of time spent on any other algorithm. Moreover, delaying its stopping criterion does not seem to dramatically improve its performance. Finally, it is important to note that

Table 1. Comparison of DR methods using 2D spaces: rank-based quality measures (Q/B), cohesion, and classification accuracies of four classifiers. LDA LDAd Train Test Train Test
0.534 0.563 0.515 0.015 0.049 0.007
Cohesion 0,113 0,324 0,118 kNN SVMrbf SVMmlp DT
0.842 0.238 0.237 0.884
0.250 0.258 0.230 0.268
0.626 0.238 0.249 0.652
NCA Train Test
MCML Train Test
CCA CMM Train Test Train Test
0.503 0.599 0.613 0.533 0.548 0.521 0.539 0.544 0.014 0.039 0.046 -0.058 -0.040 0.009 0.025 0.042
0.596 0.033
0,134 0,158
0,210
0.590 0.258 0.230 0.605
0,197
0,259
0,305 0,000 0,312 0,058
0.950 0.703 0.714 0.363 0.349 0.358 0.955 0.608
0.731 0.474 0.348 0.785
0.373 0.375 0.300 0.360
0.988 0.738 0.738 0.988
0.358 0.310 0.378 0.335
0.986 0.685 0.984 0.638 0.824 0.605 0.991 0.683
Table 2. Comparison of DR methods using 3D spaces: rank-based quality measures (Q/B), cohesion and classification accuracies of four classifiers. LDA LDAd Train Test Train Test
NCA Train Test
MCML Train Test
CCA CMM Train Test Train Test
0.537 0.573 0.536 0.527 0.608 0.618 0.540 0.559 0.517 0.548 0.560 0.617 0.017 0.046 0.014 0.019 0.049 0.043 -0.071 -0.043 -0.004 0.026 0.079 0.040
Q B
Cohesion 0,085 0,324 0,149 0,173 0,157 0,202
0,253
0,298
0,006 0,314 0,081 0,213
0.750 0.310 0.328 0.653
0.773 0.755 0.411 0.828
0.450 0.405 0.305 0.415
0.991 0.811 0.833 0.991
kNN SVMrbf SVMmlp DT
0.872 0.238 0.237 0.928
0.268 0.258 0.230 0.263
0.716 0.233 0.246 0.761
0.673 0.263 0.248 0.723
0.976 0.622 0.363 0.972
0.415 0.265 0.325 0.343
0.990 0.974 0.971 0.990
0.743 0.650 0.658 0.695
CMM, on either the 2D or 3D projection, is the most stable method, since it obtains the first or second best values for all the metrics. More specifically, CMM is the only method that has a consistent classification accuracy when SVM is used. Additional results that were not included here for reasons of space can be found in [14].
5 Conclusions
Subspace mapping allows visualization of high-dimensional spaces in an informative plotting space suitable for visual data mining methods. Additionally, projections into low-dimensional spaces allow a reduction of the storage of data points and lead to improved prediction capacity of a subsequently applied supervised method. We emphasize the advantages of applying linear subspace transformations, since they provide a simple interpretation of the new space. Moreover, they guarantee exact out-of-sample extensions. Methods that rely on the calculation of eigenvectors may not be the best option when the input data dimensionality is considerably high. This paper described the applicability of different DR methods to short and noisy text documents. This is the first work where CMM is compared against other well-established DR methods. From the results shown in Section 4 we can state that our proposed method CMM represents a competitive subspace mapping method, with the advantage of more stable behavior than the other methods tested in this work. NCA was its closest competitor, especially for Q and kNN. As future work, we plan to extend this development by considering a semi-supervised scenario. In this case the system can automatically classify documents while, at the same time, the user provides feedback about reclassifying a document or indicating the irrelevance of a term. Moreover, the system should adapt its behavior based on the user feedback and correct future actions. We thank NSERC, PGI-UNS (24/ZN16), the DFG Graduate School 1564, and MINCyT-BMBF (AL0811 - ARG 08/016) for their financial support.
References 1. Zhang, J., Huang, H., Wang, J.: Manifold Learning for Visualizing and Analyzing High-Dimensional Data. IEEE Intel. Syst. 25, 54–61 (2010) 2. van der Maaten, L., Postma, E., van den Herik, J.: Dimensionality Reduction: A Comparative Review. Tilburg University, TiCC TR 2009–005 (2009) 3. Strickert, M., Soto, A.J., Vazquez, G.E.: Adaptive Matrix Distances Aiming at Optimum Regression Subspaces. In: European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning - ESANN 2010, pp. 93–98 (2010) 4. Soto, A.J., Strickert, M., Vazquez, G.E., Milios, E.: Adaptive Visualization of Text Documents Incorporating Domain Knowledge. In: Challenges of Data Visualization, NIPS 2010 Workshop (2010) 5. Machine Learning Open Source Software, http://mloss.org 6. Matlab Statistics Toolbox, http://www.mathworks.com/products/statistics/ 7. McLachlan, G.: Discriminant Analysis and Statistical Pattern Recognition. WileyInterscience, Hoboken (2004) 8. Hardoon, D.R., Szedmak, S.R., Shawe-Taylor, J.R.: Canonical Correlation Analysis: An Overview with Application to Learning Methods. Neural Comput. 16, 2639–2664 (2004) 9. Goldberger, J., Roweis, S., Hinton, G., Salakhutdinov, R.: Neighborhood Components Analysis. Adv. Neural Inf. Process. Syst. 17, 513–520 (2005) 10. Globerson, A., Roweis, S.: Metric Learning by Collapsing Classes. Adv. Neural Inf. Process. Syst. 18, 451–458 (2006) 11. Aviation Safety Reporting System, http://asrs.arc.nasa.gov/ 12. Lee, J.A., Verleysen, M.: Quality Assessment of Dimensionality Reduction: RankBased Criteria. Neurocomputing 72, 1431–1443 (2009) 13. Dunnet, C.W.: A Multiple Comparisons Procedure for Comparing Several Treatments with a Control. J. Am. Stat. Assoc. 50, 1096–1121 (1955) 14. Soto, A.J., Strickert, M., Vazquez, G.E., Milios, E.: Technical Report, Dalhousie University (in preparation), http://www.cs.dal.ca/research/techreports
Extending AdaBoost to Iteratively Vary Its Base Classifiers
Érico N. de Souza and Stan Matwin
School of Information Technology and Engineering, University of Ottawa, Ottawa, ON, K1N 6N5, Canada
[email protected] [email protected]
Abstract. This paper introduces AdaBoost Dynamic, an extension of the AdaBoost.M1 algorithm by Freund and Schapire. In this extension we use different "weak" classifiers in subsequent iterations of the algorithm, instead of AdaBoost's fixed base classifier. The algorithm is tested with various datasets from the UCI database, and results show that it performs as well as AdaBoost run with the best possible base learner for a given dataset. This result therefore relieves a machine learning analyst from having to decide which base classifier to use.
1 Introduction
In [2, 3], Freund and Schapire introduced AdaBoost, a classifier induction method that converts a "weak" PAC learner with slightly better performance than a random classifier into a stronger, high-accuracy algorithm. The final model is the weighted sum of each "weak" classifier applied to the dataset. Freund and Schapire use only one "weak" PAC learner in the AdaBoost algorithm, and in this paper we extend this definition to allow different "weak" learners to be used in different iterations. A similar approach is presented by Rodríguez et al. [6], who discuss a supervised classification method for time series. AdaBoost tries to improve the quality of the learner iteratively, considering only part of the distribution, i.e., in each iteration it calculates a new weight for the current distribution and applies the same weak learner. This is a good approach, but we can try to improve the solution by applying one weak learner in a certain iteration and a different one in the next, because, depending on the type of data, it is possible that a certain distribution is better fitted by a different base learner. [2-4] present solutions considering only one weak classifier for all iterations, and this motivated our research to verify whether different classifiers executed in each iteration give an improvement. This paper presents a new algorithm, called AdaBoost Dynamic, that uses different standard weak learners - such as
The author is also affiliated with the Institute of Computer Science, Polish Academy of Sciences, Poland. The authors acknowledge the support of NSERC and MITACS for this research.
decision trees, neural networks, Bayesian networks, etc. - applied to different datasets from UCI [1]. The idea is to relieve the machine learning analyst from the choice among the possible base learners, letting the system iteratively and automatically define the best model for the data. This paper is organized as follows: Section 2 discusses the AdaBoost.M1 algorithm as well as the proposed modifications. Section 3 presents the experimental results comparing AdaBoost Dynamic with AdaBoost.M1 using various “weak” classifiers. Finally, some conclusions are presented in Section 4.
2 Algorithm Modification
The original AdaBoost.M1, proposed in [3], takes a training set of $m$ examples $S = (x_1, y_1), \ldots, (x_m, y_m)$, where $x_i$ is an instance drawn from some space $X$ and represented in some manner (typically, a vector of attribute values), and $y_i \in Y$ is the class label of $x_i$. The second parameter is the WeakLearner algorithm. This algorithm is called in a series of rounds that update a distribution $D_t$. The WeakLearner algorithm is generic and must be chosen by the user, respecting the requirement that it must correctly classify at least $1/2$ of the data set. The original algorithm considers that only one WeakLearner is boosted, and the final output hypothesis is given by $H_{final}(x) = \arg\max_{y \in Y} \sum_{t: h_t(x)=y} \log\frac{1}{\beta_t}$ [3]. In order to use different algorithms, an array of WeakLearners was added as input to the original algorithm, and in each iteration another WeakLearner from the array is executed. The number of WeakLearners in the input array may be the same as the number of iterations, but this is not mandatory. The restriction remains that each WeakLearner in the array must correctly classify at least $1/2$ of the data set. AdaBoost Dynamic is presented in Table 1. It is the AdaBoost.M1 algorithm with the proposed modifications and resembles the original algorithm, except that one of the inputs is a list of WeakLearners and line 3 of the algorithm calls a different WeakLearner in each iteration. In this case, the final output is $H_{final}(x) = \arg\max_{y \in Y} \sum_{t: f_t(h_j(x))=y} \log\frac{1}{\beta_t}$. This means that the new output hypothesis is calculated considering a function $f_t(h_j(x))$ in each iteration, where $h_j(x) \neq h_{j+1}(x)$. The function $f_t$ is defined just to vary the weak learner in each iteration.
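As a hypothetical numerical illustration (ours, not from the paper): if the weak learner chosen at iteration $t$ reaches $\varepsilon_t = 0.3$, then $\beta_t = 0.3/0.7 \approx 0.43$ and its vote carries weight $\log\frac{1}{\beta_t} \approx 0.85$ (natural logarithm), whereas a hypothesis with $\varepsilon_t = 0.45$ would only carry weight $\approx 0.20$; rotating the base learner simply changes which hypothesis competes for that weight in each round.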
3 Results
The algorithm was implemented in Weka [5]. This first implementation used 10 different weak learners, executed in the following order:
– Neural Network;
– Naive Bayes implementation;
– Decision Stumps;
– Bayes Networks;
– Random Tree;
– Random Forest;
– SVM (Support Vector Machine);
– Bagging;
– ZeroR rules;
– Naive Bayes Tree.

Table 1. AdaBoost Dynamic algorithm with the proposed modification

Input: sequence of $m$ examples $(x_1, y_1), \ldots, (x_m, y_m)$ with labels $y_i \in Y = \{1, \ldots, k\}$; list $W$ of the WeakLearner algorithms; integer $T$ specifying the number of iterations
1. Initialize $D_1(i) = \frac{1}{m}$ for all $i$; set $j = 1$
2. Do for $t = 1, \ldots, T$:
3.   Call $W[j]$, providing it with the distribution $D_t$
4.   Get back hypothesis $h_t: X \rightarrow Y$
5.   Calculate the error of $h_t$: $\varepsilon_t = \sum_{i: h_t(x_i) \neq y_i} D_t(i)$
6.   If $\varepsilon_t > \frac{1}{2}$, then abort loop
7.   Set $\beta_t = \frac{\varepsilon_t}{1 - \varepsilon_t}$
8.   Update distribution $D_t$: $D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases}\beta_t & \text{if } h_t(x_i) = y_i\\ 1 & \text{otherwise}\end{cases}$, where $Z_t$ is a normalization constant
9.   If Length($W$) $= j$ then $j = 1$, else $j = j + 1$
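To make the loop of Table 1 concrete, the following minimal sketch (ours, not the authors' Weka implementation) cycles through a list of scikit-learn base learners; the particular learner list, the helper names, and the use of binary/numeric labels are illustrative assumptions.

```python
# Minimal sketch of AdaBoost Dynamic: AdaBoost.M1 with a rotating list of weak learners.
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

def adaboost_dynamic(X, y, weak_learners, T=100):
    m = len(y)
    D = np.full(m, 1.0 / m)                  # step 1: uniform distribution, j = 0-based index
    hypotheses, betas = [], []
    j = 0
    for t in range(T):                       # step 2
        h = clone(weak_learners[j]).fit(X, y, sample_weight=D)   # step 3: call W[j] with D_t
        pred = h.predict(X)                  # step 4: hypothesis h_t
        eps = max(D[pred != y].sum(), 1e-12) # step 5: weighted error (clipped for stability)
        if eps > 0.5:                        # step 6: abort if worse than chance
            break
        beta = eps / (1.0 - eps)             # step 7
        D = D * np.where(pred == y, beta, 1.0)                    # step 8: down-weight correct examples
        D /= D.sum()                         # divide by the normalization constant Z_t
        hypotheses.append(h)
        betas.append(beta)
        j = (j + 1) % len(weak_learners)     # step 9: rotate to the next weak learner
    return hypotheses, betas

def predict(hypotheses, betas, X):
    # H_final(x) = argmax_y sum_{t: h_t(x)=y} log(1/beta_t)
    votes = [dict() for _ in range(len(X))]
    for h, b in zip(hypotheses, betas):
        w = np.log(1.0 / b)
        for i, yhat in enumerate(h.predict(X)):
            votes[i][yhat] = votes[i].get(yhat, 0.0) + w
    return np.array([max(v, key=v.get) for v in votes])

# Example learner list (an assumption, not the ten Weka learners used in the paper):
learners = [LogisticRegression(max_iter=200), GaussianNB(), DecisionTreeClassifier(max_depth=1)]
```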
All weak learner algorithms in the list were used with their default Weka configuration, which makes all experiments repeatable on other datasets. The datasets used in the experiments were collected from the UCI repository [1]. The presented results were obtained with 10-fold cross-validation and a pair-wise t-test at the 95% confidence level. For all experiments, the total number of iterations was 100. 18 datasets were chosen from the UCI repository for this experiment. SVM was not able to classify some of these datasets because of missing data for some attributes. When an algorithm was not able to execute on a dataset, this is indicated with NA (Not Available). Tables 2-4 present the comparison of AdaBoost Dynamic with AdaBoost.M1 using the same weak learners that were used in AdaBoost Dynamic. These tables use the following notation: if an algorithm is statistically better than AdaBoost Dynamic, the symbol ◦ appears at its side; in the opposite situation the symbol • appears. Lack of a symbol indicates that the difference in performance is not significant. Table 2 shows the results of the comparison between AdaBoost Dynamic and AdaBoost.M1 with Random Forest, Bayes Network, Naive Bayes and Decision Stump as base classifiers. The results show that Decision Stump has weak performance in nine of the simulations, and only in two simulations was its performance statistically better than AdaBoost Dynamic; in all other experiments the performance was equal. Considering the other algorithms, AdaBoost with Random Forest and AdaBoost with Naive Bayes had worse performance four times
and better performance two times and one time, respectively. AdaBoost with Bayesian Network had worse performance five times and better performance in two experiments. Table 3 shows the results of comparing AdaBoost Dynamic and AdaBoost.M1 with Random Tree, ZeroR, Naive Bayes Trees (NBTrees) and SVM as base classifiers. In this case, the worst “weak” classifier was ZeroR, which was statistically inferior to AdaBoost Dynamic 11 times - for more than 50% of the datasets. Only once was this classifier better than AdaBoost Dynamic. Another “weak” classifier with a low accuracy level was Random Tree: in eight cases it was statistically inferior to AdaBoost Dynamic, and all other datasets yielded equal performance. NBTrees was statistically superior to AdaBoost Dynamic four times and inferior only two times. SVM was inferior four times and never obtained a better result than AdaBoost Dynamic. Table 4 shows the results comparing AdaBoost Dynamic and AdaBoost.M1 with Bagging and with Neural Network. Bagging had inferior results six times and only once offered a better result. AdaBoost.M1 with Neural Network had only one superior result and in all other experiments performed equally to AdaBoost Dynamic. Using AdaBoost.M1, the analyst has to search for the best “weak” learner to be applied to the dataset, which makes the work hard. AdaBoost Dynamic offers a solution to this problem, because it uses various algorithms, trying to improve the final hypothesis with each one. In this way, this approach leaves the search for the model to the machine, instead of to the machine learning analyst.

Table 2. Results of the T-Test comparing the implemented solution (AdaBoost Dynamic (1)) with AdaBoost.M1 with Random Forest (2), Bayes Network (3), Naive Bayes (4) and Decision Stump (5)

Dataset          | (1)          | (2)           | (3)           | (4)           | (5)
contact-lenses   | 74.17±28.17  | 75.67±28.57   | 74.33±26.74   | 76.17±25.43   | 72.17±27.12
iris             | 95.13± 4.63  | 94.73± 5.04   | 93.73± 5.98   | 95.07± 5.73   | 94.60± 5.33
segment          | 95.81± 2.06  | 96.88± 1.96   | 93.84± 1.93 • | 83.84± 3.85 • | 29.43± 1.08 •
soybean          | 93.45± 2.82  | 92.16± 2.91   | 93.35± 2.65   | 92.05± 3.05   | 27.96± 2.13 •
weather          | 68.00±43.53  | 65.00±41.13   | 63.00±40.59   | 59.00±42.86   | 67.00±40.34
weather.symbolic | 61.00±41.79  | 72.00±38.48   | 71.00±37.05   | 67.00±39.07   | 67.50±39.81
au1-balanced     | 76.95± 3.86  | 80.71± 3.19 ◦ | 73.15± 4.23   | 76.16± 3.91   | 77.90± 3.44
au1              | 72.25± 3.79  | 73.96± 3.61   | 74.02± 0.86   | 72.60± 1.99   | 72.95± 2.46
CTG              | 100.00± 0.00 | 99.89± 0.25   | 99.86± 0.27   | 99.35± 0.53 • | 45.30± 0.27 •
BreastTissue     | 64.28±11.46  | 71.14±11.75   | 65.17±12.54   | 67.75±13.69   | 40.65± 4.97 •
crx              | 81.42± 4.49  | 84.96± 3.87 ◦ | 86.28± 3.77 ◦ | 81.06± 4.14   | 86.17± 4.22 ◦
car              | 99.14± 0.92  | 93.87± 2.02 • | 90.60± 2.34 • | 90.25± 2.48 • | 70.02± 0.16 •
cmc              | 54.08± 3.81  | 50.83± 3.69 • | 50.15± 3.85 • | 49.04± 3.98 • | 42.70± 0.25 •
glass            | 96.31± 3.95  | 97.84± 3.22   | 97.54± 3.27   | 93.37± 5.63   | 67.82± 2.50 •
zoo              | 95.66± 5.83  | 89.92± 8.49   | 96.05± 5.61   | 96.95± 4.75   | 60.43± 3.06 •
blood            | 77.93± 3.56  | 73.13± 4.11 • | 75.01± 4.09 • | 77.01± 3.07   | 78.84± 3.40
balance-scale    | 89.71± 4.72  | 74.74± 4.85 • | 74.44± 6.86 • | 92.13± 2.90   | 71.77± 4.24 •
post-operative   | 56.11±13.25  | 58.33±10.98   | 66.44± 8.79 ◦ | 66.89± 8.05 ◦ | 67.11± 8.19 ◦
◦, • statistically significant improvement or degradation
Table 3. Results of the T-Test comparing the implemented solution (AdaBoost Dynamic (1)) with AdaBoost.M1 with Random Tree (2), ZeroR (3), NBTrees (4) and SVM (5). The values with NA are due to the fact that SVM does not process datasets with missing values.

Dataset          | (1)          | (2)           | (3)           | (4)           | (5)
balance-scale    | 89.71± 4.72  | 78.09± 3.88 • | 45.76± 0.53 • | 80.78± 4.59 • | 91.47± 3.48
blood            | 77.93± 3.56  | 72.92± 4.11 • | 76.21± 0.41   | 77.61± 3.68   | 71.95± 3.64 •
cmc              | 54.08± 3.81  | 49.38± 4.09 • | 42.70± 0.25 • | 51.90± 4.52   | 54.82± 4.13
glass            | 96.31± 3.95  | 92.53± 7.64   | 35.51± 2.08 • | 95.72± 5.02   | 98.22± 2.89
post-operative   | 56.11±13.25  | 59.78±12.61   | 70.00± 5.12 ◦ | 56.56±14.05   | NA
zoo              | 95.66± 5.83  | 60.75±20.49 • | 40.61± 2.92 • | 95.84± 5.97   | 60.24±12.40 •
au1              | 72.25± 3.79  | 68.50± 4.24 • | 74.10± 0.30   | 76.30± 3.92 ◦ | 71.96± 3.96
au1-balanced     | 76.95± 3.86  | 73.65± 4.35   | 58.86± 0.26 • | 80.19± 3.15 ◦ | 76.75± 3.67
BreastTissue     | 64.28±11.46  | 66.91±12.70   | 19.01± 1.42 • | 68.67±11.97   | 21.96± 5.10 •
car              | 99.14± 0.92  | 84.90± 3.35 • | 70.02± 0.16 • | 98.58± 0.85   | 99.39± 0.68
contact-lenses   | 74.17±28.17  | 75.50±29.91   | 64.33±23.69   | 76.17±25.43   | 79.50±24.60
iris             | 95.13± 4.63  | 93.53± 5.48   | 33.33± 0.00 • | 94.33± 5.47   | 96.53± 4.29
weather          | 68.00±43.53  | 61.50±43.14   | 70.00±33.33   | 71.00±38.39   | 52.50±40.44
weather.symbolic | 61.00±41.79  | 70.50±38.99   | 70.00±33.33   | 67.00±39.07   | 61.50±41.96
crx              | 81.42± 4.49  | 84.23± 5.20   | 55.51± 0.67 • | 86.30± 3.96 ◦ | NA
CTG              | 100.00± 0.00 | 97.37± 1.76 • | 27.23± 0.16 • | 98.13± 1.40 • | NA
segment          | 96.59± 1.58  | 94.49± 1.95 • | 15.73± 0.33 • | 98.19± 1.13 ◦ | 56.22± 4.19 •
◦, • statistically significant improvement or degradation
Table 4. Results of the T-Test comparing the implemented solution (AdaBoost Dynamic (1)) with AdaBoost.M1 with Bagging (2) and AdaBoost.M1 with Neural Network (3)

Dataset          | (1)          | (2)           | (3)
au1              | 72.25± 3.79  | 73.56± 3.32   | 73.22± 3.84
au1-balanced     | 76.95± 3.86  | 79.49± 2.81   | NA
BreastTissue     | 64.28±11.46  | 70.51±13.11   | 65.31±11.38
soybean          | 93.45± 2.82  | 88.05± 3.18 • | 93.35± 2.68
balance-scale    | 89.71± 4.72  | 77.25± 4.38 • | 93.06± 3.13 ◦
blood            | 77.93± 3.56  | 73.04± 4.16 • | 78.47± 2.85
cmc              | 54.08± 3.81  | 51.21± 3.63 • | 54.33± 3.90
glass            | 96.31± 3.95  | 97.75± 3.23   | 95.89± 4.00
post-operative   | 56.11±13.25  | 57.22±12.27   | 57.22±12.96
zoo              | 95.66± 5.83  | 42.59± 4.93 • | 95.66± 5.83
crx              | 81.42± 4.49  | 83.36± 5.01   | 83.14± 4.18
CTG              | 100.00± 0.00 | 99.99± 0.08   | 100.00± 0.00
segment          | 96.59± 1.58  | 98.33± 0.98 ◦ | 96.45± 1.59
car              | 99.14± 0.92  | 97.11± 1.29 • | 99.40± 0.65
contact-lenses   | 74.17±28.17  | 74.17±28.17   | 74.17±28.17
iris             | 95.13± 4.63  | 94.13± 5.21   | 96.20± 4.37
weather          | 68.00±43.53  | 68.00±40.53   | 64.00±44.43
weather.symbolic | 61.00±41.79  | 69.50±38.86   | 61.00±41.79
◦, • statistically significant improvement or degradation
4 Conclusion
This work introduced a small modification to AdaBoost.M1, developed by Freund and Schapire. The modification addresses the fact that the original approach boosts only one “weak” learner. This work proposes that we can use different “weak” learners across iterations, allowing a weak learner different from the one of the previous iteration to be used at each round.
The authors have made available an extended version of this work at http://www.site.uottawa.ca/~edeso096/AI_2011_extended.pdf with a proof that, even using different “weak” learners in each iteration, the algorithm has the same upper bound as the AdaBoost.M1 approach. This is because the same assumptions and requirements made for AdaBoost.M1 are kept in place for AdaBoost Dynamic. Experimental results suggest that, for a large majority of the datasets, the performance of AdaBoost Dynamic is as good as that of AdaBoost.M1 with the best single weak learner. AdaBoost Dynamic can therefore be used as a default algorithm that provides a benchmark to which other weak learners can be compared. AdaBoost Dynamic can be used successfully when the analyst is not sure which base learner is best used with AdaBoost for a particular dataset. One possible improvement to AdaBoost Dynamic is to implement a technique that checks which is the best “weak” classifier for a given distribution in a given iteration. This would allow the algorithm to assign the best weight to a distribution and produce a better final hypothesis. Another line of work is to investigate whether the order of execution of the weak learners has an influence on the final hypothesis. A further improvement is to change the error test in line 6 of Table 1. The test only verifies whether the hypothesis error is greater than 0.5, and then exits the loop. It is possible instead to try different classifiers within the same iteration, checking whether any of them has hypothesis error smaller than 0.5. We are working on this modification.
References
1. Frank, A., Asuncion, A.: UCI machine learning repository (2010)
2. Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: Saitta, L. (ed.) Proceedings of the 13th International Conference on Machine Learning, pp. 148–156 (1996)
3. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), 119–139 (1997)
4. Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting. Annals of Statistics 28(2), 337–407 (2000)
5. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explorations 11(1) (2009)
6. Rodríguez, J.J., Alonso, C.J., Boström, H.: Boosting interval based literals. Intell. Data Anal. 5, 245–262 (2001)
Parallelizing a Convergent Approximate Inference Method
Ming Su (Department of Electrical Engineering) and Elizabeth Thompson (Department of Statistics), University of Washington
{mingsu,eathomp}@u.washington.edu
Abstract. The ability to perform probabilistic inference tasks efficiently is critical to large scale applications in statistics and artificial intelligence. Dramatic speedup might be achieved by appropriately mapping current inference algorithms to a parallel framework. Parallel exact inference methods still suffer from exponential complexity in the worst case. Approximate inference methods have been parallelized and good speedup has been achieved. In this paper, we focus on a variant of the Belief Propagation algorithm. This variant has better convergence properties and is provably convergent under certain conditions. We show that this method is amenable to coarse-grained parallelization and propose techniques to parallelize it optimally without sacrificing convergence. Experiments on a shared memory system demonstrate that near-ideal speedup is achieved with reasonable scalability. Keywords: Graphical Model, Approximate Inference, Parallel Algorithm.
1 Introduction
The ability to perform probabilistic inference tasks efficiently is critical to large scale applications in statistics and artificial intelligence. In particular, such problems arise in the analysis of genetic data on large and complex pedigrees [1] or data at large numbers of markers across the genome [2]. The ever-evolving parallel computing technology suggests that dramatic speed-up might be achieved by appropriately mapping the existing sequential inference algorithms to a parallel framework. Exact inference methods, such as variable elimination (VE) and the junction tree algorithm, have been parallelized and reasonable speedup achieved [3–7]. However, the complexity of exact inference methods for a graphical model is exponential in the tree-width of the graph. For graphs with large tree-width, approximate methods are necessary. While it has been demonstrated empirically that loopy and generalized BP work extremely well in many applications [8], Yedidia et al. [9] have shown that these methods are not guaranteed to converge for loopy graphs. Recently a promising parallel approximate inference method was presented by Gonzalez et al. [10], where loopy Belief Propagation (BP)
was optimally parallelized, but without a guarantee of convergence. The UPS algorithm [11] has gained popularity due to its reasonably good performance and ease of implementation [12, 13]. More importantly, the convex relaxation method, which incorporates UPS as a special case, is guaranteed to converge under mild conditions [14]. In this paper, we develop an effective parallel generalized inference method with special attention to the UPS algorithm. Even though the generalized inference method possesses a structural parallelism that is straightforward to extract, problems of imbalanced load and excessive communication overhead can result from ineffective task partitioning and sequencing. We focus on solving these two problems and on demonstrating the performance of efficiently parallelized algorithms on large scale problems using a shared memory system.
2 Convex Relaxation Method and Subproblem Construction
The convex relaxation method relies on the notion of region graphs to facilitate the Bethe approximation. In the Bethe approximation, one minimizes the Bethe free energy function and uses its solution to obtain an estimate of the partition function and of the true marginal distributions [14]. The Bethe free energy is a function of terms known as the pseudo-marginals. Definitions and examples of the Bethe approximation, Bethe region graphs and pseudo-marginals can be found in [9, 15]. The UPS algorithm and the convex relaxation method are based on the fact that if the graphical model admits a tree-structured Bethe region graph, the associated Bethe approximation is exact [9, 15]; that is, minimization of the Bethe free energy is a convex optimization problem. We obtain a convex subproblem by fixing the pseudo-marginals associated with a selected subset of inner regions to a constant vector. The convex relaxation method works by first finding a sequence of such convex subproblems and then repeatedly solving them until convergence. Graphically, the subproblems are defined over a sequence of tree-structured subgraphs. Simple schemes for finding these subgraphs in grid graphs are proposed in [11]. However, these schemes are not optimal and cannot be extended to general graphs. We present a hypergraph spanning tree algorithm that is more effective and is applicable to general graphs. With the hypergraph representation, the problem of finding these subgraphs, which otherwise requires ad hoc treatment in bipartite region graphs, becomes well-defined. The definitions of hypergraphs, hyperedges, hypergraph spanning trees and hyperforests can be found in [16]. In the hypergraph representation, nodes and hyperedges correspond to outer regions and inner regions, respectively. Specifically, an inner region can be regarded as a set whose elements are the adjacent outer regions. In the Greedy Sequencing procedure developed by [14], all outer regions are included in each subproblem. The sequence of tree-structured subgraphs corresponds to a sequence of spanning hypertrees. In general, a spanning tree in a hypergraph may not exist, and even the determination of its existence is strongly NP-complete [16].
Fig. 1. (a) MapReduce flowchart for a sequence of size 2; (b) Coarsening by contracting edges 3, 4 and 5
We develop a heuristic, hyperspan, by extending Kruskal's minimum spanning tree algorithm for ordinary graphs. We apply hyperspan repeatedly to obtain a sequence of spanning hyperforests. In this context, the convergence criterion of [14] translates into the condition that every hyperedge has to appear in at least one spanning forest. The Greedy Sequencing procedure guarantees that, in the worst case, the convergence criterion is still satisfied. Interestingly, for a grid graph model of arbitrary size, the greedy sequencing procedure returns a sequence of size two, which is optimal.
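The details of hyperspan are not spelled out in this short paper; the sketch below is one simple Kruskal-style greedy (our illustration, not the authors' code) under the assumption that a hyperedge may be accepted only when all of its nodes currently lie in distinct components, which keeps the growing structure acyclic. Hyperedges not yet covered by earlier forests in the sequence can be passed in as the priority list.

```python
# Greedy Kruskal-style construction of one spanning hyperforest (illustrative sketch only;
# the paper's hyperspan heuristic and its tie-breaking rules may differ).
class UnionFind:
    def __init__(self, nodes):
        self.parent = {v: v for v in nodes}
    def find(self, v):
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]  # path halving
            v = self.parent[v]
        return v
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def spanning_hyperforest(nodes, hyperedges, priority=()):
    """hyperedges: dict edge_id -> set of nodes (inner region -> adjacent outer regions);
    priority: edge_ids to try first, e.g. hyperedges not yet used in previous forests."""
    uf = UnionFind(nodes)
    accepted = []
    order = list(priority) + [e for e in hyperedges if e not in set(priority)]
    for e in order:
        members = list(hyperedges[e])
        roots = {uf.find(v) for v in members}
        if len(roots) == len(members):      # all nodes in distinct components: no cycle created
            accepted.append(e)
            for v in members[1:]:
                uf.union(members[0], v)
    return accepted
```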
3 Parallel and Distributed Inference
In the greedy sequencing procedure, if a subproblem is defined on a forest rather than on a tree, we can run Iterative Scaling (IS) on the disconnected components independently and consequently in parallel. This suggests a natural way of extracting coarse-grained parallelism uniformly across the sequence of subproblems. The basic idea is to partition the hypertree or even the hyperforest into a prescribed number, t, of components and assign the computation associated with each component to a separate processing unit. No communication cost is incurred among the independent computation tasks. This maps to a coarse-grained MapReduce framework [17], as shown in Figure 1(a). Note that synchronization, accomplished by software barriers, is still required at the end of each inner iteration. In this paper, we focus only on mapping the algorithm to a shared memory system. Task partitioning is performed using the multilevel hypergraph partitioning program hMETIS [18]. Compared to alternative programs, it has a much shorter solution time and, more importantly, it produces balanced partitions with significantly fewer cut edges. The convergence criterion states that every hyperedge has to appear in at least one spanning forest [14]. This means no hyperedge is allowed to always be a cut edge. A simple technique, edge contraction, prevents a hyperedge from being a cut edge. When a hyperedge is contracted, it is replaced by a super node containing this edge and all nodes that are adjacent to this edge. All other edges that were previously adjacent to any of these nodes become
adjacent to the super node (Figure 1(b)). After we partition once, we can contract a subset of the cut edges, resulting in a coarsened hypergraph; repartitioning this coarsened hypergraph will not place any cut on the contracted edges. Near optimal speedup is only achieved when we have perfect load balancing. Knowing that IS solution time is proportional to the number of nodes, we perform weighted partitionings. The weight of a regular node is 1. For a super node, the weight is the number of contained regular nodes. Reasonable load balance is achieved through weighted partitioning when the average interaction between adjacent random variables is not too high. For high interaction, partitioning-based static load balancing (SLB) performs poorly. In Section 4, we show this effect and propose some techniques to accommodate it. We adopted the common multithreading scheme where, in general, n threads are created on an n-core system and each thread is assigned to a separate core. Thread synchronization ensures that all subproblems converge. We use non-blocking send and blocking receive because they are more efficient for this implementation. For efficiency, pseudo-marginals are sent and received in one package rather than individually. Sender and receiver, respectively, use a predefined protocol to pack and unpack the aggregate into individual pseudo-marginal messages. Our experimental environment is a shared memory 8-core system with 2 Intel Xeon Quad Core E5410 2.33 GHz processors running Debian Linux. We implemented the algorithms in the Java programming language using MPJ Express, an open source Java message passing interface (MPI) library that allows application developers to write and execute parallel applications for multicore processors and computer clusters/clouds.
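As an aside on the edge-contraction step described above, the following minimal sketch contracts one hyperedge into a weighted super node on a toy hypergraph (our Python illustration; the naming scheme and data layout are assumptions, and the actual implementation is in Java on top of hMETIS partitions).

```python
# Contract one hyperedge: replace its adjacent nodes by a single super node whose weight
# is the number of regular nodes it contains; every other hyperedge touching those nodes
# is rewired to the super node. Illustrative sketch, not the paper's code.
def contract_hyperedge(hyperedges, node_weights, edge_id):
    merged = set(hyperedges.pop(edge_id))           # nodes adjacent to the contracted edge
    super_node = "super_" + str(edge_id)            # hypothetical naming scheme
    node_weights[super_node] = sum(node_weights.pop(v) for v in merged)
    for e, nodes in hyperedges.items():
        if nodes & merged:                          # edge touched one of the merged nodes
            hyperedges[e] = (nodes - merged) | {super_node}
    return super_node

# Toy example (not the hypergraph of Figure 1): contract edges 3, 4 and 5 in turn.
hyperedges = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}, 4: {"d", "e"}, 5: {"e", "f"}, 6: {"f", "a"}}
node_weights = {v: 1 for v in "abcdef"}
for e in (3, 4, 5):
    if e in hyperedges:
        contract_hyperedge(hyperedges, node_weights, e)
```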
4 Experiments and Results
The selected class of test problems are 100 × 100 Ising models, with joint distribution $P(x) \propto \exp\big(\sum_{i \in V} \alpha_i x_i + \sum_{(i,j) \in E} \beta_{ij} x_i x_j\big)$, where $V$ and $E$ are the nodes and edges of the graph. The $\alpha_i$'s are uniformly drawn from $[-1, 1]$ and the $\beta_{ij}$'s are uniformly drawn from $[-\beta, \beta]$. When $\beta > 1$, loopy BP fails to converge even for small graphs. Due to synchronization, the slowest task determines the overall performance. The SLB introduced in Section 3 performs worse as $\beta$ increases. In practice, we apply two runtime techniques to mitigate the problem. First, a dynamic load balancing (DLB) scheme is developed. Instead of partitioning the graph into n components and distributing them to n threads, we partition the graph into more components and put them into a task pool. At runtime, each thread fetches a task from the pool once it finishes its current task. The use of each core is maximized and the length of the bottleneck task is shortened. The second technique is bottleneck task early termination (ET). A thread is terminated when all other threads become idle and no task is left in the pool. However, terminating a task prematurely has two undesirable effects. First, it breaks the convergence requirement. Second, it may change the convergence rate. In order to ensure convergence, we can occasionally switch back to non-ET mode, especially when oscillation of messages is detected.
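A minimal sketch of the DLB scheme (a shared task pool drained by worker threads) is given below; this is our illustration, not the authors' MPJ Express code, and `run_iterative_scaling` is a hypothetical stand-in for solving one tree-structured component.

```python
# Dynamic load balancing sketch: partition into more components than cores and let each
# worker pull the next component as soon as it finishes its current one.
import queue
import threading

def run_iterative_scaling(component):
    pass  # placeholder: run IS on this component until it converges

def solve_subproblem(components, n_workers=8):
    pool = queue.Queue()
    for c in components:
        pool.put(c)

    def worker():
        while True:
            try:
                c = pool.get_nowait()
            except queue.Empty:
                return                      # nothing left: this worker becomes idle
            run_iterative_scaling(c)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                            # software barrier at the end of the inner iteration
    # Early termination (ET) would additionally signal the last still-running task once the
    # pool is empty and every other worker is idle; omitted here for brevity.
```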
Fig. 2. (a) Load balance: DLB & ET vs. SLB. Normalized load (w.r.t. the largest) shown for each core. 3 cases listed: 2 cores (upper left), 4 cores (upper right) and 8 cores (bottom). (b) Speedup: DLB & ET vs. SLB.
With β = 1.1, we randomly generated 100 problems. The number of cores ranges from 2 up to 8 to demonstrate both raw speedup and scalability. Speedup is defined as the ratio between sequential and parallel elapsed time. At this interaction level, the sequential run time exceeds 1 minute, justifying parallelization, and SLB starts performing poorly. Figure 2(a) shows that with SLB, poor balance results irrespective of the number of cores used. This is dramatically mitigated by DLB and ET. Notice that almost perfect balance is achieved for a small number of cores (2, 4), but with 8 cores the load is less balanced. The average speedup over the 100 problems is shown in Figure 2(b), both for SLB and for DLB and ET. DLB and ET universally improved the speedup, and the improvement became more prominent as the number of cores increased. With DLB and ET, the speedup approaches the ideal case until the number of cores reaches 6. We attribute this drop-in-speedup trend to two factors. First, as shown in Figure 2(a), even with DLB and ET, load becomes less balanced as the number of cores increases. Second, there is an increased level of resource contention in terms of memory bandwidth. The BP algorithm frequently accesses memory. As more tasks run in parallel, the number of concurrent memory accesses also increases.
5 Discussion
In this paper, we proposed a heuristic for subproblem construction. This heuristic has been shown to be effective and is provably optimal for grid graphs. Thorough testing on a complete set of benchmark networks will be important for evaluating the performance of the heuristic. Our parallel implementation is at the algorithmic level, which means that it can be combined with lower level parallelization techniques proposed by other researchers. Experiments on a shared memory system exhibit near-ideal speedup with reasonable scalability. Further exploration is necessary to demonstrate that the speedup scales up in practice on large distributed memory systems, such as clusters.
Acknowledgments. This work is supported by NIH grant HG004175.
References 1. Cannings, C., Thompson, E.A., Skolnick, M.H.: Probability functions on complex pedigrees. Advances in Applied Probability 10, 26–61 (1978) 2. Abecasis, G.R., Cherny, S.S., Cookson, W.O., Cardon, L.R.: Merlin – rapid analysis of dense genetic maps using sparse gene flow trees. Nature Genetics 30, 97–101 (2002) 3. Shachter, R.D., Andersen, S.K.: Global Conditioning for Probabilistic Inference in Belief Networks. In: UAI (1994) 4. Pennock, D.: Logarithmic Time Parallel Bayesian Inference. In: UAI, pp. 431–443 (1998) 5. Kozlov, A., Singh, J.: A Parallel Lauritzen-Spiegelhalter Algorithm for Probabilistic Inference. In: Proceedings of the 1994 Conference on Supercomputing, pp. 320– 329 (1994) 6. Namasivayam, V.K., Pathak, A., Prasanna, V.K.: Scalable Parallel Implementation of Bayesian Network to Junction Tree Conversion for Exact Inference. In: 18th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2006), pp. 167–176 (2006) 7. Xia, Y., Prasanna, V.K.: Parallel exact inference on the cell broadband engine processor. In: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pp. 1–12 (2008) 8. Botetz, B.: Efficient belief propagation for vision using linear constraint nodes. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2007) 9. Yedidia, J.S., Freeman, W.T., Weiss, Y.: Generalized belief propagation. In: NIPS, pp. 689–695. MIT Press, Cambridge (2000) 10. Gonzalez, J., Low, Y., Guestrin, C., O’Hallaron, D.: Distributed Parallel Inference on Large Factor Graphs. In: UAI (2009b) 11. Teh, Y.W., Welling, M.: The unified propagation and scaling algorithm. In: NIPS, pp. 953–960 (2001) 12. Carbonetto, P., de Freitas, N., Barnard, K.: A statistical model for general contextual object recognition. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 350–362. Springer, Heidelberg (2004) 13. Xie, Z., Gao, J., Wu, X.: Regional category parsing in undirected graphical models. Pattern Recognition Letters 30(14), 1264–1272 (2009) 14. Su, M.: On the Convergence of Convex Relaxation Method and Distributed Optimization of Bethe Free Energy. In: Proceedings of the 11th International Symposium on Artificial Intelligence and Mathematics (ISAIM), Fort Lauderdale, Florida (2010) 15. Heskes, T.: Stable fixed points of loopy belief propagation are local minima of the Bethe free energy. In: NIPS, pp. 343–350 (2002) 16. Tomescu, I., Zimand, M.: Minimum spanning hypertrees. Discrete Applied Mathematics 54, 67–76 (1994) 17. Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters. In: Proceedings of the Sixth Symposium on Operating System Design and Implementation, San Francisco, CA (2004) 18. Karypis, G., Kumar, V.: hMETIS: A Hypergraph Partitioning Package (1998), http://glaros.dtc.umn.edu/gkhome/fetch/sw/hmetis/manual.pdf
Reducing Position-Sensitive Subset Ranking to Classification
Zhengya Sun (1), Wei Jin (2), and Jue Wang (1)
(1) Institute of Automation, Chinese Academy of Sciences  (2) Department of Computer Science, North Dakota State University
Abstract. A widespread idea for attacking ranking is to reduce it to a set of binary preferences and apply well studied classification techniques. The basic question addressed in this paper is whether an accurate classifier transfers directly into a good ranker. In particular, we explore this reduction for subset ranking, which is based on optimization of the DCG metric (Discounted Cumulative Gain), a standard position-sensitive performance measure. We propose a consistent reduction framework, guaranteeing that the minimal DCG regret is achievable by learning pairwise preferences assigned with importance weights. This fact allows us to further develop a novel upper bound on the DCG regret in terms of pairwise regrets. Empirical studies on benchmark datasets validate the proposed reduction approach with improved performance.
1 Introduction
Supervised rank learning tasks often boil down to the problem of ordering a finite subset of instances in an observable feature space. This task is referred to as subset rank learning [8]. One straightforward and widely-known solution for subset ranking has been based on a reduction to binary classification tasks considering all pairwise preferences on the subset. Numerous ranking algorithms fall within the scope of this approach, i.e., building ranking models by running classification algorithms on binary preference problems [5–7, 9, 10, 12]. Ranking models are often evaluated by position-sensitive performance measures [15], which assign each rank position a discount factor to emphasize the quality near the top. This presumable difference raises the question of whether an accurate classifier transfers directly into a good ranker. Applications of the aforementioned algorithms seem to support a positive answer. In this paper, we attempt to provide a theoretical support for this phenomenon based on well-established regret transform principles [4, 14], a mainstay of reduction analysis. Roughly speaking, regret here describes the gap between the incurred loss and the minimal loss. Relevant work has shown that the ranking problem can be solved robustly and efficiently with binary classification techniques. The proved regret bounds for those reductions, however, mostly focus on measures that are not position-sensitive. For example, Balcan et al. [2] proved that the regret of ranking, as measured by the Area Under the ROC Curve (AUC), is at most twice as much as that of the induced binary classification. Ailon and Mohri [1] described a
randomized reduction which guarantees that the pairwise misranking regret is not more than the binary classification regret. These inspiring results lead one to seek regret guarantees for ranking under position-sensitive criteria, which have gained enormous popularity in practice, such as Discounted Cumulative Gain (DCG) [11]. Although [18] demonstrates, to some extent, the usefulness of position-sensitive ranking using importance weighted classification techniques, there is a lack of theoretical analysis of the principles for a successful reduction. The following critical questions remain unexplored:
• guarantees that the reduction is consistent, in the sense that given optimal (zero-regret) binary classifiers, the reduction can yield an optimal ranker, such that the expected position-sensitive performance measure is maximized;
• regret bounds which demonstrate that a decrease of the classification regret provides a reasonable approximation for a decrease of the ranking regret of interest.
Our current study aims at addressing these problems. Although the first aspect has been analogously pointed out in [8], there has been no comprehensive theoretical analysis to our knowledge. We characterize the DCG metric by a combination of 'relevance gain' and 'position discount', and prove that, under suitable assumptions, a sufficient condition for consistent reduction is given by learning pairwise preferences assigned with importance weights according to relevance gains. In particular, we derive an importance weighted loss function for the reduced binary problems that exhibits good properties in preserving an optimal ranking. Such properties provide reassurance that optimizing the resulting binary loss in expectation does not hinder the search for a zero-regret ranker, and allow such a search to proceed within the scope of off-the-shelf classification algorithms. Subsequently, we quantify, for a reduction with the consistency guarantee, at most how much the classification regret can be transferred into the position-sensitive ranking regret. Our regret analysis is based on rank-adjacent transposition strategies, which are first used to convert the DCG regret into multiple pairwise regrets. This, coupled with the majorization inequality proved by Balcan et al. (2007), allows us to obtain an upper bound in terms of the sum of the importance weighted classification regrets over the induced binary problems. This bound is scaled by a position-discount factor, i.e., 2 times the maximum deviation between adjacent position discounts (< 1). This constant does not depend on how many instances are ranked, and can be regarded as an improvement over that of subset ranking using the regression approach [8]. Our results reveal the underlying connection between position-sensitive ranking and binary classification, namely that improving classification accuracy can reasonably be expected to enhance position-sensitive ranking performance. This paper is organized as follows. Section 2 formulates the subset ranking problem and analyzes its optimal behavior. Section 3 presents pairwise classification formulations and describes a generic framework for subset ranking. Section 4 is devoted to the proof of our main results, and Section 5 presents empirical evidence on benchmark datasets. The conclusions are drawn in Section 6.
2 Subset Ranking Problem
We consider the subset ranking problem described as follows. Provided with labeled subsets, the ranker learns to predict a mapping from a finite subset to an ordering over the instances in it. Each labeled subset is assumed to be generated as $S = \{(x_i, y_i)\}_{i=1}^{n} \subseteq \mathcal{X} \times \mathcal{Y}$, where $x_i$ is an instance in some feature space $\mathcal{X}$, and the associated relevance label $y_i$ belongs to the set $\mathcal{Y} = \{0, \ldots, l-1\}$, with $l-1$ representing the highest relevance and 0 the lowest.
2.1 Notation
We denote the finite subset as $X = \{x_i\}_{i=1}^{n} \in \mathcal{U}$, where $\mathcal{U}$ is the set of all finite subsets of $\mathcal{X}$, and the associated relevance label set as $Y = \{y_i\}_{i=1}^{n}$. For simplicity, the size of the subset $n$ remains fixed throughout our analysis. We represent the ordering as a permutation $\pi$ on $[n] = \{1, \ldots, n\}$, using $\pi(i)$ to denote the ranked position given to the instance $x_i$, and $\pi^{-1}(j)$ to denote the index of the instance ranked at the $j$th position. The set of all possible permutations is denoted as $\Omega$. For the sake of brevity, we define an instance assignment vector $x = [x_{\pi^{-1}(1)}, \ldots, x_{\pi^{-1}(n)}]$ and a relevance assignment vector $y = [y_{\pi^{-1}(1)}, \ldots, y_{\pi^{-1}(n)}]$ according to $\pi$, where $x_i = x_{\pi^{-1}(i)}$ represents the instance ranked at the $i$th position, and $y_i = y_{\pi^{-1}(i)}$ represents the relevance label assigned to the instance ranked at the $i$th position.
2.2 Discounted Cumulative Gain (DCG)
Based on the perfect ordering $\bar\pi$, which is in non-increasing order of the relevance labels, we evaluate the quality of an estimated ordering with $DCG(\pi, Y)$. Unlike other ranking measures such as AUC, DCG not only assesses each instance $i$ by a relevance gain $g(y_i, Y)$, but also discriminates each position $\pi(i)$ by a discount factor $d_{\pi(i)}$, allowing the evaluation to concentrate on the top rank positions [11]. Let $g(y_i, Y) = 2^{y_i} - 1$ and $d_{\pi(i)} = \frac{1}{\log_2(1 + \pi(i))}$; then we have
$$DCG(\pi, Y) = \sum_{i=1}^{n} g(y_i, Y) \cdot d_{\pi(i)}.$$
The discount factor defined above is positive and strictly decreasing, i.e., $\forall\, \pi(i) < \pi(j)$, $d_{\pi(i)} > d_{\pi(j)} > 0$. When only the top $k$ ($k < n$) instances need to be ranked correctly, $d_{\pi(i)}$ is set to zero for $\pi(i) > k$.
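For concreteness, a direct transcription of the definition above into code (our sketch; the relevance labels in the example are arbitrary illustrative values):

```python
# DCG with gain g(y) = 2**y - 1 and discount d_r = 1 / log2(1 + r) for rank position r;
# positions beyond a cutoff k receive discount 0, as described above.
import math

def dcg(relevances_in_ranked_order, k=None):
    total = 0.0
    for pos, y in enumerate(relevances_in_ranked_order, start=1):
        if k is not None and pos > k:
            break
        total += (2 ** y - 1) / math.log2(1 + pos)
    return total

# e.g. a subset ranked as [2, 0, 1] versus the perfect ordering [2, 1, 0]:
# dcg([2, 0, 1]) = 3.5 and dcg([2, 1, 0]) ≈ 3.63.
```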
2.3 Ranking Formulations
In the standard supervised learning setup, the ranking problem that we are investigating can be defined as follows.
Definition 1. (position-sensitive subset ranking) Assume that each labeled subset $S = \{(x_i, y_i)\}_{i=1}^{n} \subseteq \mathcal{X} \times \mathcal{Y}$ is generated independently at random according to some (unknown) underlying distribution $\mathcal{D}$. The ranker works with a ranking function space $\mathcal{H} = \{h : \mathcal{U} \rightarrow \Omega\}$ which maps a set $X \in \mathcal{U}$ to a permutation $\pi$, namely $h(X) = \pi$. The position-sensitive ranking loss of the predictor $h$ on a labeled subset $S$ is defined as
$$l_{rank}(h, S) = DCG(\bar\pi, Y) - DCG(h(X), Y).$$
The learning goal is to find a predictor $h$ so that the expected position-sensitive ranking loss with respect to $\mathcal{D}$, given by
$$L_{rank}(h, \mathcal{D}) = E_{S \sim \mathcal{D}}\, l_{rank}(h, S) = E_X\, L_{rank}(h, X), \qquad (1)$$
is as small as possible, where
$$L_{rank}(h, X) = E_{Y|X}\, l_{rank}(h, S). \qquad (2)$$
The loss $l_{rank}$ quantifies our intuitive notion of 'how far the predicted permutation is from the perfect permutation' based on the DCG metric. The loss attains its minimum of zero when the subset $X$ is ranked in non-increasing order of the relevance labels in $Y$, and its maximum when it is ranked in non-decreasing order. To characterize the optimal ranking rule with the minimum loss in (1), it is reasonable to analyze its conditional formulation $L_{rank}(h, X)$ as a starting point.
Lemma 1. Given a set $X \in \mathcal{U}$, we define the optimal subset ranking function $\hat h$ as a minimizer of the conditional expectation in (2). Let $\hat\pi = \hat h(X)$ be the permutation output by $\hat h$. Then for any $d_{\hat\pi(i)} > d_{\hat\pi(j)}$, $i, j \in [n]$, it holds that
$$E_{(y_i, y_j, Y)|(x_i, x_j, X)}\big(g(y_i, Y) - g(y_j, Y)\big) \ge 0.$$
Lemma 1 explicitly states that, given $X$, the optimal subset ranking is in non-increasing order of the relevance gain functions in expectation. The proof of Lemma 1 is straightforward and omitted due to space limitations.
3 Reductions to Binary Classification
In this section, we turn to the reduction method, which decomposes subset ranking problems into importance weighted binary classification problems considering all weighted pairwise preferences between two instances.
3.1 Classification Formulations
In importance weighted binary classification, each instance-label pair is supplied with a non-negative weight which specifies the importance of predicting this instance correctly [3]. The corresponding formulation [3, 14] can be naturally extended to learning pairwise preferences, and is defined as follows.
Procedure 1. Binary Train (labeled set S of size n, binary classification learning algorithm A)
Set $T = \emptyset$.
for all ordered pairs $(i, j)$ with $i, j \in [n]$, $i \neq j$:
    Set $w_{ij}(X, Y) = |g(y_i, Y) - g(y_j, Y)|$.
    Add to $T$ an importance weighted example $((x_i, x_j, X),\, I(y_i > y_j),\, w_{ij}(X, Y))$.
end for
Return $c = A(T)$.
Procedure 2. Rank Predict (instance set X, binary classifier c)
for each $x_i \in X$:
    $f(x_i, X) = \frac{1}{2} \sum_{j \neq i} \big(c(x_i, x_j, X) - c(x_j, x_i, X) + 1\big)$, where $x_j \in X$.
end for
Sort $X$ in non-increasing order of $f(x_i, X)$.
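A minimal sketch of Procedures 1 and 2 on top of an off-the-shelf binary classifier is given below (our illustration, not the authors' implementation). The feature representation of a pair, the choice of base classifier, and in particular the fact that the classifier ignores the rest of the subset $X$ (unlike the $c(x_i, x_j, X)$ above) are simplifying assumptions.

```python
# Sketch of Binary_Train / Rank_Predict with importance weights w_ij = |g(y_i) - g(y_j)|,
# g(y) = 2**y - 1. A pair is represented by a simple feature difference (an assumption).
import numpy as np
from sklearn.linear_model import LogisticRegression

def gain(y):
    return 2.0 ** y - 1.0

def binary_train(X, y):
    feats, labels, weights = [], [], []
    n = len(y)
    for i in range(n):
        for j in range(n):
            if i == j or y[i] == y[j]:
                continue                      # zero-weight pairs contribute nothing to the loss
            feats.append(X[i] - X[j])         # assumed pair representation
            labels.append(int(y[i] > y[j]))
            weights.append(abs(gain(y[i]) - gain(y[j])))
    clf = LogisticRegression(max_iter=500)
    clf.fit(np.array(feats), np.array(labels), sample_weight=np.array(weights))
    return clf

def rank_predict(clf, X):
    n = len(X)
    c = lambda a, b: int(clf.predict((X[a] - X[b]).reshape(1, -1))[0])
    f = [0.5 * sum(c(i, j) - c(j, i) + 1 for j in range(n) if j != i) for i in range(n)]
    return np.argsort(-np.array(f))           # indices sorted by non-increasing degree f
```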
Definition 2. (importance weighted binary classification for pairwise preferences) Assume that each triple $t_{ij} = ((x_i, x_j), I(y_i > y_j), w_{ij}) \in (\mathcal{X} \times \mathcal{X}) \times \{0, 1\} \times [0, +\infty)$ is generated at random according to some (unknown) underlying distribution $\mathcal{P}$, where $I(\cdot)$ is 1 when the argument is true and 0 otherwise, and $[0, +\infty)$ indicates the importance of the correct classification. The classifier works with a preference function space $\mathcal{C} = \{c : \mathcal{X} \times \mathcal{X} \rightarrow \{0, 1\}\}$ which maps an ordered pair $(x_i, x_j) \in \mathcal{X} \times \mathcal{X}$ to a binary relation. The importance weighted classification loss of the predictor $c$ on a triple $t_{ij}$ is defined as
$$l_{class}(c, t_{ij}) = \frac{1}{2}\, w_{ij} \cdot I(y_i > y_j) \cdot \big(1 - c(x_i, x_j) + c(x_j, x_i)\big). \qquad (3)$$
The learning goal is to find a predictor $c$ such that the expected importance weighted classification loss with respect to $\mathcal{P}$, given by
$$L_{class}(c, \mathcal{P}) = E_{t_{ij} \sim \mathcal{P}}\, l_{class}(c, t_{ij}), \qquad (4)$$
is as small as possible. When learning pairwise preferences, the binary classifier $c$ decides for each ordered pair $(x_i, x_j)$ whether $x_i$ or $x_j$ is preferred. A perfect prediction preserves the target preference between the two alternatives, i.e., $y_i > y_j \Leftrightarrow c(x_i, x_j) - c(x_j, x_i) = 1$, and a non-zero loss is incurred otherwise. When $w_{ij} = 1$, the expected loss $L_{class}$ is simply the probability that discordant pairs occur, assuming that ties are broken at random.
3.2 Ranking a Subset with Binary Classifiers
We introduce a general framework for ranking a subset with a binary classifier, which unifies a large family of pairwise ranking algorithms such as Ranking SVMs [10], RankBoost [9] and RankNet[5]. This framework is composed of two procedures as described below.
The training procedure (Binary Train) takes a set $S$ of labeled instances in $\mathcal{X} \times \{0, \ldots, l-1\}$ and transforms every pair of labeled instances into two binary classification examples, each of which is augmented with a non-negative weight. By running a binary learning algorithm $A$ on the transformed example set $T$, a classifier of the form $c : \mathcal{X} \times \mathcal{X} \times \mathcal{U} \rightarrow \{0, 1\}$ is obtained, where $x_i, x_j \in X$. We then define the induced distribution $\tilde{\mathcal{D}}$ on the binary classifier $c$. To generate a sample from this distribution, we first draw a random labeled set $S$ from the original distribution $\mathcal{D}$, and subsequently draw uniformly from $S$ an ordered pair $(i, j)$ which is translated into $((x_i, x_j, X), I(y_i > y_j), w_{ij}(X, Y))$. We define the importance weight function $w_{ij}(X, Y)$ by
$$w_{ij}(X, Y) = |g(y_i, Y) - g(y_j, Y)|. \qquad (5)$$
Intuitively, the larger the difference between the relevance gains associated with two different examples, the more important it is to predict the preference between them correctly. In theory, this choice of weights enjoys sound regret properties, which will be investigated in the next section. The test procedure (Rank Predict) assigns a preference degree to each instance $x_i$ according to the degree function $f(x_i, X)$, which increases by 1 if $x_i$ is strictly preferred to $x_j$, i.e., $c(x_i, x_j, X) - c(x_j, x_i, X) = 1$, and by $\frac{1}{2}$ if $x_i$ is regarded as equally good as $x_j$, i.e., $c(x_i, x_j, X) - c(x_j, x_i, X) = 0$. The instances are then sorted in non-increasing order of the preference degrees.
4 Regret Analysis
We now apply the well-established regret transform principle to analyze the reduction from subset ranking to binary classification. We first prove a guarantee on the consistency of a reduction when zero regret is attained. Then we provide novel regret bounds when non-zero regret is attained.
4.1 Consistency of Reduction Methods
We shall rewrite (4) by replacing the original distribution $\mathcal{P}$ with the induced distribution $\tilde{\mathcal{D}}$ due to the reduction:
$$L_{class}(c, \tilde{\mathcal{D}}) = \frac{1}{Z}\, E_{S \sim \mathcal{D}} \sum_{(i,j)} l_{class}(c, t_{ij}, S) = E_X\, L_{class}(c, X), \qquad (6)$$
where $Z = E_{S \sim \mathcal{D}} \sum_{(i,j)} w_{ij} \cdot I(y_i > y_j)$ is the normalization constant, and
$$L_{class}(c, X) = \frac{1}{Z} \sum_{(i,j)} E_{Y|X}\, l_{class}(c, t_{ij}, S) = \frac{1}{Z} \sum_{i,j} E_{(y_i, y_j, Y)|(x_i, x_j, X)} \big(l_{class}(c, t_{ij}, S) + l_{class}(c, t_{ji}, S)\big). \qquad (7)$$
Lemma 2. Given a set $X \in \mathcal{U}$, define the optimal subset preference function $\hat c \in \mathcal{C}$ as a minimizer of (7). Let the importance weights be defined as in (5). Then for $\hat c(x_i, x_j, X) - \hat c(x_j, x_i, X) = 1$, it holds that
$$E_{(y_i, y_j, Y)|(x_i, x_j, X)}\big(g(y_i, Y) - g(y_j, Y)\big) \ge 0.$$
Proof. Note that (7) attains its minimum when each conditional expectation term in the summation achieves its minimum. Substituting (3) and (5) into (7), we have
$$\frac{1}{2} \cdot E_{(y_i, y_j, Y)|(x_i, x_j, X)} \big[ w_{ij} \cdot I(y_i > y_j) \cdot (1 - \hat c(x_i, x_j, X) + \hat c(x_j, x_i, X)) + w_{ij} \cdot I(y_j > y_i) \cdot (1 - \hat c(x_j, x_i, X) + \hat c(x_i, x_j, X)) \big]$$
$$= \frac{1}{2} \cdot E_{(y_i, y_j, Y)|(x_i, x_j, X)} \big[ (\hat c(x_j, x_i, X) - \hat c(x_i, x_j, X)) \cdot (I(y_i > y_j) + I(y_j > y_i)) \cdot (g(y_i, Y) - g(y_j, Y)) + (I(y_i > y_j) + I(y_j > y_i)) \cdot w_{ij} \big]$$
$$= \frac{1}{2} \cdot \big[ (\hat c(x_j, x_i, X) - \hat c(x_i, x_j, X)) \cdot E_{(y_i, y_j, Y)|(x_i, x_j, X)}(g(y_i, Y) - g(y_j, Y)) + E_{(y_i, y_j, Y)|(x_i, x_j, X)}\, w_{ij} \big].$$
Assume by contradiction that $E_{(y_i, y_j, Y)|(x_i, x_j, X)}(g(y_i, Y) - g(y_j, Y)) < 0$. For any $k, k' \in \{1, \ldots, n\}$, there exists a preference function $c \in \mathcal{C}$ such that $c(x_k, x_{k'}, X) - c(x_{k'}, x_k, X) = \hat c(x_k, x_{k'}, X) - \hat c(x_{k'}, x_k, X)$ when $\{k, k'\} \neq \{i, j\}$, and $c(x_i, x_j, X) - c(x_j, x_i, X) = -1$. Then we get that $L_{class}(c, X) < L_{class}(\hat c, X)$, which stands in contradiction to the subset preference optimality of $\hat c$.
The above lemma, together with the result obtained in Lemma 1, allows us to derive the following statement.
Theorem 1. Consider position-sensitive subset ranking using importance weighted classification. Let the importance weights be defined as in (5). Let Rank Predict($\hat c$) be an ordering induced by the optimal subset preference function $\hat c$ with respect to $X \in \mathcal{U}$. Then it holds that
$$L_{rank}(\text{Rank Predict}(\hat c), X) = L_{rank}(\hat h, X),$$
where $\hat h$ is the optimal subset ranking function that minimizes the conditional expectation in (2) with respect to $X$. The theorem states conditions that lead to a consistent reduction method, in the sense that given an optimal (zero-regret) binary classifier, the reduction can yield a ranker with minimal expected loss conditioned on $X$.
4.2 Regret Bounds
Here, regret quantifies the difference between the achieved loss and the optimal loss in expectation. More precisely, the regret of $h$ on the subset $X$ is
$$R_{rank}(h, X) = L_{rank}(h, X) - L_{rank}(\hat h, X), \qquad (8)$$
where $\hat h$ is the optimal subset ranking function as defined previously.
Similarly, the regret of $c$ on the subset $X$ is
$$R_{class}(c, X) = L_{class}(c, X) - L_{class}(\hat c, X), \qquad (9)$$
where $\hat c$ is the optimal subset preference function as defined previously. Note that $R_{class}(c, X)$ is scaled by a normalization constant which relies on the summation of importance weights for the induced pairwise preferences, while this is not used in $R_{rank}(h, X)$. For fairness and simplicity, we leave out the normalization constant in $R_{class}(c, X)$, and let $\tilde R_{class}(c, X) = Z \cdot R_{class}(c, X)$. We then provide an upper bound that relates the subset ranking regret $R_{rank}(h, X)$ to the cumulative classification regret $\tilde R_{class}(c, X)$. Before continuing, we need to present some auxiliary results for proving the regret bounds.
Definition 3. (proper pairwise regret) Given a set $X$, for any two instances $x_i, x_j \in X$, we denote the pairwise loss of ordering $x_i$ before $x_j$ by $L_{pair}(x_i, x_j, X) = E_{(y_i, y_j, Y)|(x_i, x_j, X)}\, w_{ij}(X, Y) \cdot I(y_j > y_i)$, and denote the associated pairwise regret by $R_{pair}(x_i, x_j, X) = \max\big(0, L_{pair}(x_i, x_j, X) - L_{pair}(x_j, x_i, X)\big)$. If $L_{pair}(x_i, x_j, X) - L_{pair}(x_j, x_i, X) \ge 0$, then $R_{pair}(x_i, x_j, X)$ is called proper.
The above definition is parallel to the proper pairwise regret defined in [2] with respect to the AUC loss function.
Lemma 3. Let the importance weights be defined as in (5). For any $i, j, k \in [n]$, if $R_{pair}(x_i, x_j, X)$ and $R_{pair}(x_j, x_k, X)$ are proper, then $R_{pair}(x_i, x_k, X) = R_{pair}(x_i, x_j, X) + R_{pair}(x_j, x_k, X)$.
The proof of Lemma 3 is straightforward and omitted due to space limitations.
Lemma 4. For any sequence $(a_1, \ldots, a_n)$, let $(a_{(1)}, \ldots, a_{(n)})$ be the sequence sorting the values of $(a_1, \ldots, a_n)$ in non-increasing order. $\forall i \in \mathbb{N}$, let $\binom{i}{2} = \frac{i \cdot (i-1)}{2}$. If $(a_{(1)}, \ldots, a_{(n)})$ is majorized by $(n-1, \ldots, 0)$, then for any $j \in [n-1]$, it holds that
$$\sum_{u=1}^{j} \sum_{v=j+1}^{n} I(a_v \ge a_u) \le 2 \cdot \left( \sum_{v=j+1}^{n} a_v - \binom{n-j}{2} \right).$$
This proof has appeared in [2]. Majorization was originally introduced in [16]: a sequence $(a_1, \ldots, a_n)$ majorizes a sequence $(b_1, \ldots, b_n)$ if and only if $a_1 \ge \ldots \ge a_n$, $b_1 \ge \ldots \ge b_n$, $\sum_{j=1}^{k} a_j \ge \sum_{j=1}^{k} b_j$ when $k < n$, and $\sum_{j=1}^{n} a_j = \sum_{j=1}^{n} b_j$. In what follows, we re-index the instances in $X$ according to $\hat\pi$, i.e., $j = \hat\pi^{-1}(j)$. Taking $\hat\pi$ as the target permutation, any permutation $\pi$ on the same set can be transformed into $\hat\pi$ via successive rank-adjacent transpositions [13]. By flipping one discordant pair with adjacent ranks, we obtain an intermediate
permutation. Let $\pi^{(i)}$ denote the intermediate permutation obtained after $i$ transposition operations. For convenience of modeling, we map each discordant pair in the set $\Gamma = \{(v, u) : u < v,\ \pi(v) < \pi(u)\}$ to the number of adjacent transpositions required to flip it. Specifically, we adopt the transposition strategy of choosing the instance $x_j$ in increasing order of $j$, and transposing the discordant pairs associated with $x_j$. More precisely, let $u^- \in \{1, \ldots, u-1\}$ and $u^+ \in \{u+1, \ldots, n\}$; we have
$$i = \sum_{u^-} \tau_1(u^-, \pi) + \sum_{u^+} I(\pi(u^+) < \pi(u)) \cdot I(\pi(v) \le \pi(u^+)),$$
where $\tau_1(u^-, \pi) = \sum_j I(\pi(j) < \pi(u^-)) \cdot I(u^- < j)$ can be interpreted as the total number of discordant pairs associated with $x_{u^-}$. Equipped with these preparations, we are in a position to prove the upper regret bound for the subset ranking problem:
Theorem 2. Consider position-sensitive subset ranking on $X$ using importance weighted classification. Let the importance weights be defined as in (5). Then for any binary classifier $c$, the following bound holds:
$$R_{rank}(\text{Rank Predict}(c), X) \le 2(d_1 - d_2) \cdot \tilde R_{class}(c, X). \qquad (10)$$
Proof. Fix $c$. Let Rank Predict($c$) $= h$ and Rank Predict($\hat c$) $= \hat h$. By the definition of $R_{rank}(h, X)$, we can rewrite the left-hand side of equation (10) as $R_{rank}(h, X) = E_{Y|X}\big(DCG(\hat h(X), Y) - DCG(h(X), Y)\big)$. We then obtain that
$$R_{rank}(h, X) = E_{Y|X} \sum_{j=1}^{n} d_j \cdot \big(g(\hat y_j, Y) - g(y_j, Y)\big)$$
$$= E_{Y|X} \sum_{(v,u) \in \Gamma} \big(d_{\pi^{(i)}(u)} - d_{\pi^{(i)}(v)}\big) \cdot \big(g(y_u, Y) - g(y_v, Y)\big)$$
$$\le \max_i \big(d_{\pi^{(i)}(u)} - d_{\pi^{(i)}(v)}\big) \sum_{(v,u) \in \Gamma} R_{pair}(x_v, x_u, X)$$
$$\le (d_1 - d_2) \cdot \sum_{(v,u) \in \Gamma} \sum_{j=u}^{v-1} R_{pair}(x_{j+1}, x_j, X)$$
$$= (d_1 - d_2) \cdot \sum_{j=1}^{n-1} |\{u \le j < v : \pi(v) < \pi(u)\}| \cdot R_{pair}(x_{j+1}, x_j, X)$$
$$= (d_1 - d_2) \cdot \sum_{j=1}^{n-1} \sum_{u=1}^{j} \sum_{v=j+1}^{n} I(f(x_v, X) \ge f(x_u, X)) \cdot R_{pair}(x_{j+1}, x_j, X). \qquad (11)$$
The second equality is due to the fact that
$$DCG(\hat\pi, Y) - DCG(\pi, Y) = \sum_{i=1}^{\gamma} \big( DCG(\pi^{(i)}, Y) - DCG(\pi^{(i-1)}, Y) \big),$$
where π (0) = π, and γ = |Γ | denotes the total number of inversions in π (note that π (γ) is equivalent to π ˆ ). The second inequality follows by using the fact that the function (dj − dj+1 ) is monotonically decreasing with j and applying lemma 3 repeatedly. The third equality follows from algebra, and the fourth from the fact that Rank Predict outputs a permutation in non-increasing order of the degree function f . The term on the right-hand side of equation (10) can be written as ˜ class (c, X) R = E(yu ,yv ,Y )|(xu,xv ,X) (lclass (c, tuv ) + lclass (c, tvu ) − lclass (ˆ c, tuv )−lclass(ˆ c, tvu )) u,v
1 = · E(yu ,yv ,Y )|(xu,xv ,X) (I(yu > yv ) + I(yv > yu )) · (g(yu , Y ) − g(yv , Y )) 2 u,v · [(−c(xu , xv , X) + c(xv , xu , X)) + (ˆ c(xu , xv , X) − cˆ(xv , xu , X))] 1 = · [(−c(xu , xv , X) + c(xv , xu , X)) + (ˆ c(xu , xv , X) − cˆ(xv , xu , X))] 2 u,v · (E(yu ,Y )|(xu,X) g(yu , Y ) − E(yv ,Y )|(xv ,X) g(yv , Y )) 1 = · (−c(xu , xv , X) + c(xv , xu , X) + 1) · Rpair (xv , xu , X) 2 u
v−1 1 · (−c(xu , xv , X) + c(xv , xu , X) + 1) · Rpair (xj+1 , xj , X) 2 u
=
n−1 1 · (2 · |{u ≤ j < v : c(xv , xu , X) = 1, c(xu , xv , X) = 0}| 2 j=1
=
=
1 · 2
n−1 j=1
n−1 1 · 2
+ |{u ≤ j < v : c(xv , xu , X) = c(xu , xv , X)}|) · Rpair (xj+1 , xj , X) n c(xv , xu , X) − c(xu , xv , X) + 1 · Rpair (xj+1 , xj , X) 2 u=1 v=j+1
j
n c(xv , xu , X) − c(xu , xv , X) + 1 − 2
j=1 v=j+1 u=v
n−1 1 = · 2 j=1
n
n−j f (xv , X) − 2 v=j+1
n−j 2
· Rpair (xj+1 , xj , X)
· Rpair (xj+1 , xj , X).
(12)
The fourth equality follows from theorem 1 and some algebra. The last equality uses the definition of the degree function f . Comparing (11) and (12), we obtain the desired bound due to lemma 4.
The above theorem derives an upper bound which is up to a constant factor of less than 2 (due to d1 − d2 < 1) on the regret ratio, which extends and improves the previous work in the literature [1, 2, 8, 18]. We will show that the bound is also
406
Z. Sun, W. Jin, and J. Wang
the best possible. Consider a 3-element lower bound example: let the distribution have all its mass on a single 3-element subset X = {(x0 , 0), (x1 , 1), (x2 , 2)}. We have a classifier c such that c(x0 , x1 ) = 0, c(x1 , x0 ) = 1; c(x0 , x2 ) = 0, c(x2 , x0 ) = ˜ class (c, X) is 1, and 1; c(x1 , x2 ) = 1, c(x2 , x1 ) = 1. Then it is easy to check that R the worst case for Rrank (h, X) is 0.74 which is exactly 2 · (d1 − d2 ).
5
Experiments
While the focus of this work is a theoretical investigation on the reduction approach from subset ranking to classification, we have also conducted experiments that study its empirical evidence, in particular, the effect of importance weights our analysis suggests. We used a public benchmark data set called OHSUMED [17] collected from medical publications; it contains altogether 106 queries and 16,140 query-document pairs. For each query-document pair, there are 45 ranking features extracted and a 3-level relevance judgement provided, i.e. definitely relevant, possibly relevant or not relevant. For computational reasons, we employed the well-established classification principle with desirable properties [19] : modified huber loss plus l2 -regularization. We then evaluated the linear form solutions with and without the proposed importance weighting scheme, referred to as IMPairRank, and PairRank respectively. In addition, two Letor baselines which aim at directly optimizing (normalized) DCG were also chosen as comparisons. All the results presented below were averaged over five folds off-the-shelf, each of which consists of training, validation, and test set. The validation set was used to identify the best set of parameters, which was then verified on the test set. Table 1. Test NDCG for different ranking methods
NDCG@1 NDCG@2 NDCG@3 NDCG@7 NDCG@10
IMPairRank 0.5804 0.5151 0.5095 0.4671 0.4568
PairRank 0.5553 0.4981 0.5048 0.4649 0.4512
AdaRank-NDCG 0.5330 0.4922 0.4790 0.4596 0.4496
SmoothRank 0.5576 0.5149 0.4964 0.4667 0.4568
It is interesting to note that IMPairRank achieves better test NDCG results at the given positions. In fact, it has achieved the highest at all top ten positions except two. This means that the proposed weighting scheme makes the traditional pairwise classification a better approximation to the position-sensitive metric, and comparable with state of the art optimized NDCG baselines, which effectively confirms the theory.
6
Conclusion
In this paper, we attempt to provide a theoretical analysis supporting subset ranking using binary classifications, and derive novel instructive conclusions that
Reducing Position-Sensitive Subset Ranking to Classification
407
extend and improve the existing reduction approaches for subset ranking. The potential usefulness of theory is validated through experiments on a benchmark data set for learning to rank. Acknowledgments. This work was supported partially by NNSFC 60921061.
References 1. Ailon, N., Mohri, M.: An efficient reduction of ranking to classification. In: Proc. 21st COLT, pp. 87–98 (2008) 2. Balcan, M.-F., Bansal, N., Beygelzimer, A., Coppersmith, D., Langford, J., Sorkin, G.B.: Robust reductions from ranking to classification. In: Bshouty, N.H., Gentile, C. (eds.) COLT. LNCS (LNAI), vol. 4539, pp. 604–619. Springer, Heidelberg (2007) 3. Beygelzimer, A., Dani, V., Hayes, T., Langford, J., Zadrozny, B.: Error limiting reductions between classification tasks. In: Proc. 22nd ICML, pp. 49–56 (2005) 4. Beygelzimer, A., Langford, J., Ravikumar, P.: Error-correcting tournaments. In: Gavald` a, R., Lugosi, G., Zeugmann, T., Zilles, S. (eds.) ALT 2009. LNCS, vol. 5809, pp. 247–262. Springer, Heidelberg (2009) 5. Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., Hullender, G.: Learning to rank using gradient descent. In: Proc. 22nd ICML, pp. 89–96 (2005) 6. Burges, C.J.C., Ragno, R., Le, Q.V.: Learning to rank with non-smooth cost functions. In: Proc. 19th NIPS, pp. 193–200. MIT Press, Cambridge (2006) 7. Cortes, C., Mohri, M., Rastogi, A.: Magnitude-preserving ranking algorithms. In: Proc. 24th ICML, pp. 169–176 (2007) 8. Cossock, D., Zhang, T.: Statistical analysis of bayes optimal subset ranking. IEEE Transactions on Information Theory 54, 5140–5154 (2008) 9. Freund, Y., Iyer, R., Schapire, R.E., Singer, Y.: An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research 4, 933–969 (2003) 10. Herbrich, R., Graepel, T., Obermayer, K.: Support vector learning for ordinal regression. In: Proc. 9th ICANN, pp. 97–102 (1999) 11. J¨ arvelin, K., Kek¨ al¨ ainen, J.: Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems 20, 422–446 (2002) 12. Joachims, T.: Optimizing search engines using clickthrough data. In: Proc. 8th KDD, pp. 133–142. ACM Press, New York (2002) 13. Kendall, M.G.: A new measure of rank correlation. Biometrika 30, 81–93 (1938) 14. Langford, J., Beygelzimer, A.: Sensitive error correcting output codes. In: Auer, P., Meir, R. (eds.) COLT 2005. LNCS (LNAI), vol. 3559, pp. 158–172. Springer, Heidelberg (2005) 15. Le, Q.V., Smola, A.J.: Direct optimization of ranking measures. In: CoRR, abs/0704.3359 (2007) 16. Marshall, A., Olkin, I.: Inequalities: Theory of majorization and its applications. Mathematics in Science and Engineering, vol. 143 (1979) 17. Asia, M.R.: Letor3.0: benchmark datasets for learning to rank. Microsoft Corporation (2008) 18. Sun, Z.Y., Qin, T., Tao, Q., Wang, J.: Robust sparse rank learning for non-smooth ranking measures. In: Proc. 32nd SIGIR, pp. 259–266 (2009) 19. Zhang, T.: Statistical behavior and consistency of classification methods based on convex risk minimization. The Annuals of Statistics 32, 56–85 (2004)
Intelligent Software Development Environments: Integrating Natural Language Processing with the Eclipse Platform Ren´e Witte, Bahar Sateli, Ninus Khamis, and Juergen Rilling Department of Computer Science and Software Engineering Concordia University, Montr´eal, Canada
Abstract. Software engineers need to be able to create, modify, and analyze knowledge stored in software artifacts. A significant amount of these artifacts contain natural language, like version control commit messages, source code comments, or bug reports. Integrated software development environments (IDEs) are widely used, but they are only concerned with structured software artifacts – they do not offer support for analyzing unstructured natural language and relating this knowledge with the source code. We present an integration of natural language processing capabilities into the Eclipse framework, a widely used software IDE. It allows to execute NLP analysis pipelines through the Semantic Assistants framework, a service-oriented architecture for brokering NLP services based on GATE. We demonstrate a number of semantic analysis services helpful in software engineering tasks, and evaluate one task in detail, the quality analysis of source code comments.
1
Introduction
Software engineering is a knowledge-intensive task. A large amount of that knowledge is embodied in natural language artifacts, like requirements documents, user’s guides, source code comments, or bug reports. While knowledge workers in other domains now routinely make use of natural language processing (NLP) and text mining algorithms, software engineers still have only limited support for dealing with natural language artifacts. Existing software development environments (IDEs) can only handle syntactic aspects (e.g., formatting comments) and some basic forms of analysis (e.g., spell-checking). More sophisticated NLP analysis tasks have been proposed for software engineering, but so far have not been integrated with common software IDEs and therefore not been widely adopted. In this paper, we argue that software engineers can benefit from modern NLP techniques. To be successfully adopted, this NLP must be seamlessly integrated into the software development process, so that it appears alongside other software analysis tasks, like static code analysis or performance profiling. As software engineers are end users, not experts in computational linguistics, NLP services must be presented at a high level of abstraction, without exposing the details of language analysis. We show that this kind of NLP can be brought to software engineers in a generic fashion through a combination of modern software C. Butz and P. Lingras (Eds.): Canadian AI 2011, LNAI 6657, pp. 408–419, 2011. c Springer-Verlag Berlin Heidelberg 2011
Intelligent Software Development Environments: Integrating NLP with Eclipse
409
engineering and semantic computing approaches, in particular service-oriented architectures (SOAs), semantic Web services, and ontology-based user and context models. We implemented a complete environment for embedding NLP into software development that includes a plug-in for the Eclipse1 framework, allowing a software engineer to run any analysis pipeline deployed in GATE [1] through the Semantic Assistants framework [2]. We describe a number of use cases for NLP in software development, including named entity recognition and quality analysis of source code comments. An evaluation with end users shows that these NLP services can support software engineers during the software development process. Our work is significant because it demonstrates, for the first time, how a major software engineering framework can be enhanced with natural language processing capabilities and how a direct integration of NLP analysis with code analysis can provide new levels of support for software development. Our contributions include (1) a ready-to-use, open source plug-in to integrate NLP services into the Eclipse software development environment (IDE); (2) novel NLP services suitable for interactive execution in a software engineering scenario; and (3) an evaluation of a software comment quality assurance service demonstrating the usefulness of NLP services, evaluated against annotations manually created by a large group of software engineering students.
2
Software Engineering Background
From a software engineer’s perspective, natural language documentation contains valuable information of both functional and non-functional requirements, as well as information related to the application domain. This knowledge often is difficult or impossible to extract only from source code [3]. One of our application scenarios is the automation of source code comment quality analysis, which so far has to be performed manually. The motivation for automating this task arises from the ongoing shift in development methodologies from a document-driven (e.g., waterfall model) towards agile development (e.g., Scrum). This paradigm shift leads to situations where the major documentation, such as software requirements specifications or design and implementation decisions, are only available in form of source code comments. Therefore, the quality of this documentation becomes increasingly important for developers attempting to perform the various software engineering and maintenance tasks [4]. Any well-written computer program should contain a sufficient number of comments to permit people to understand it. Without documentation, future developers and maintainers are forced to make dangerous assumptions about the source code, scrutinizing the implementation, or even interrogating the original author if possible [5]. Development programmers should prepare these comments when they are coding and update them as the programs change. There exist different types of guidelines for in-line documentation, often in the form 1
Eclipse, http://www.eclipse.org/
410
R. Witte et al.
of programming standards. However, a quality assurance for these comments, beyond syntactic features, currently has to be performed manually.
3
Design of the NLP/Eclipse Integration
We start the description of our work by discussing the requirements and design decisions for integrating NLP with the Eclipse platform. 3.1
Requirements
Our main goal is to bring NLP to software engineers, by embedding it into a current software development environment used for creating, modifying, and analysing source code artifacts. There are a number of constraints for such an integration: It must be possible to use NLP on existing systems without requiring extensive re-installations or -configurations on the end user’s side; it must be possible to execute NLP services remotely, so that it is not necessary to install heavy-weight NLP tools on every system; the integration of new services must be possible for language engineers without requiring extensive system knowledge; it must be generic, i.e., not tied to a concrete NLP service, so that new services can be offered by the server and dynamically discovered by the end user; and the services must be easy to execute from an end user’s perspective, without requiring knowledge of NLP or semantic technologies. Our solution to these requirements is a separation of concerns, which directly addresses the skill-sets and requirements of computational linguists (developing new NLP analysis pipelines), language engineers (integrating these services), and end users (requesting these services). The Web service infrastructure for brokering NLP services has been previously implemented in the open source Semantic Assistants architecture [2] (Fig 1). Developing new client plug-ins is one of the extension points of the Semantic Assistants architecture, bringing further semantic support to commonly used tools. Here, we chose the Eclipse platform, which is a major software development framework used across a multitude of languages, but the same ideas can be implemented in other IDEs (like NetBeans). 3.2
An Eclipse Plug-in for NLP
Eclipse is a multi-language software development environment, comprising an IDE and an extensible plug-in system. Eclipse is not a monolithic program but rather a small kernel that employs plug-ins in order to provide all of its functionality. The main requirements for an NLP plug-in are: (1) a GUI integration that allows users to enquire about available assistants and (2) execute a desired NLP service on a set of files or even complete projects inside the workspace, without interrupting the user’s task at hand. (3) On each enquiry request, a list of NLP services relevant to the user’s context must be dynamically generated and presented to the user. The user does not need to be concerned about making any changes on the client-side – any new NLP service existing in the
Intelligent Software Development Environments: Integrating NLP with Eclipse Tier 1: Clients
Tier 2: Presentation and Interaction
Tier 3: Analysis and Retrieval
411
Tier 4: Resources
Plugin
Eclipse
NLP Service Connector
Service Invocation
Web Server
Plugin
OpenOffice.org Writer
New application
Client Side Abstraction Layer
Service Information
NLP Subsystem Language Services
Language Service Descriptions
Information Extraction Web/IS Connector
Automatic Summarization External
Web Information System
Question Answering
Presentation
Index Generation
Navigation
Information Retrieval
Annotation
Documents
Indexed Documents
Fig. 1. The Semantic Assistants architecture, brokering NLP pipelines through Web services to connected clients, including the Eclipse client described here
project resources must be automatically discovered through its OWL metadata, maintained by the architecture. Finally, (4) NLP analysis results must be presented in a form that is consistent with the workflow and visualization paradigm in a software IDE; e.g., mapping detected NL ‘defects’ to the corresponding line of code in the editor, similar to code warnings displayed in the same view. 3.3
Converting Source Code into an NLP Corpus
A major software engineering artifact is source code. If we aim to support NLP analysis in the software domain, it must be possible to process source code using standard NLP tools, e.g., in order to analyze comments, identifiers, strings, and other NL components. While it is technically possible to load source code into a standard NLP tool, the unusual distribution of tokens will have a number of side-effects on standard analysis steps, like part-of-speech tagging or sentence splitting. Rather than writing custom NLP tools for the software domain, we propose to convert a source code file into a format amenable for NLP tools. In the following, we focus on Java due to space restrictions, but the same ideas apply to other programming languages as well. To convert Java source code into a standard representation, it is possible to apply a Java fact extraction tool such as JavaML, Japa, or JavaCC and transform the output into the desired format. The tool that provides the most information regarding the constructs found in Javadoc comments [6] is the Javadoc tool. Javadoc’s standard doclet generates API documentation using the HTML format. While this is convenient for human consumption, automated NLP analysis applications require a more structured XML format. When loading HTML documents generated using the standard doclet into an NLP framework (Fig. 2, left), the elements of an HTML tag are interpreted as being entities of an annotation. For example, the Java package (org.argouml.model) is interpreted as being of the type h2. This is because the Javadoc standard doclet extraction tool marked up the package using the tags. As a result, additional processing is required in order to
412
R. Witte et al.
Fig. 2. Javadoc generated documentation loaded within an NLP Framework
identify the entity as being a package. In contrast, an XML document (Fig. 2, right), where the elements of the XML tags coincide with the encapsulated entity, clearly identifies them as being a Package, Class, etc. For transforming the Javadoc output into an XML representation, we designed a doclet capable of generating XML documents. The SSL Javadoc Doclet [7] converts class, instance variable, and method identifiers and Javadoc comments into an XML representation, thereby creating a corpus that NLP services can analyse easier.
4
Implementation
The Semantic Assistants Eclipse plug-in has been implemented as a Java Archive (JAR) file that ships with its own specific implementation and an XML description file that is used to introduce the plug-in to the Eclipse plug-in loader. The plug-in is based on the Model-View-Controller pattern providing a flexibility towards presenting annotations to the user generated from various NLP services. The user interaction is realized through using the Eclipse Standard Widget Toolkit and service invocations are implemented as Eclipse Job instances allowing the asynchronous execution of language services. On each invocation of an NLP service, the plug-in connects to the Semantic Assistants server through the Client-Side Abstraction Layer (CSAL) utility classes. Additional input dialogues are presented to the user to provide NLP service run-time parameters after interpreting the OWL metadata of the selected service. Then, the execution will be instantiated as a job, allowing the underlying operating system to schedule and manage the lifecycle of the job. As the execution of the job is asynchronous and running in the background (if so configured by the user), two Eclipse view parts will be automatically opened to provide real-time logs and the retrieved annotations once NLP analysis is completed. Eventually, after a successful execution of the selected NLP service, a set of retrieved results is presented to the user in a dedicated ‘Semantic Assistants’ view part. The NLP annotations are contained inside dynamically generated tables, presenting one annotation instance per row providing a one-to-one mapping of annotation instances to entities inside the software artifacts. The plug-in also offers additional, Eclipse-specific features. For instance, when executing source code related NLP services, special markers are dynamically generated to attach annotation instances to the corresponding document (provided the invocation
Intelligent Software Development Environments: Integrating NLP with Eclipse
413
results contain the position of the generated annotations in the code). This offers a convenient way for users to navigate directly from annotation instances in the Semantic Assistants view to the line of code in the project where it actually belongs, in the same fashion as navigating from compiler warnings and errors to their location in the code.
5
Applications: NLP in Software Development
In this section, we discuss application examples, showing how software engineers can benefit from integrated NLP services. One of them, the quality analysis of source code comments, is presented with a detailed evaluation. 5.1
Working with NLP Services in Eclipse
Once the Semantic Assistants plug-in is successfully installed, users can start using the NLP services directly from the Eclipse environment on the resources available within the current workspace. One of the features of our plug-in is a new menu entry in the standard Eclipse toolbar:
This menu entry allows a user to enquire about available NLP services related to his context. Additionally, users can manually configure the connection to the Semantic Assistants server, which can run locally or remote. Upon selecting the ‘Available Assistants’ option, the plug-in connects to the Semantic Assistants server and retrieves the list of available language services generated by the server through reading the NLP service OWL metadata files. Each language service has a name and a brief description explaining what it does. The user then selects individual files or even complete projects as input resources, and finally the relevant NLP service to be executed. The results of a successful service invocation are shown to the user in an Eclipse view part called “Semantic Assistants”. In the mentioned view, a table will be generated dynamically based on the server response that contains all the parsed annotation instances. For example, in Fig. 5, the JavadocMiner service has been invoked on a Java source code file. Some of the annotations returned by the server bear a lineNumber feature, which attaches an annotation instance to a specific line in the Java source file. After double-clicking on the annotation instance in the Semantic Assistants view, the corresponding resource (here, a .java file) will be opened in an editor and an Eclipse warning marker will appear next to the line defined by the annotation lineNumber feature.
414
5.2
R. Witte et al.
Named Entity Recognition
The developed plug-in allows to execute any NLP pipeline deployed in GATE, not just software engineering services. For example, standard information extraction (IE) becomes immediately available to software developers. Fig. 4 shows a sample result set of an ANNIE invocation, a named entity recognition service running on the licensing documentation of a Java class. ANNIE can extract various named entities such as Person, Organization, or Location. Here, each row in the table represents a named entity and its corresponding resource file and bears the exact offset of the entity inside the textual data so it can be easily located. NE recognition can allow a software engi- Fig. 3. Semantic Assistants Invocation dialogue in neer to quickly locate im- Eclipse, selecting artifacts to send for analysis portant concepts in a software artifact, like the names of developers, which is important for a number of tasks, including traceability link analysis. 5.3
Quality Analysis of Source Code Comments
The goal of our JavadocMiner tool [4] is to enable users to automatically assess the quality of source code comments. The JavadocMiner is also capable of providing users with recommendations on how a Javadoc comment may be improved
Fig. 4. Retrieved NLP Annotations from the ANNIE IE Service
Intelligent Software Development Environments: Integrating NLP with Eclipse
415
based on the “How to Write Doc Comments for the Javadoc Tool” guidelines.2 Directly integrating this tool with the Eclipse framework now allows software engineers to view defects in natural language in the same way as defects in their code. In-line Documentation and Javadoc. Creating and maintaining documentation has been widely considered as an unfavourable and labour-intensive task within software projects [8]. Documentation generators currently developed are designed to lessen the efforts needed by developers when documenting software, and have therefore become widely accepted and used. The Javadoc tool [6] provides an inter-weaved representation where documentation is directly inserted into Java source code in the form of comments that are ignored by compilers. Different types of comments are used to document the different types of identifiers. For example, a class comment should provide insight on the high-level knowledge of a program, e.g., which services are provided by the class, and which other classes make use of these services [9]. A method comment, on the other hand, should provide a low-level understanding of its implementation. When writing comments for the Javadoc tool, there are a number of guideline specifications that should be followed to ensure high quality comments. The specifications include details such as: (1) Use third person, declarative, rather than second person, prescriptive; (2) Do not include any abbreviations when writing comments; (3) Method descriptions need to begin with verb phrases; and (4) Class/interface/field descriptions can omit the subject and simply state the object. These guidelines are well suited for automation through NLP analysis. Automated Comment Quality Analysis. Integrating the JavadocMiner with our Eclipse plug-in provides for a completely new style of software development, where analysis of natural language is interweaved with analysis of code. Fig. 5, shows an example of an ArgoUML3 method doesAccept loaded within the Eclipse IDE. After analyzing the comments using the JavadocMiner, the developer is made aware of some issues regarding the comment: (1) The PARAMSYNC metric detected an inconsistency between the Javadoc @param annotation and the method parameter list: The developer should modify the annotation to begin with the name of the parameter being documented, “objectToAccept” instead of “object” as indicated in PARAMSYNC Explanation. (2) The readability metrics [4] detected the Javadoc comment as being below the Flesch threshold FLESCHMetric and FleschExplanation, and above the Fog threshold FOGMetric and FOGExplanation, which indicates a comment that exceeds the readability thresholds set by the user. (3) Because the comment does not use a third person writing style as stated in guideline (1), the JavadocMiner generates a recommendation MethodCommentStyle that explains the steps needed in order for the comment to adhere to the Javadoc guidelines. 2 3
http://oracle.com/technetwork/java/javase/documentation/ index-137868.html ArgoUML, http://argouml.tigris.org/
416
R. Witte et al.
Fig. 5. NLP analysis results on a ArgoUML method within Eclipse
End-User Evaluation. We performed an end-user study to compare how well automated NLP quality analysis in a software framework can match human judgement, by comparing the parts of the in-line documentation that were evaluated by humans with the results of the Javadoc-Miner. For our case study, we asked 14 students from an undergraduate level computer science class (COMP 354), and 27 students from a graduate level software engineering course (SOEN 6431) to evaluate the quality of Javadoc comments taken from the ArgoUML open source project [10]. For our survey, we selected a total of 110 Javadoc comments: 15 class and interface comments, 8 field comments, and 87 constructor and method comments. Before participating in the survey, the students were asked to review the Javadoc guidelines discussed earlier. The students had to log into the free online survey tool Kwik Surveys 4 using their student IDs, ensuring that all students completed the survey Fig. 6. A Sample Question from the Survey only once. The survey included a set of general questions such as the level of general (Table 1, left) and Java (Table 1, right) programming experience. The students were able to rate the comments as either Very Poor, Poor, Good, or Very Good as shown in Fig. 6, giving the comments a 50% chance of being positively or negatively classified. This also enabled us to know how strongly the participants felt about their sentiments, compared to using just a Good or Bad 4
Kwik Surveys, http://www.kwiksurveys.com/
Intelligent Software Development Environments: Integrating NLP with Eclipse
417
Table 1. Years of general and Java programming experience of study participants General Experience Java Experience Class 0 Years 1-2 Years 3+ Years 0 Years 1-2 Years 3+ Years COMP 354 11% 31% 58% 7% 61% 32% SOEN 6431 02% 22% 76% 10% 49% 41%
selection. From the 110 manually assessed comments, we selected a total of 67 comments: 5 class and interface comments, 2 field comments, and 60 constructor and method comments, that had strong agreement (≥ 60%) as being of either good (39 comments) or bad (28 comments) quality. When comparing the student evaluation of method comments with some of the NL measures of the JavadocMiner (Table 2), we found that the comments that were evaluated negatively contained half as many words (14) compared to the comments that were evaluated as being good. Regardless of the insufficient documentation of the bad comments, the readability index of Flesch, Fog and Kincaid indicated text that contained a higher density, or more complex material, which the students found hard to understand. All of the methods in the survey contained parameter lists that needed to be documented using the @param annotation. When analysing the results of the survey, we found that most students failed to analyze the consistency between the code and comFig. 7. A Sample Answer from the Survey ments as shown in Fig. 7. Our JavadocMiner also detected a total of 8 abbreviations being used within comments, that none of the students mentioned. Finally, for twelve of the 39 comments that were analyzed by the students as being good, 12 of them were not written in third-person according to the guidelines, a detail that all students also failed to mention.
6
Related Work
We are not aware of similar efforts for bringing NLP into the realm of software development by integrating it tightly with a software IDE. Some previous works exist on NLP for software artifacts. Most of this research has focused on analysing texts at the specification level, e.g., in order to automatically convert use case descriptions into a formal representation [11] or detect inconsistent requirements [12]. In contrast, we aim to support the roles of software developer, maintainer, and quality assurance engineer.
418
R. Witte et al. Table 2. Method Comments Evaluated by Students and the JavadocMiner Student Evaluation Avg. Number of Words Avg. Flesch Avg. Fog Avg. Kincaid Good 28.03 39.2 12.63 10.55 Bad 14.79 5.58 13.98 12.66
There has been effort in the past that focused on analyzing source code comments; For example, in [13] human annotators were used to rate excerpts from Jasper Reports, Hibernate and jFreeChart as being either More Readable, Neutral or Less Readable, as determined by a “Readability Model”. The authors of [14] manually studied approximately 1000 comments from the latest versions of Linux, FreeBSD and OpenSolaris. The work attempts to answer questions such as 1) what is written in comments; 2) whom are the comments written for or written by; 3) where the comments are located; and 4) when the comments were written. The authors made no attempt to automate the process. Automatically analyzing comments written in natural language to detect codecomment inconsistencies was the focus of [15]. The authors explain that such inconsistencies may be viewed as an indication of either bugs or bad comments. The author’s implement a tool called iComment that was applied on 4 large Open Source Software projects: Linux, Mozilla, Wine and Apache, and detected 60 comment-code inconsistencies, 33 new bugs and 27 bad comments. None of the works mentioned in this section attempted to generalize the integration of NLP analysis into the software development process, which is a major focus of our work.
7
Conclusions and Future Work
We presented a novel integration of NLP into software engineering, through a plug-in for the Eclipse platform that allows to execute any existing GATE NLP pipeline (like the ANNIE information extraction system) through a Web service. The Eclipse plug-in, as well as the Semantic Assistants architecture, is distributed as open source software.5 Additionally, we presented an example NLP service, automatic quality assessment of source code comments. We see the importance of this work in two areas: First, we opened up the domain of NLP to software engineers. While some existing work addressed analysis services before, they have not been adopted in software engineering, as they were not integrated with common software development tools and processes. And second, we demonstrate the importance of investigating interactive NLP, which so far has received less attention than the typical offline corpus studies. Our case study makes a strong case against a human’s ability to manage the various aspects of documentation quality without (semi-)automated help of NLP tools such as the JavadocMiner. By embedding NLP within the Eclipse IDE, developers need to spend less efforts when analyzing their code, which we believe will lead to a wider adoption of NLP in software engineering. 5
See http://www.semanticsoftware.info/semantic-assistantseclipse-plugin
Intelligent Software Development Environments: Integrating NLP with Eclipse
419
Acknowledgements. This research was partially funded by an NSERC Discovery Grant. The JavadocMiner was funded in part by a DRDC Valcartier grant (Contract No. W7701-081745/001/QCV).
References 1. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: A framework and graphical development environment for robust NLP tools and applications. In: Proceedings of the 40th Annual Meeting of the ACL (2002) 2. Witte, R., Gitzinger, T.: Semantic Assistants – User-Centric Natural Language Processing Services for Desktop Clients. In: Domingue, J., Anutariya, C. (eds.) ASWC 2008. LNCS, vol. 5367, pp. 360–374. Springer, Heidelberg (2008) 3. Lindvall, M., Sandahl, K.: How well do experienced software developers predict software change? Journal of Systems and Software 43(1), 19–27 (1998) 4. Khamis, N., Witte, R., Rilling, J.: Automatic Quality Assessment of Source Code Comments: The JavadocMiner. In: Hopfe, C.J., Rezgui, Y., M´etais, E., Preece, A., Li, H. (eds.) NLDB 2010. LNCS, vol. 6177, pp. 68–79. Springer, Heidelberg (2010) 5. Kotula, J.: Source Code Documentation: An Engineering Deliverable. In: Int. Conf. on Technology of Object-Oriented Languages, p. 505. IEEE Computer Society, Los Alamitos (2000) 6. Kramer, D.: API documentation from source code comments: a case study of Javadoc. In: SIGDOC 1999: Proceedings of the 17th Annual International Conference on Computer Documentation, pp. 147–153. ACM, New York (1999) 7. Khamis, N., Rilling, J., Witte, R.: Generating an NLP Corpus from Java Source Code: The SSL Javadoc Doclet. In: New Challenges for NLP Frameworks, Valletta, Malta, ELRA, May 22, pp. 41–45 (2010) 8. Brooks, R.E.: Towards a Theory of the Comprehension of Computer Programs. International Journal of Man-Machine Studies 18(6), 543–554 (1983) 9. Nurvitadhi, E., Leung, W.W., Cook, C.: Do class comments aid Java program understanding? In: Frontiers in Education (FIE), vol. 1 (November 2003) 10. Bunyakiati, P., Finkelstein, A.: The Compliance Testing of Software Tools with Respect to the UML Standards Specification - The ArgoUML Case Study. In: Dranidis, D., Masticola, S.P., Strooper, P.A. (eds.) AST, pp. 138–143. IEEE, Los Alamitos (2009) 11. Mencl, V.: Deriving behavior specifications from textual use cases. In: Proceedings of Workshop on Intelligent Technologies for Software Engineering, pp. 331–341. Oesterreichische Computer Gesellschaft, Linz (2004) 12. Kof, L.: Natural language processing: Mature enough for requirements documents analysis? In: Montoyo, A., Mu´ noz, R., M´etais, E. (eds.) NLDB 2005. LNCS, vol. 3513, pp. 91–102. Springer, Heidelberg (2005) 13. Buse, R.P.L., Weimer, W.R.: A metric for software readability. In: Proc. Int. Symp. on Software Testing and Analysis (ISSTA), New York, NY, USA, pp. 121–130 (2008) 14. Padioleau, Y., Tan, L., Zhou, Y.: Listening to programmers Taxonomies and characteristics of comments in operating system code. In: ICSE 2009, pp. 331–341. IEEE Computer Society, Washington, DC (2009) 15. Tan, L., Yuan, D., Krishna, G., Zhou, Y.: /*icomment: bugs or bad comments?*/. In: SOSP 2007: Proceedings of Twenty-first ACM SIGOPS Symposium on Operating Systems Principles, pp. 145–158. ACM, New York (2007)
Partial Evaluation for Planning in Multiagent Expedition Y. Xiang and F. Hanshar University of Guelph, Canada
Abstract. We consider how to plan optimally in a testbed, multiagent expedition (MAE), by centralized or distributed computation. As optimal planning in MAE is highly intractable, we investigate speedup through partial evaluation of a subset of plans whereby only the intended effect of a plan is evaluated when certain conditions hold. We apply this technique to centralized planning and demonstrate significant speedup in runtime while maintaining optimality. We investigate the technique in distributed planning and analyze the pitfalls.
1
Introduction
We consider a class of stochastic multiagent planning problems termed multiagent expedition (MAE) [8]. A typical instance consists of a large open area populated by objects as well as mobile agents. Agent activities include moving around the area, avoiding dangerous objects, locating objects of interest, and object manipulation depending on the nature of the application. Successful manipulation of an object may require proper actions of a single agent or may require cooperation of multiple agents coordinating through limited communication. Success of an agent team is evaluated based on the quantity of objects manipulated as well as the quality of each manipulation. MAE is an abstraction of practical problems such as planetary expedition or disaster rescue [3]. Planning in MAE may be achieved by centralized or distributed computation. Its centralized version can be shown to be a partially observable Markov decision process (POMDP) and its distributed version can be shown to be a decentralized POMDP (DEC-POMDP). A number of techniques have been proposed for solving POMDPs [4,6]. The literature for DEC-POMDPs is growing rapidly, e.g., [1,5]. Optimal planning is highly intractable in general for either POMDP or DEC-POMDP. Inspired by branch-and-bound techniques to improve planning efficiency [2], we propose a method partial evaluation that focuses on the intended effect of a plan and skips evaluation of unintended effects when certain conditions are met. We focus on on-line planning. We experiment with partial evaluation for centralized planning in MAE and demonstrate a significant speedup in runtime while maintaining plan optimality. We also examine its feasibility in distributed planning. It is found to be limited by local optimality without guaranteed global optimality or intractable agent communication. This result yields insight into distributed planning that suggests future research on approximate planning. C. Butz and P. Lingras (Eds.): Canadian AI 2011, LNAI 6657, pp. 420–432, 2011. c Springer-Verlag Berlin Heidelberg 2011
Partial Evaluation for Planning in Multiagent Expedition
421
The remainder of the paper is organized as follows: Section 2 reviews background on MAE. Sections 3-6 present partial evaluation for centralized planning with experimental results reported in Section 7. Section 8 first reviews background on collaborative design networks (CDNs), a multiagent graphical model for distributed decision making, and then investigates partial evaluation for distributed planning based on CDNs.
2
Background on Multiagent Expedition
In MAE, an open area is represented as a grid of cells (Figure 1 (a)). At any cell, an agent can move to an adjacent cell by actions north, south, east, west or remain there (halt). An action has an intended effect (e.g., north in Figure 1 (d)) and a number of unintended effects (other outcomes in (d)), quantified by transition probabilities. 1
reward
0.9
0
(a)
(b)
0.025
0.025 0.025
1
2
3
# agents
4
0.025
(c)
(d)
Fig. 1. a) Grid of cells and reward distribution in MAE. b) Cell reward distribution. c) Agent’s perceivable area. d) Intended effect (arrow) of action north.
The desirability of a cell is indicated by a numerical reward. A neutral cell has a reward of a base value β. The reward at a harmful cell is lower than β. The reward at an interesting cell is higher than β and can be further increased through agent cooperation. When a physical object is manipulated (e.g., by digging), cooperation is often most effective when a certain number of agents are involved, and the per-agent productivity is reduced with more or less agents. We denote the most effective level by λ. Figure 1(b) shows the reward distribution of a single cell with λ = 2. At this cell, the reward collected by a single agent is 0.3, if two agents cooperate at the cell, each receives 0.8. Reward decreases with more than λ agents, promoting only effective cooperations. After a cell has been visited by any agent, its reward is decreased to β. As a result, wandering within a neighbourhood is unproductive. Agents have no prior knowledge how rewards are distributed in the area. Instead, at any cell, an agent can reliably perceive its location and reward distribution within a small radius (e.g. shaded cells in Figure 1(c)). An agent can also perceive the location of another agent and communicate if the latter is within a given radius. Each agent’s objective is to move around the area, cooperate as needed, and maximize the team reward over a finite horizon based on local observations and
422
Y. Xiang and F. Hanshar
limited communication. For a team of n agents and horizon h, there are 5nh joint plans each of which has 5nh possible outcomes. With n = 6 and h = 2, a total of 524 ≈ 6 × 1016 uncertain outcomes need evaluated. Hence, solving MAE optimally is highly intractable. In the following, we refer to maximization of reward and utility interchangeably with the following assumption: A utility is always in [0, 1], no matter if it is the utility of an action, or a plan (a sequence of actions), or an joint action (simultaneous actions by multiple agents), or a joint plan (a sequence of joint actions). In each case, the utility is mapped linearly from [min reward, max reward], with min reward and max reward properly defined accordingly.
3
Partial Evaluation
We study how to speedup planning in the context of MAE, based on an idea: partial evaluation. Let a be an action with two possible outcomes: an intended and an unintended. The intended outcome has the probability p1 and utility u1 , and the unintended p2 = 1 − p1 and u2 , respectively. Its expected utility is evaluated as eu = p1 u1 + p2 u2 .
(1)
Let a be an alternative action with the same outcome probabilities p1 (for intended) and p2 , and utilities u3 and u4 , respectively. Its expected utility is eu = p1 u3 + p2 u4 . The alternative action a is dominated by a if eu − eu = eu − p1 u3 − p2 u4 > 0.
(2)
From Eqn (2), the following holds: u3 <
eu p2 − u4 p1 p1
(3)
Letting umax denote the maximum utility achievable, we have eu p2 eu p2 − umax ≤ − u4 . p1 p1 p1 p1
(4)
Eqn (3) is guaranteed to hold if we maintain u3 <
eu 1 − p1 − umax ≡ t. p1 p1
(5)
When the number of alternative actions is large, the above idea can be used to speed up search for best action: For an unevaluated action a , if u3 satisfies Eqn (5), discard a . We say that a is partially evaluated. Otherwise, eu will be fully evaluated. If eu exceeds eu, then a will be updated as a , eu updated as eu , and u1 will be updated as u3 . Eqn (5) allows more efficient search without losing optimality, and is an exact criterion for partial evaluation. The actual speed-up depends on the threshold t for u3 . The larger the value of t, the less actions that must be fully evaluated, and the more efficient the search.
Partial Evaluation for Planning in Multiagent Expedition
423
Consider the value of umax . When utility is bounded by [0, 1], we have the obvious option umax = 1. That is, we derive umax from the global utility distribution over all outcomes of actions. Threshold t increases as umax decreases. Hence, it is desirable to use a smaller umax while maintaining Eqn (4). One option to achieve this is to use umax from the local utility distribution over only outcomes of current alternative actions. The trade-off is the following: With umax = 1, it is a constant. With the localized umax , it must be updated before each planning.
4
Single-Agent Expedition
In single-agent expedition, an action a has an intended outcome and four unintended ones. We assume that the intended outcome of all actions have the same probability p1 , and unintended outcomes have the same probability (1 − p1 )/4. Hence, we have eu = p1 u1 +
4
u2,i (1 − p1 )/4,
(6)
i=1
where u2,i is the utility of the ith unintended outcome. Comparing Eqn (1) and Eqn (6) , we have p2 u2 =
4
1 − p1 1 = (1 − p1 ) ( u2,i ). 4 4 i=1 4
u2,i
i=1
If we aggregate the four unintended outcomes as an equivalent single unintended outcome, then this outcome has probability p2 = 1 − p1 and utility 4 u2 = 14 i=1 u2,i . Let uamax (where ‘a’ in ‘ua’ refers to ‘agent’) denote the maximum utility 4 of outcomes. Substituting u2 in Eqn (2) by 14 i=1 u2,i , repeating the analysis 4 after Eqn (2), and noting that 14 i=1 u2,i is upper-bounded by uamax , we have an exact criterion for partial evaluation: u3 < t =
1 1 − p1 eu − uamax p1 p1
(7)
As discussed in the last section, the smaller the 4value of uamax , the more efficient the search. Since uamax was replacing 14 i=1 u4,i (compare Eqns (3) 4 and (5)), we can alternatively replace 14 i=1 u4,i with an upper bound tighter 4 than uamax . Since 14 i=1 u4,i is essentially the average utility over unintended outcomes, we can replace uamax by α uaavg , where uaavg is the average (local) utility of outcomes and α ≥ 1 is a scaling factor. This yields the following: u3 < t =
1 1 − p1 eu − α uaavg p1 p1
(8)
According to Chebyshev’s inequality, the smaller the variance of utilities over outcomes, the closer to 1 the α value can be without losing planning optimality.
424
5
Y. Xiang and F. Hanshar
Single Step MAE by Centralized Planning
Next, we consider multiagent expedition with n agents. Each agent action has k alternative outcomes o1 , ..., ok , where o1 is the intended with probability p. A joint action by n agents consists of a tuple of n individual actions and is denoted by a. The intended outcome of a is the tuple made of the intended outcomes of individual actions, and is unique. We denote the utility of the intended outcome of a by u. Outcomes of individual agent actions are independent of each other given the joint action plan. Hence, the intended outcome of a has probability pn . The expected utility of a is eu = pn u +
p i ui ,
(9)
i
where i indexes unintended outcomes, ui is the utility of an unintendedoutcome, and pi is its probability. Note that pi = pj in general for i = j, and pn + i pi = 1. Let a be an alternative joint action whose intended outcome has utility u . Denote the expected utility of a by eu . The joint action a is dominated by joint action a if eu − eu = eu − pn u −
pi ui > 0.
(10)
i
Eqn (10) can be rewritten as follows: u < p−n (eu −
i
pi ui )
Let utsavg (where ‘t’ in ‘uts’ refers to ‘team’ and ‘s’ refers to ‘single step’) denote the average utility of outcomes of joint actions. From pi pi 0 < pi < 1 − pn , 0 < 1−p n < 1, i 1−pn = 1,
pi ui = (1 − pn )
i
pi i 1−pn
i
pi u , 1 − pn i
ui
we have the expected value of (weighted mean with normalized weights) to be utsavg , and the expected value of i pi ui to be (1 − pn ) utsavg . We can choose α ≥ 1 (e.g. based on Chebyshev’s inequality) so that it is highly probable i pi ui ≤ (1 − pn ) α utsavg and hence eu − i pi ui ≥ eu − (1 − pn ) α utsavg . It then follows from Eqn (10) that the joint action a is dominated by a with high probability if the following holds, u < t =
eu 1 − pn − α utsavg , pn pn
(11)
in which case a can be discarded without full evaluation. Note that the condition is independent of k. In order to compute u by any agent Ag, it needs to know the intended outcome of the action in a for each other agent, and use this information to determine if any cooperation occurs in the intended outcome of a . To do so, it suffices for
Partial Evaluation for Planning in Multiagent Expedition
425
Ag to know the current location of each agent as well as a . Ag also needs to know the unilateral or cooperative reward associated with the intended outcome to calculate u . When other agents are outside of the observable area of Ag, this information must be communicated to Ag. Similarly, in order to compute utsavg , Ag needs to collect from other agents the average rewards in their local areas. Alternatively, following a similar analysis, we could base threshold t on utsmax , the maximum utility achievable by the outcome of any joint action, and test u by the following condition: u < t =
eu 1 − pn − utsmax pn pn
(12)
Since utsmax > α utsavg , the search is less efficient, but its probability to get the optimal plan is 1. To compute utsmax , Ag needs to collect from other agents the maximum rewards in their local areas, instead of average rewards as in the case of utsavg .
6
Multi-step MAE by Centralized Planning
Consider multiagent expedition with horizon h ≥ 2 (single step is equivalent to h = 1). Each agent selects a sequence a of h actions. The n agents collectively select a joint plan A (an n × h array). The intended outcome of joint plan A is made of the intended outcomes of all individual actions of all agents. Assume that the outcome of each individual action of each agent is independent of outcomes of its own past actions and is independent of outcomes of actions of other agents (as is the case in MAE). Then the probability of the intended outcome of joint plan A is phn . We denote the utility of the intended outcome of A by u. The expected utility of A is then eu = phn u +
pi ui ,
(13)
i
where i indexes unintended outcomes, ui is the utility of an unintended outcome, and pi is its probability. Note that phn + i pi = 1. Let A be an alternative joint plan whose intended outcome has utility u . Denote the expected utility of A by eu . The joint plan A is dominated by A if eu − eu = eu − phn u −
pi ui > 0.
(14)
i
Through an analysis similar to that in the last section, and from the similarity of Eqns (14) and (10), we can conclude the following: Let utmavg (where ‘m’ in ‘utm’ refers to ‘multi-step’) denote the average utility of outcomes of joint plans. Let α ≥ 1 to be a scaling factor. With a large enough α value, the joint plan A is dominated with high probability by plan A if the following inequation holds, u < t =
eu 1 − phn − α utmavg , hn p phn
in which case A can be discarded without full evaluation.
(15)
426
Y. Xiang and F. Hanshar
In order to compute u by any agent Ag, it needs to know A , the current location of each agent, and unilateral or cooperative reward associated with the intended outcomes. In order to compute utmavg , Ag needs to collect from other agents average rewards in their local areas. To increase the probability of plan optimality to 1, Ag can use the following test, with the price of less efficient search: u < t =
7
eu 1 − phn − utmmax hn p phn
(16)
Centralized Planning Experiment
The experiment aims to provide empirical evidence on efficiency gain and optimality of partial evaluation in multi-step MAE by centralized planning. Two MAE environments are used that differ in transition probability pt (0.8 or 0.9) for intended outcomes. Agent teams of size n = 3, 4 or 5 are run. The base reward β = 0.05. The most effective level of cooperation is set at λ = 2. Planning horizon is h = 2. Several threshold values from Section 6 are tested. The first, utmmax,1 = 1, corresponds to the global maximum reward. The second, utmmax , corresponds to the local maximum reward for each agent. The third, utmavg,α = α utmavg , corresponds to average reward over outcomes, scaled up by α. We report result for α = 1 as well as for a lower bound that yields an optimal plan by increasing α in 0.25 increments. Tables 1 and 2 show the result for different values of pt . Each row corresponds to an experiment run. F ull% refers to the percentage of plans fully evaluated. BF R denotes the team reward of the best joint plan found, and an asterisk indicates if the plan is optimal. BF R% denotes ratio of BF R over reward of optimal plan. T ime denotes runtime in seconds. The results show that partial evaluation based on utmmax,1 is conservative: all plans are fully evaluated in 4 out of 6 runs. Second, utmmax finds an optimal plan Table 1. Experiments with pt = 0.9 n Threshold utmmax,1 utmmax 3 utmavg,1 utmavg,3 utmmax,1 4 utmmax utmavg,1 utmmax,1 utmmax 5 utmavg,1 utmavg,5
Full%. 48.87 0.780 0.172 0.812 83.51 0.053 0.046 100 0.002 0.001 0.19
BFR 3.192* 3.192* 3.102 3.192* 4.940* 4.940* 4.940* 5.262* 5.046 5.046 5.262*
BFR% Time 100 3.3 100 0.3 97.18 0.1 100 0.3 100 142.6 100 2.5 100 1.9 100 4671.2 95.89 52.2 95.89 52.1 100 62.4
Table 2. Experiments with pt = 0.8 n Threshold utmmax,1 utmmax 3 utmavg,1 utmavg,3 utmmax,1 4 utmmax utmavg,1 utmmax,1 utmmax 5 utmavg,1 utmavg,4.5
Full%. 100 2.0 0.16 2.25 100 0.068 0.051 100 0.002 0.001 1.704
BFR 2.407* 2.407* 2.327 2.407* 3.630* 3.630* 3.630* 3.902* 3.745 3.745 3.902*
BFR% 100 100 96.67 100 100 100 100 100 95.97 95.97 100
Time 6.2 0.2 0.1 0.2 167.3 19.6 19.0 6479.5 53.5 52.3 136.0
Partial Evaluation for Planning in Multiagent Expedition
427
in 4 out of 6 runs, and utmavg,1 in 2 out of 6 runs. Third, partial evaluation based on utmmax and utmavg,α shows significant speedup on all runs. For example, with pt = 0.8, n = 5 and utmavg,α , an optimal plan is found when α = 4.5 and only 1.7% of joint plans are fully evaluated. The planning takes 136 seconds or 2% of the runtime (108min) by utmmax,1 which evaluates all plans fully. Table 3. Mean (μ) and standard deviation (σ) of team rewards over all plans n 3 4 5
# Plans pt μ σ pt μ σ 15,625 0.558 0.342 0.542 0.260 390,625 0.9 0.738 0.462 0.8 0.713 0.352 9,765,625 0.914 0.514 0.882 0.342
Table 3 shows the mean and standard deviation of team rewards over all joint plans for n = 3, 4 and 5, and pt = 0.8 and 0.9. The mean team reward in each case is no more than 23% of the corresponding optimal reward in Tables 1 and 2. For example, consider n = 5 and pt = 0.8, the optimal reward from Table 2 is 3.902 whereas the mean reward is 0.882, approximately 23% of the magnitude of the optimal plan. This signifies that the search space is full of low reward plans with very few good plans. Searching such a plan space is generally harder than a space full of high reward plans. The result demonstrates that partial evaluation is able to traverse the search space, skip full evaluation of many low reward plans, and find high reward plans. This is true even for relatively aggressive threshold utmavg,1 , achieving at least 95% of the optimal reward (see Table 2).
8 8.1
Partial Evaluation in Distributed Planning Collaborative Design Networks
Distributed planning in MAE can be performed based on multiagent graphical models, known as collaborative design networks (CDNs) [8], whose background is reviewed in this subsection. CDN is motivated by industrial design in supply chains. An agent responsible for a component encodes design knowledge into a design network (DN) S = (V, G, P ). The domain is a set of discrete variables V = D ∪ T ∪ M ∪ U . D is a set of design parameters. T is a set of environmental factors of the product under design. M is a set of objective performance measures and U is a set of subjective utility functions of the agent. Dependence structure G = (V, E) is a directed acyclic graph (DAG) whose nodes are mapped to elements of V and whose set E of arcs is from the following legal types: Arc (d, d ) (d, d ∈ D) signifies a design constraint. Arc (d, m) (m ∈ M ) represents dependency of performance on design. Arc (t, t ) (t, t ∈ T ) represents dependency between environmental factors. Arc (t, m) signifies dependency of performance on environment. Arc (m, m ) defines a composite performance measure. Arc (m, u) (u ∈ U ) signifies dependency of utility on performance.
428
Y. Xiang and F. Hanshar
P is a set of potentials, one for each node x, formulated as a probability distribution P (x|π(x)), where π(x) are parent nodes of x. P (d|π(d)), where d ∈ D, encodes a design constraint. P (t|π(t)) and P (m|π(m)), where t ∈ T, m ∈ M , are typical probability distributions. Each utility variable has a space {y, n}. P (u = y|π(u)) is a utility function util(π(u)) ∈ [0, 1]. Each node u is assigned a weight k ∈ [0, 1] where U k = 1. With P thus defined, x∈V \U P (x|π(x)) is a joint probability distribution (JPD) over D ∪ T ∪ M . Assuming additive independence among utility variables, the expected utility of a design d is EU (d) = k ( i i m ui (m)P (m|d)), where d (bold) is a configuration of D, i indexes utility nodes in U , m (bold) is a configuration of parents of ui , and ki is the weight of ui . Each supplier is a designer of a supplied component. Agents, one per supplier, form a collaborative design system. Each agent embodies a DN called a subnet and agents are organized into a hypertree: Each hypernode corresponds to an agent and its subnet. Each hyperlink (called an agent interface) corresponds to design parameters shared by the two subnets, which renders them conditionally independent. They are public and other subnet variables are private. The hypertree specifies to whom an agent communicates directly. Each subnet is assigned a weight wi , representing a compromise of preferences among agents, where i wi = 1. The collection of subnets {Si = (Vi , Gi , Pi )} forms a CDN. Figure 2 shows a trivial CDN for agents A0 , A1 , A2 . 00 s1 m2 m1 s0 s 0 s11 00 11 1 00 11 0 1 000 111 00 11 0 1 00 11 000 0 1 11 00 00 111 11 000 111 0 1 00 11 G2 11 00 G1 G0 00 11 000 111 000 111 0 d0 1 00 111 11 000 000 111 000 00 111 0 1 000 11 111 m m 4 111 000 3 00 11 0 00 11 000 111 u1 1 000 00u 2 111 11 u0 m0 00 11
Fig. 2. Subnets G0, G1, G2 (left) and hypertree (right) of a CDN. Design nodes are denoted by s if public and d if private, performance nodes by m, and utility nodes by u.
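As a concrete illustration, the following Python sketch evaluates EU(d) = Σ_i k_i (Σ_m u_i(m) P(m|d)) for a toy design network with one design parameter, one performance measure, and two utility nodes. All tables, weights, and names are illustrative assumptions rather than values from the paper.

```python
# Minimal sketch of EU(d) = sum_i k_i * sum_m u_i(m) P(m | d) for one DN.
# Tables and weights below are made up for illustration.

def expected_utility(design, utility_nodes):
    """utility_nodes: list of (weight k_i, util function u_i, P(m | d) table)."""
    eu = 0.0
    for k_i, util_i, p_m_given_d in utility_nodes:
        # Sum over configurations m of the performance parents of u_i.
        eu += k_i * sum(p * util_i(m) for m, p in p_m_given_d(design).items())
    return eu

# One design parameter with options d0/d1, one performance measure m in
# {low, high}, and two utility nodes with weights 0.6 and 0.4 (summing to 1).
def p_m(design):
    return {"high": 0.8, "low": 0.2} if design == "d1" else {"high": 0.3, "low": 0.7}

nodes = [
    (0.6, lambda m: 1.0 if m == "high" else 0.1, p_m),
    (0.4, lambda m: 0.9 if m == "high" else 0.0, p_m),
]
best = max(["d0", "d1"], key=lambda d: expected_utility(d, nodes))  # -> "d1"
```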
The product ∏_{x ∈ V\∪_i U_i} P(x|π(x)) is a JPD over ∪_i (D_i ∪ T_i ∪ M_i), where P(x|π(x)) is the potential associated with node x in a subnet. The expected utility of a design d is EU(d) = Σ_i w_i (Σ_j k_ij (Σ_m u_ij(m) P(m|d))), where d is a configuration of ∪_i D_i, i indexes subnets, j indexes the utility nodes {u_ij} in the ith subnet, m is a configuration of the parents of u_ij, and k_ij is the weight associated with u_ij. Given a CDN, decision-theoretic optimal design is well defined. Agents evaluate local designs in batch before communicating over agent interfaces. An arbitrary agent is chosen as the communication root, and communication is divided into a collect stage and a distribute stage. Collect messages propagate expected-utility evaluations of local designs inwards along the hypertree towards the root. A receiving agent learns the best utility of every local configuration when extended by the partial designs of downstream agents. At the end of the collect stage, the root agent knows the expected utility of the optimal design. Distribute messages then propagate outwards along the hypertree from the root. After the distribute stage, each agent has
identified its local design that is globally optimal (i.e., the local designs collectively maximize EU(d)). Computation, including communication, is linear in the number of agents [7] and is efficient for sparse CDNs.
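The collect/distribute computation can be sketched on a small chain hypertree. The fragment below assumes each agent has already folded its subnet weight into a table of best local expected utilities indexed by its interface configuration; the chain A0–A1–A2 and all numbers are made up for illustration and are not the authors' implementation.

```python
# Collect/distribute on a chain hypertree A0 - A1 - A2 with agent interfaces
# s0 (shared by A0, A1) and s1 (shared by A1, A2). Each table holds the best
# weighted local EU over the agent's private design parameters per interface
# configuration. Values are illustrative only.

eu0 = {0: 0.20, 1: 0.35}                      # A0: indexed by s0
eu1 = {(0, 0): 0.10, (0, 1): 0.30,            # A1: indexed by (s0, s1)
       (1, 0): 0.25, (1, 1): 0.05}
eu2 = {0: 0.40, 1: 0.15}                      # A2: indexed by s1

# Collect stage (leaf A2 -> A1 -> root A0): for every interface configuration,
# propagate the best utility achievable downstream.
msg_2_to_1 = {s1: eu2[s1] for s1 in eu2}                       # A2 is a leaf
msg_1_to_0 = {s0: max(eu1[(s0, s1)] + msg_2_to_1[s1] for s1 in (0, 1))
              for s0 in (0, 1)}
best_total = max(eu0[s0] + msg_1_to_0[s0] for s0 in (0, 1))    # root's optimum

# Distribute stage (root A0 -> A1 -> A2): fix interface values outwards.
s0_star = max((0, 1), key=lambda s0: eu0[s0] + msg_1_to_0[s0])
s1_star = max((0, 1), key=lambda s1: eu1[(s0_star, s1)] + msg_2_to_1[s1])
print(best_total, s0_star, s1_star)   # about 1.0, s0 = 1, s1 = 0
```

Only two rounds of messages (one inward, one outward) are needed, in line with the linear-time claim above.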
8.2 Distributed Per-plan Evaluation
We consider partial evaluation in distributed planning based on CDNs. Each MAE agent uses a DN to encode its actions (moves) as design nodes, the outcomes of actions as performance nodes, and rewards as utility nodes. The hypertree for a team of agents (A, B, C) and the DN for agent B are shown in Figure 3. An agent only models and communicates with adjacent agents on the hypertree. Movement nodes are labelled mv, performance nodes ps, and utility nodes rw.
Fig. 3. (a) DN for MAE agent B. (b) Hypertree.
Fig. 4. Message collection over the hypertree of agents A, B, C and D, where D_x denotes the domain of x; |D_x| = 2, |D_y| = 3, |D_z| = 4, and the counting messages 12, 4 and 1 label the links A–B, B–C and C–D
As shown earlier, partial evaluation relies on sequentially evaluating (fully or partially) individual joint plans. A distributed per-plan evaluation involves four technical issues: (1) How can a joint plan be evaluated fully? (2) How can it be evaluated partially? (3) As the root agent drives the sequential per-plan evaluations, how can it know the total number of joint plans when it does not know the other agents' private variables? (4) When a given joint plan is being evaluated, how does each agent know which local plan to evaluate when it does not know the joint plan as a whole? First, the existing distributed MAE planning by CDN [8] processes all plans in one batch, and at the end of the collect stage the root agent knows the utility of the optimal plan. If we reduce the batch to a single joint plan, then at the end of the collect stage the root knows the expected utility of that plan. Second, to evaluate a joint plan partially, collect messages should contain utility based only on the intended outcomes, instead of the expected utility. Third, we propose a method for the root to determine the total number of joint plans. Consider the hypertree in Figure 4 over agents A, B, C and D with root A. Assume that x, y and z are the only action variables and are public (there are no private action variables in MAE). Each agent i maintains a counting variable d_i: the number of joint plans over the agents downstream from i. Root A initiates message collection along the hypertree (Figure 4). Leaf agent D passes to C the message
d_D = 1 (D has no downstream agent). C passes d_C = d_D · |D_z| = 4 to B, and B passes d_B = d_C · |D_y| = 12 to A. In the end, A computes the total number of joint plans as d_A = d_B · |D_x| = 24. Fourth, as any joint plan is evaluated, each agent needs to know how to instantiate its local (public) variables accordingly. For instance, B needs to know the values of x and y, but not z. We assume that the order of the domain values of each public variable, e.g., x ∈ (x0, x1), is known to the corresponding agents. Joint plans are lexicographically ordered based on the domains of the public variables. Hence, the 0th joint plan corresponds to (x0, y0, z0), and the 22nd to (x1, y2, z2). We propose a message distribution for each agent to determine the values of its local variables according to the current joint plan. Each agent i maintains a working index wr_i. Root A sets wr_A to the index of the current joint plan; every other agent receives wr_i in a message. The index of a variable, say x, is denoted x_inx. Suppose A initiates message distribution with wr_A = 22. A computes x_inx = ⌊(wr_A % d_A) / d_B⌋ = 1, where % and ⌊·⌋ denote the mod and floor operations. A passes to B the index wr_B = wr_A % d_A = 22. B computes x_inx = ⌊wr_B / d_B⌋ = 1 and y_inx = ⌊(wr_B % d_B) / d_C⌋ = 2. B passes to C the index wr_C = wr_B % d_B = 10. Similar computations at C and D determine z_inx = 2.
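The counting and index-distribution steps can be checked with a short sketch that mirrors the hypertree of Figure 4. Variable names follow the text; the floor/mod decoding reproduces the example index 22 → (x1, y2, z2).

```python
# Plan counting (collect) and index distribution (distribute) for the chain
# hypertree A - B - C - D of Figure 4. Domain sizes match the example.

domain_sizes = {"x": 2, "y": 3, "z": 4}   # public action variables

# Collect stage: each agent multiplies in the domain size of the variable it
# shares with its downstream neighbour.
d_D = 1                          # D has no downstream agent
d_C = d_D * domain_sizes["z"]    # 4
d_B = d_C * domain_sizes["y"]    # 12
d_A = d_B * domain_sizes["x"]    # 24 joint plans in total

# Distribute stage: each agent decodes the indices of its own public
# variables by floor division and modulo, then forwards the remainder.
wr_A = 22                        # root sets the current joint-plan index
x_inx = (wr_A % d_A) // d_B      # 1   (A's interface variable with B)
wr_B = wr_A % d_A                # 22  (message A -> B)
y_inx = (wr_B % d_B) // d_C      # 2   (B's interface variable with C)
wr_C = wr_B % d_B                # 10  (message B -> C)
z_inx = (wr_C % d_C) // d_D      # 2   (C's interface variable with D)
wr_D = wr_C % d_C                # 2   (message C -> D)

print(d_A, (x_inx, y_inx, z_inx))   # 24 (1, 2, 2)  ->  joint plan (x1, y2, z2)
```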
The above can be combined for distributed planning with partial evaluation. It consists of a sequence of message collections followed by one message distribution. The first collection fully evaluates the first joint plan; the local maximum and average utilities from the agents are also collected and aggregated for use in all subsequent evaluations (Section 6). The second collection calls for a partial evaluation (Section 3) of the next joint plan. Upon receiving the response, A determines whether the second joint plan needs full evaluation or can be discarded. If full evaluation is needed, A issues the next collection as a full evaluation of the second plan; otherwise, a call for partial evaluation of the third joint plan is issued. This process continues until all joint plans are evaluated. One distribution after all plans have been evaluated communicates the optimal plan: if the 22nd joint plan is optimal, a message distribution as described earlier suffices for each agent to determine its optimal local plan. It can be shown that the above protocol achieves the same level of optimality as centralized planning. However, one round of communication is required for each joint plan, resulting in an amount of communication exponential in the number of agents and the horizon length. This differs from the existing method for planning in CDNs (Section 8.1), where two rounds of communication suffice.
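The root's control loop for this protocol can be sketched as follows. The message primitives and the promotion test are placeholders standing in for the collection rounds and the criteria of Sections 3 and 6, not the authors' protocol code.

```python
# Sketch of the root agent's sequential per-plan evaluation. collect_full(i)
# and collect_partial(i) stand for one collection round that fully / partially
# evaluates joint plan i and returns the aggregated utility at the root;
# promising(partial_value, stats) decides whether a full evaluation is
# warranted. All three are illustrative placeholders.

def plan_search(n_plans, collect_full, collect_partial, promising):
    best_i, best_u = 0, collect_full(0)        # first plan: always full
    stats = {"max": best_u, "avg": best_u}     # aggregated on the first round
    for i in range(1, n_plans):
        if promising(collect_partial(i), stats):
            u = collect_full(i)                # one extra round for this plan
            if u > best_u:
                best_i, best_u = i, u
    return best_i, best_u                      # followed by one distribution

# Toy usage with made-up utilities: only plan 3 passes the promotion test.
utils = [0.5, 0.1, 0.2, 0.9]
print(plan_search(4,
                  collect_full=lambda i: utils[i],
                  collect_partial=lambda i: utils[i],       # idealized
                  promising=lambda v, s: v >= s["max"]))    # (3, 0.9)
```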
8.3 Aggregation of Local Evaluation
Given the above analysis, we consider an alternative that attempts to avoid the intractable communication: each agent applies partial evaluation to all of its local plans in a single batch. The results are then assembled through message passing to obtain the optimal joint plan. After local evaluation, agent i has a set Ei of fully evaluated local plans and a set Li of partially evaluated plans. From the analysis in Section 3, Ei contains the locally optimal plan at i.
Table 4. Utilities for an MAE team

Joint Plan   UA    UB    UC    UABC
P1           0.3   0.3   0.3   0.9
P2           0.6   0.1   0.1   0.8
P3           0.1   0.6   0.1   0.8
P4           0.1   0.1   0.6   0.8
Consider the selected joint plans in Table 4 for agents A, B and C. Each row corresponds to an evaluated joint plan. Each agent i evaluates expected utilities locally, as shown in column Ui. The overall expected utilities, given in column UABC, are the sums of the local values. Joint plan P2 is the best according to the evaluation by agent A, while P3 and P4 are the best according to B and C, respectively. All of them are inferior to P1. From this illustration, the following can be concluded: optimal planning cannot, in general, be obtained from independent local partial evaluations; it cannot be obtained from the sets Ei, nor from Li, nor from their combination.
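The pitfall can be verified directly from the numbers in Table 4: each agent's locally best plan differs from the jointly best one. A minimal check in Python (values taken from the table):

```python
# Each agent's locally best plan vs. the jointly best plan, using Table 4.
plans = {          # plan: (U_A, U_B, U_C)
    "P1": (0.3, 0.3, 0.3),
    "P2": (0.6, 0.1, 0.1),
    "P3": (0.1, 0.6, 0.1),
    "P4": (0.1, 0.1, 0.6),
}

local_best = [max(plans, key=lambda p: plans[p][i]) for i in range(3)]
joint_best = max(plans, key=lambda p: sum(plans[p]))
print(local_best, joint_best)   # ['P2', 'P3', 'P4'] P1
```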
9 Conclusion
The main contribution of this work is the method of partial evaluation for centralized planning in uncertain environments such as MAE. The key assumption on the environment is that each agent action has a distinguished intended outcome, and the probability of that outcome given the action is independent of (or approximately independent of) which action is chosen. This assumption appears valid for many problem domains where actions normally achieve their intended consequences and failures are rare. We devised simple criteria to divide the planning computation into full and partial evaluations, allowing only a small subset of alternative plans to be fully evaluated while maintaining optimal or approximately optimal planning. Significant efficiency gains are demonstrated in our experiments. Extending the method to distributed planning, on the other hand, has produced unexpected outcomes. Two very different schemes are analyzed. One evaluates individual plans distributively, which demands an intractable amount of agent communication. The other evaluates local plans in batch and assembles the joint plan distributively, but is unable to guarantee a globally optimal joint plan. These analyses reveal pitfalls in distributed planning and facilitate the development of more effective methods. As such, we are currently exploring other schemes of distributed planning that can benefit from partial evaluation.
Acknowledgements We acknowledge financial support from an NSERC Discovery Grant (Canada) to the first author and from an NSERC Postgraduate Scholarship to the second author.
References

1. Besse, C., Chaib-draa, B.: Parallel rollout for online solution of Dec-POMDPs. In: Proc. 21st Inter. Florida AI Research Society Conf., pp. 619–624 (2008)
2. Corona, G., Charpillet, F.: Distribution over beliefs for memory bounded Dec-POMDP planning. In: Proc. 26th Conf. on Uncertainty in AI, UAI 2010 (2010)
3. Kitano, H.: Robocup rescue: a grand challenge for multi-agent systems. In: Proc. 4th Int. Conf. on MultiAgent Systems, pp. 5–12 (2000)
4. Murphy, K.: A survey of POMDP solution techniques. Tech. rep., U.C. Berkeley (2000)
5. Oliehoek, F., Spaan, M., Whiteson, S., Vlassis, N.: Exploiting locality of interaction in factored Dec-POMDPs. In: Proc. 7th Inter. Conf. on Autonomous Agents and Multiagent Systems, pp. 517–524 (2008)
6. Ross, S., Pineau, J., Chaib-draa, B., Paquet, S.: Online planning algorithms for POMDPs. J. of AI Research 32, 663–704 (2008)
7. Xiang, Y., Chen, J., Havens, W.: Optimal design in collaborative design network. In: Proc. 4th Inter. Joint Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2005), pp. 241–248 (2005)
8. Xiang, Y., Hanshar, F.: Planning in multiagent expedition with collaborative design networks. In: Kobti, Z., Wu, D. (eds.) Canadian AI 2007. LNCS (LNAI), vol. 4509, pp. 526–538. Springer, Heidelberg (2007)