Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board

David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany
6594
Andrej Dobnikar, Uroš Lotrič, Branko Šter (Eds.)
Adaptive and Natural Computing Algorithms 10th International Conference, ICANNGA 2011 Ljubljana, Slovenia, April 14-16, 2011 Proceedings, Part II
Volume Editors

Andrej Dobnikar
Uroš Lotrič
Branko Šter
University of Ljubljana
Faculty of Computer and Information Science
Tržaška 25, 1000 Ljubljana, Slovenia
E-mail: {andrej.dobnikar, uros.lotric, branko.ster}@fri.uni-lj.si
ISSN 0302-9743 e-ISSN 1611-3349 ISBN 978-3-642-20266-7 e-ISBN 978-3-642-20267-4 DOI 10.1007/978-3-642-20267-4 Springer Heidelberg Dordrecht London New York Library of Congress Control Number: 2011923992 CR Subject Classification (1998): F.1-2, I.2.3, I.2, I.5, D.2.2, D.4.7, D.1 LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
© Springer-Verlag Berlin Heidelberg 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Preface
The 2011 edition of ICANNGA marked the 10th anniversary of the conference series, started in 1993 in Innsbruck, Austria, where it was decided to have a similar scientific meeting organized biennially. Since then, and with considerable success, the conference has taken place in Ales in France (1995), Norwich in the UK (1997), Portorož in Slovenia (1999), Prague in the Czech Republic (2001), Roanne in France (2003), Coimbra in Portugal (2005), Warsaw in Poland (2007), and Kuopio in Finland (2009), while this year, for the second time, in Slovenia, in its capital Ljubljana (2011).

The Faculty of Computer and Information Science of the University of Ljubljana was pleased and honored to host this conference. We chose the old university palace as the conference site in order to keep the traditionally good academic atmosphere of the meeting. It is located in the very centre of the capital and is surrounded by many cultural and touristic sights.

The ICANNGA conference was originally limited to neural networks and genetic algorithms, and was named after this primary orientation: International Conference on Artificial Neural Networks and Genetic Algorithms. Very soon the conference broadened its outlook and in Coimbra (2005) the same abbreviation got a new meaning: International Conference on Adaptive and Natural computiNG Algorithms. Thereby the popular short name remained and yet the conference is widely open to many new disciplines related to adaptive and natural algorithms.

This year we received 144 papers from 33 countries. After a peer-review process by at least two reviewers per paper, 83 papers were accepted and included in the proceedings. The papers were divided into seven groups: neural networks, evolutionary computation, pattern recognition, soft computing, system theory, support vector machines, and bio-informatics. The submissions were recommended for oral and for poster presentation.
The ICANNGA 2011 plenary lectures were planned to combine several compatible disciplines like adaptive computation (Rudolf Albrecht), artificial intelligence (Ivan Bratko), synthetic biology and biomolecular modelling of new biological systems (Roman Jerala), computational neurogenetic modelling (Nikola Kasabov), and robots with biological brains (Kevin Warwick). We believe these discussions served as an inspiration for future contributions. One of the traditions of all ICANNGA conferences so far has been to combine pleasantness and usefulness. The cultural and culinary traditions of the organizing country helped to create an atmosphere for a successful and friendly meeting. We would like to thank the Advisory Committee for their guidance, advice and discussions. Furthermore, we wish to express our gratitude to the Program Committee, the reviewers and sub-reviewers for their substantial work in revising
the papers. Our recognition also goes to Springer, our publisher, and especially to Alfred Hofmann, Editor-in-Chief of LNCS, for his support and collaboration. Many thanks go to the agency Go-mice and its representative Natalija Bah Čad for her help and effort. And last but not least, on behalf of the Organizing Committee of ICANNGA 2011, we want to express our special recognition to all the participants, who contributed enormously to the success of the conference. We hope that you will enjoy reading this volume and that you will find it inspiring and stimulating for your future work and research.

April 2011

Andrej Dobnikar
Uroš Lotrič
Branko Šter
Organization
ICANNGA 2011 was organized by the Faculty of Computer and Information Science, University of Ljubljana, Slovenia.
Advisory Committee

Rudolf Albrecht, University of Innsbruck, Austria
Bartlomiej Beliczynski, Warsaw University of Technology, Poland
Andrej Dobnikar, University of Ljubljana, Slovenia
Mikko Kolehmainen, University of Eastern Finland, Finland
Vera Kurkova, Academy of Sciences of the Czech Republic, Czech Republic
David Pearson, University Jean Monnet of Saint-Etienne, France
Bernardete Ribeiro, University of Coimbra, Portugal
Nigel Steele, Coventry University, UK
Program Committee

Andrej Dobnikar, Slovenia (Chair); Jarmo Alander, Finland; Rudolf Albrecht, Austria; Rubén Armañanzas, Spain; Bartlomiej Beliczynski, Poland; Ernesto Costa, Portugal; Janez Demšar, Slovenia; Antonio Dourado, Portugal; Stefan Figedy, Slovakia; Alexandru Floares, Romania; Juan A. Gomez-Pulido, Spain; Barbara Hammer, Germany; Honggui Han, China; Osamu Hoshino, Japan; Marcin Iwanowski, Poland; Martti Juhola, Finland; Paul C. Kainen, USA; Helen Karatza, Greece; Kostas D. Karatzas, Greece; Nikola Kasabov, New Zealand; Mikko Kolehmainen, Finland; Igor Kononenko, Slovenia; Jozef Korbicz, Poland; Vera Kurkova, Czech Republic; Kauko Leiviska, Finland; Aleš Leonardis, Slovenia; Uroš Lotrič, Slovenia; Danilo P. Mandic, UK; Francesco Masulli, Italy; Roman Neruda, Czech Republic; Stanislaw Osowski, Poland; David Pearson, France; Jan Peters, Germany; Bernardete B. Ribeiro, Portugal; Juan M. Sanchez-Perez, Spain; Catarina Silva, Portugal; Nigel Steele, UK; Branko Šter, Slovenia; Miroslaw Swiercz, Poland; Ryszard Tadeusiewicz, Poland; Tatiana Tambouratzis, Greece; Miguel A. Vega-Rodriguez, Spain; Kevin Warwick, UK; Blaž Zupan, Slovenia
Organizing Committee

Andrej Dobnikar, Uroš Lotrič, Branko Šter, Nejc Ilc, Davor Sluga, Jernej Zupanc, Natalija Bah Čad
Reviewers

Jarmo Alander, Rudolf Albrecht, Ana de Almeida, Mário Joao Antunes, Rubén Armañanzas, Iztok Lebar Bajec, Bartlomiej Beliczynski, Zoran Bosnić, Ernesto Costa, Janez Demšar, Andrej Dobnikar, Antonio Dourado, Stefan Figedy, Alexandru Floares, Juan A. Gomez-Pulido, Črtomir Gorup, Barbara Hammer, Honggui Han, Jorge Henriques, Osamu Hoshino, Marcin Iwanowski, Martti Juhola, Paul C. Kainen, Helen Karatza, Kostas D. Karatzas, Nikola Kasabov, Mikko Kolehmainen, Igor Kononenko, Jozef Korbicz, Vera Kurkova, Kauko Leiviska, Aleš Leonardis, Pedro Luis López-Cruz, Uroš Lotrič, Danilo P. Mandic, Francesco Masulli, Neža Mramor Kosta, Miha Mraz, Roman Neruda, Dominik Olszewski, Stanislaw Osowski, David Pearson, Jan Peters, Matija Polajnar, Mengyu Qiao, Bernardete B. Ribeiro, Marko Robnik Šikonja, Mauno Rönkkö, Gregor Rot, Aleksander Sadikov, Juan M. Sanchez-Perez, Catarina Silva, Danijel Skočaj, Nigel Steele, Miroslaw Swiercz, Miha Štajdohar, Branko Šter, Ryszard Tadeusiewicz, Tatiana Tambouratzis, Marko Toplak, Miguel A. Vega-Rodriguez, Alen Vrečko, Kevin Warwick, Blaž Zupan, Jure Žabkar, Lan Žagar, Jure Žbontar
Table of Contents – Part II
Pattern Recognition and Learning

Asymmetric k-Means Algorithm
    Dominik Olszewski
Gravitational Clustering of the Self-Organizing Map
    Nejc Ilc and Andrej Dobnikar
A General Method for Visualizing and Explaining Black-Box Regression Models
    Erik Štrumbelj and Igor Kononenko
An Experimental Study on Electrical Signature Identification of Non-Intrusive Load Monitoring (NILM) Systems
    Marisa B. Figueiredo, Ana de Almeida, and Bernardete Ribeiro
Evaluation of a Resource Allocating Network with Long Term Memory Using GPU
    Bernardete Ribeiro, Ricardo Quintas, and Noel Lopes
Gabor Descriptors for Aerial Image Classification
    Vladimir Risojević, Snježana Momić, and Zdenka Babić
Text Representation in Multi-label Classification: Two New Input Representations
    Rodrigo Alfaro and Héctor Allende
Fraud Detection in Telecommunications Using Kullback-Leibler Divergence and Latent Dirichlet Allocation
    Dominik Olszewski
Classification of EEG in a Steady State Visual Evoked Potential Based Brain Computer Interface Experiment
    Zafer İşcan, Özen Özkaya, and Zümray Dokur
Fast Projection Pursuit Based on Quality of Projected Clusters
    Marek Grochowski and Wlodzislaw Duch
A New N-gram Feature Extraction-Selection Method for Malicious Code
    Hamid Parvin, Behrouz Minaei, Hossein Karshenas, and Akram Beigi
A Robust Learning Model for Dealing with Missing Values in Many-Core Architectures
    Noel Lopes and Bernardete Ribeiro
A Model of Saliency-Based Selective Attention for Machine Vision Inspection Application
    Xiao-Feng Ding, Li-Zhong Xu, Xue-Wu Zhang, Fang Gong, Ai-Ye Shi, and Hui-Bin Wang
Grapheme-Phoneme Translator for Brazilian Portuguese
    Danilo Picagli Shibata and Ricardo Luis de Azevedo da Rocha
Soft Computing

Improvement of Inventory Control under Parametric Uncertainty and Constraints
    Nicholas Nechval, Konstantin Nechval, Maris Purgailis, and Uldis Rozevskis
Modified Jakubowski Shape Transducer for Detecting Osteophytes and Erosions in Finger Joints
    Marzena Bielecka, Andrzej Bielecki, Mariusz Korkosz, Marek Skomorowski, Wadim Wojciechowski, and Bartosz Zieliński
Using CMAC for Mobile Robot Motion Control
    Kristóf Gáti and Gábor Horváth
Optimizing the Robustness of Scale-Free Networks with Simulated Annealing
    Pierre Buesser, Fabio Daolio, and Marco Tomassini
Numerically Efficient Analytical MPC Algorithm Based on Fuzzy Hammerstein Models
    Piotr M. Marusak
Online Adaptation of Path Formation in UAV Search-and-Identify Missions
    Willem H. van Willigen, Martijn C. Schut, A.E. Eiben, and Leon J.H.M. Kester
Reconstruction of Causal Networks by Set Covering
    Nick Fyson, Tijl De Bie, and Nello Cristianini
The Noise Identification Method Based on Divergence Analysis in Ensemble Methods Context
    Ryszard Szupiluk, Piotr Wojewnik, and Tomasz Zabkowski
Efficient Predictive Control and Set–Point Optimization Based on a Single Fuzzy Model
    Piotr M. Marusak
Wind Turbines States Classification by a Fuzzy-ART Neural Network with a Stereographic Projection as a Signal Normalization
    Tomasz Barszcz, Marzena Bielecka, Andrzej Bielecki, and Mateusz Wójcik
Binding and Cross-Modal Learning in Markov Logic Networks
    Alen Vrečko, Danijel Skočaj, and Aleš Leonardis
Chaotic Exploration Generator for Evolutionary Reinforcement Learning Agents in Nondeterministic Environments
    Akram Beigi, Nasser Mozayani, and Hamid Parvin
Parallel Graph Transformations Supported by Replicated Complementary Graphs
    Leszek Kotulski and Adam Sędziwy
Diagnosis of Cardiac Arrhythmia Using Fuzzy Immune Approach
    Olgierd Unold
Systems Theory

Adaptive Finite Automaton: A New Algebraic Approach
    Reginaldo Inojosa Silva Filho and Ricardo Luis de Azevedo da Rocha
Cryptanalytic Attack on the Self-Shrinking Sequence Generator
    Maria Eugenia Pazo-Robles and Amparo Fúster-Sabater
About Nonnegative Matrix Factorization: On the posrank Approximation
    Ana de Almeida
Stability of Positive Fractional Continuous-Time Linear Systems with Delays
    Tadeusz Kaczorek
Output-Error Model Training for Gaussian Process Models
    Juš Kocijan and Dejan Petelin
Support Vector Machines

Learning Readers’ News Preferences with Support Vector Machines
    Elena Hensinger, Ilias Flaounas, and Nello Cristianini
Incorporating a Priori Knowledge from Detractor Points into Support Vector Classification
    Marcin Orchel
A Hybrid AIS-SVM Ensemble Approach for Text Classification
    Mário Antunes, Catarina Silva, Bernardete Ribeiro, and Manuel Correia
Regression Based on Support Vector Classification
    Marcin Orchel
Two One-Pass Algorithms for Data Stream Classification Using Approximate MEBs
    Ricardo Ñanculef, Héctor Allende, Stefano Lodi, and Claudio Sartori
Bioinformatics

X-ORCA - A Biologically Inspired Low-Cost Localization System
    Enrico Heinrich, Marian Lüder, Ralf Joost, and Ralf Salomon
On the Origin and Features of an Evolved Boolean Model for Subcellular Signal Transduction Systems
    Branko Šter, Monika Avbelj, Roman Jerala, and Andrej Dobnikar
Similarity of Transcription Profiles for Genes in Gene Sets
    Marko Toplak, Tomaž Curk, and Blaž Zupan

Author Index
Table of Contents – Part I
Plenary Session

Autonomous Discovery of Abstract Concepts by a Robot
    Ivan Bratko
Neural Networks

Kernel Networks with Fixed and Variable Widths
    Věra Kůrková and Paul C. Kainen
Evaluating Reliability of Single Classifications of Neural Networks
    Darko Pevec, Erik Štrumbelj, and Igor Kononenko
Nonlinear Predictive Control Based on Multivariable Neural Wiener Models
    Maciej Ławryńczuk
Methods of Integration of Ensemble of Neural Predictors of Time Series - Comparative Analysis
    Stanislaw Osowski and Krzysztof Siwek
A Rejection Option for the Multilayer Perceptron Using Hyperplanes
    Eduardo Gasca A., Sergio Saldaña T., José S. Sánchez G., Valentín Velásquez G., Eréndira Rendón L., Itzel M. Abundez B., Rosa M. Valdovinos R., and Rafael Cruz R.
Parallelization of Algorithms with Recurrent Neural Networks
    João Pedro Neto and Fernando Silva
Parallel Training of Artificial Neural Networks Using Multithreaded and Multicore CPUs
    Olena Schuessler and Diego Loyola
Supporting Diagnostics of Coronary Artery Disease with Neural Networks
    Matjaž Kukar and Ciril Grošelj
The Right Delay: Detecting Specific Spike Patterns with STDP and Axonal Conduction Delays
    Arvind Datadien, Pim Haselager, and Ida Sprinkhuizen-Kuyper
New Measure of Boolean Factor Analysis Quality
    Alexander A. Frolov, Dusan Husek, and Pavel Yu. Polyakov
Mechanisms of Adaptive Spatial Integration in a Neural Model of Cortical Motion Processing
    Stefan Ringbauer, Stephan Tschechne, and Heiko Neumann
Self-organized Short-Term Memory Mechanism in Spiking Neural Network
    Mikhail Kiselev
Approximation of Functions by Multivariable Hermite Basis: A Hybrid Method
    Bartlomiej Beliczynski
Using Pattern Recognition to Predict Driver Intent
    Firas Lethaus, Martin R.K. Baumann, Frank Köster, and Karsten Lemmer
Neural Networks Committee for Improvement of Metal’s Mechanical Properties Estimates
    Olga A. Mishulina, Igor A. Kruglov, and Murat B. Bakirov
Logarithmic Multiplier in Hardware Implementation of Neural Networks
    Uroš Lotrič and Patricio Bulić
Efficiently Explaining Decisions of Probabilistic RBF Classification Networks
    Marko Robnik-Šikonja, Aristidis Likas, Constantinos Constantinopoulos, Igor Kononenko, and Erik Štrumbelj
Evolving Sum and Composite Kernel Functions for Regularization Networks
    Petra Vidnerová and Roman Neruda
Optimisation of Concentrating Solar Thermal Power Plants with Neural Networks
    Pascal Richter, Erika Ábrahám, and Gabriel Morin
Emergence of Attention Focus in a Biologically-Based Bidirectionally-Connected Hierarchical Network
    Mohammad Saifullah and Rita Kovordányi
Visualizing Multidimensional Data through Multilayer Perceptron Maps
    Antonio Neme and Antonio Nido
Input Separability in Living Liquid State Machines
    Robert L. Ortman, Kumar Venayagamoorthy, and Steve M. Potter
Predictive Control of a Distillation Column Using a Control-Oriented Neural Model
    Maciej Ławryńczuk
Neural Prediction of Product Quality Based on Pilot Paper Machine Process Measurements
    Paavo Nieminen, Tommi Kärkkäinen, Kari Luostarinen, and Jukka Muhonen
A Robotic Scenario for Programmable Fixed-Weight Neural Networks Exhibiting Multiple Behaviors
    Guglielmo Montone, Francesco Donnarumma, and Roberto Prevete
Self-Organising Maps in Document Classification: A Comparison with Six Machine Learning Methods
    Jyri Saarikoski, Jorma Laurikkala, Kalervo Järvelin, and Martti Juhola
Analysis and Short-Term Forecasting of Highway Traffic Flow in Slovenia
    Primož Potočnik and Edvard Govekar
Evolutionary Computation

A New Method of EEG Classification for BCI with Feature Extraction Based on Higher Order Statistics of Wavelet Components and Selection with Genetic Algorithms
    Marcin Kolodziej, Andrzej Majkowski, and Remigiusz J. Rak
Regressor Survival Rate Estimation for Enhanced Crossover Configuration
    Alina Patelli and Lavinia Ferariu
A Study on Population’s Diversity for Dynamic Environments
    Anabela Simões, Rui Carvalho, João Campos, and Ernesto Costa
Effect of the Block Occupancy in GPGPU over the Performance of Particle Swarm Algorithm
    Miguel Cárdenas-Montes, Miguel A. Vega-Rodríguez, Juan José Rodríguez-Vázquez, and Antonio Gómez-Iglesias
Two Improvement Strategies for Logistic Dynamic Particle Swarm Optimization
    Qingjian Ni and Jianming Deng
Digital Watermarking Enhancement Using Wavelet Filter Parametrization
    Piotr Lipiński and Jan Stolarek
CellularDE: A Cellular Based Differential Evolution for Dynamic Optimization Problems
    Vahid Noroozi, Ali B. Hashemi, and Mohammad Reza Meybodi
Optimization of Topological Active Nets with Differential Evolution
    Jorge Novo, José Santos, and Manuel G. Penedo
Study on the Effects of Pseudorandom Generation Quality on the Performance of Differential Evolution
    Ville Tirronen, Sami Äyrämö, and Matthieu Weber
Sensitiveness of Evolutionary Algorithms to the Random Number Generator
    Miguel Cárdenas-Montes, Miguel A. Vega-Rodríguez, and Antonio Gómez-Iglesias
New Efficient Techniques for Dynamic Detection of Likely Invariants
    Saeed Parsa, Behrouz Minaei, Mojtaba Daryabari, and Hamid Parvin
Classification Ensemble by Genetic Algorithms
    Hamid Parvin, Behrouz Minaei, Akram Beigi, and Hoda Helmi
Simulated Evolution (SimE) Based Embedded System Synthesis Algorithm for Electric Circuit Units (ECUs)
    Umair F. Siddiqi, Yoichi Shiraishi, Mona A. El-Dahb, and Sadiq M. Sait
Taxi Pick-Ups Route Optimization Using Genetic Algorithms
    Jorge Nunes, Luís Matos, and António Trigo
Optimization of Gaussian Process Models with Evolutionary Algorithms
    Dejan Petelin, Bogdan Filipič, and Juš Kocijan

Author Index
Asymmetric k-Means Algorithm

Dominik Olszewski

Faculty of Electrical Engineering, Warsaw University of Technology, Poland
Abstract. In this paper, an asymmetric version of the k-means clustering algorithm is proposed. The asymmetry arises from the use of asymmetric dissimilarities in the k-means algorithm. The application of asymmetric measures of dissimilarity is motivated by the basic nature of the k-means algorithm, which uses dissimilarities in an asymmetric manner. Cluster centroids are treated as the dominance points governing the asymmetric relationships in the entire cluster analysis. The results of an experimental study on real data have shown the superiority of asymmetric dissimilarities employed in the k-means method over their symmetric counterparts.

Keywords: k-means clustering, Asymmetric dissimilarity, Signal recognition.

1 Introduction
The k-means clustering algorithm [1,2,3,4,5] is a well-known statistical data analysis tool used to form an arbitrarily chosen number of clusters in the analyzed data set. The algorithm aims to separate clusters of objects that are as similar as possible. An object represented as a vector of d features can be interpreted as a point in d-dimensional space. Hence, the k-means algorithm can be formulated as follows: given n points in d-dimensional space and the number k of desired clusters, the algorithm seeks a set of k clusters that minimizes the sum of squared dissimilarities between each point and its cluster centroid. The name “k-means” was introduced in [2]; the algorithm itself, however, was formulated by H. Steinhaus in [1].

The k-means algorithm forms clusters through repeated allocation of objects to the nearest clusters. The nearest cluster is the one with the minimal dissimilarity between its centroid and the object being allocated. Hence, the principal behavior of the algorithm is based on evaluating a dissimilarity between two distinct kinds of entity (an object vs. a cluster centroid). The Euclidean distance, which is most frequently used in k-means, does not, like any other symmetric measure, apply properly to evaluating a dissimilarity between a single object and a cluster centroid. We propose employing asymmetric dissimilarities in the k-means algorithm, since we claim that this is more consistent with the fundamental nature of the algorithm, i.e., it properly reflects the asymmetric relationship between a single object and a cluster centroid.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 1–10, 2011. © Springer-Verlag Berlin Heidelberg 2011
The application of asymmetric dissimilarities in data analysis has been extensively studied by A. Okada and T. Imaizumi [6,7,8]. Their work has concentrated on multidimensional scaling for analyzing one-mode two-way (object × object) and two-mode three-way (object × object × source) asymmetric proximities. They introduced the dominance point governing asymmetry in the proximity relationships among objects, represented as points in the multidimensional Euclidean space. They claim that ignoring or neglecting the asymmetry in proximity analysis discards potentially valuable information.

Our method can be regarded as an extension of this solution to the k-means clustering algorithm, where the centroids of the clusters are treated as the dominance points governing the repeated allocation of objects and, consequently, the whole clustering process. Therefore, the distinction between a centroid and a single object is that the centroid is a privileged entity acting as an “attractor” of objects in the analyzed data set. Our solution can also be interpreted as a generalization of Okada and Imaizumi’s idea to multidimensional non-Euclidean spaces associated with non-standard asymmetric dissimilarity measures, such as the Kullback-Leibler divergence. Finally, we wanted to confirm and build on their assertion that asymmetry does not have to be considered an inhibiting shortcoming; on the contrary, in certain areas of research it can be significantly beneficial.
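The allocation-and-update loop described in the Introduction can be sketched as follows. This is a minimal illustration under stated assumptions, not the method proposed in this paper: the function names `k_means`, `kl_divergence`, and `squared_euclidean` are hypothetical, the dissimilarity is called with the argument order (object, centroid), and centroids are updated as plain arithmetic means. The only point of the sketch is that the dissimilarity is a pluggable function, so a symmetric measure can be swapped for an asymmetric one without changing the loop.

```python
import numpy as np


def squared_euclidean(x, c):
    """Symmetric baseline: squared Euclidean distance between x and c."""
    x, c = np.asarray(x, dtype=float), np.asarray(c, dtype=float)
    return float(np.sum((x - c) ** 2))


def kl_divergence(x, c, eps=1e-12):
    """Asymmetric example: discrete Kullback-Leibler divergence (base 2).

    eps is an assumption of this sketch; it only guards against log(0).
    """
    x = np.asarray(x, dtype=float) + eps
    c = np.asarray(c, dtype=float) + eps
    return float(np.sum(x * np.log2(x / c)))


def k_means(points, k, dissimilarity, n_iter=100, seed=0):
    """k-means with a pluggable (possibly asymmetric) dissimilarity."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Initialise the centroids with k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = np.full(len(points), -1)
    for _ in range(n_iter):
        # Allocation step: each object goes to the centroid with the
        # smallest dissimilarity. For an asymmetric measure the argument
        # order (object, centroid) matters; this order is an assumption.
        new_labels = np.array([
            int(np.argmin([dissimilarity(x, c) for c in centroids]))
            for x in points
        ])
        if np.array_equal(new_labels, labels):
            break  # allocations are stable, so the centroids are too
        labels = new_labels
        # Update step: arithmetic mean of the members (an assumption).
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```

Calling `k_means(histograms, 2, kl_divergence)` and `k_means(histograms, 2, squared_euclidean)` on the same rows of normalized histograms runs the identical loop with an asymmetric and a symmetric measure, respectively.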
2 Dissimilarities
In this section, we briefly present six dissimilarity measures. Three of them are symmetric (Hellinger distance, total variation distance, and Euclidean distance), one is asymmetric (Kullback-Leibler divergence), and two are either symmetric or asymmetric, depending on the values of their parameters (Chernoff distance and Lissack-Fu distance). Some of these measures are metrics (they satisfy all metric conditions) and some are not, but the latter still have interesting properties. We wanted to compare the usefulness of symmetric and asymmetric dissimilarities employed in the k-means algorithm, in order to verify our assertion that asymmetric measures are more suitable for this algorithm.

Throughout this section, we will use the following notation. Let P and Q denote two probability measures on a measurable space Ω with σ-algebra F. Let λ be a measure on (Ω, F) such that P and Q are absolutely continuous with respect to λ, with corresponding probability density functions p and q. All definitions presented in this section are independent of the choice of measure λ.

2.1 Symmetric Dissimilarities
Hellinger Distance

Definition 1. The Hellinger distance between P and Q on a continuous measurable space (Ω, F) is defined as

\[ d_H(P, Q) \overset{\mathrm{def}}{=} \left( \frac{1}{2} \int_\Omega \left( \sqrt{p} - \sqrt{q} \right)^2 d\lambda \right)^{1/2}. \tag{1} \]

In some papers, the factor of 1/2 in Definition 1 is omitted. We adopt the definition containing this factor, as it normalizes the range of values taken by this dissimilarity. Some sources define the Hellinger distance as the square of d_H. The Hellinger distance defined by formula (1) is a metric, while d_H^2 is not a metric, since it does not satisfy the triangle inequality.

Total Variation Distance

Definition 2. The total variation distance between P and Q on a continuous measurable space (Ω, F) is defined as

\[ d_{TV}(P, Q) \overset{\mathrm{def}}{=} \max_{|h| \le 1} \left| \int_\Omega h \, dP - \int_\Omega h \, dQ \right| = \int_\Omega |p - q| \, d\lambda, \tag{2} \]

where h: Ω → R satisfies |h(x)| ≤ 1. Total variation distance is a metric that takes values in the interval [0, 2]. This dissimilarity is often called the L1-norm of P − Q and is denoted by ‖P − Q‖_1.

Euclidean Distance. This measure is used to determine the distance between two points in the Euclidean space.

Definition 3. The Euclidean distance between points p = (p_1, p_2, ..., p_N) and q = (q_1, q_2, ..., q_N) in the N-dimensional Euclidean space is defined as

\[ d_E(p, q) \overset{\mathrm{def}}{=} \sqrt{\sum_{i=1}^{N} (p_i - q_i)^2}. \tag{3} \]
The Euclidean distance is a metric that takes values in the interval [0, ∞). It can be interpreted as a generalization of the distance between two points in the plane, i.e., in the 2-dimensional Euclidean space, which can be derived from the Pythagorean theorem.

2.2 Asymmetric Dissimilarity
Kullback-Leibler Divergence (Relative Entropy)

Definition 4. The Kullback-Leibler divergence between P and Q on a continuous measurable space (Ω, F) is defined as

$$d_{KL}(P, Q) \overset{\mathrm{def}}{=} \int_\Omega p \log_2 \frac{p}{q} \, d\lambda. \tag{4}$$

By convention, the value of $0 \log \frac{0}{q}$ is assumed to be 0 for all real q, and the value of $p \log \frac{p}{0}$ is assumed to be ∞ for all real non-zero p. Therefore, relative entropy takes values in the interval [0, ∞]. The Kullback-Leibler divergence is not a metric, since it is not symmetric and does not satisfy the triangle inequality. However, it has many useful properties, including additivity over marginals of product measures.
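For finite (discrete) distributions, the integrals in Definitions 1, 2 and 4 reduce to sums. The following Python sketch illustrates this under the assumption that p and q are equal-length lists of probabilities; the function names are hypothetical and chosen only for this example. It is meant to show the symmetry of the Hellinger and total variation distances next to the asymmetry of the Kullback-Leibler divergence, not to reproduce the implementation used in the experiments.

```python
import math

def hellinger(p, q):
    """Discrete analogue of Eq. (1): sqrt(1/2 * sum (sqrt(p_i) - sqrt(q_i))^2)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

def total_variation(p, q):
    """Discrete analogue of Eq. (2): sum |p_i - q_i|, with values in [0, 2]."""
    return sum(abs(a - b) for a, b in zip(p, q))

def kl_divergence(p, q):
    """Discrete analogue of Eq. (4), using the conventions stated after it:
    0 * log(0 / q) = 0 and p * log(p / 0) = infinity for p > 0."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0.0:
            continue            # 0 * log(0 / q) contributes nothing
        if b == 0.0:
            return math.inf     # p * log(p / 0) is infinite
        total += a * math.log2(a / b)
    return total

# Illustrative distributions (assumed values, not taken from the experiments).
p = [0.5, 0.4, 0.1]
q = [0.2, 0.3, 0.5]
print(hellinger(p, q), hellinger(q, p))          # equal: symmetric
print(kl_divergence(p, q), kl_divergence(q, p))  # differ: asymmetric
```

Swapping the arguments leaves the first two measures unchanged, while the two Kullback-Leibler values differ, which is the property the asymmetric k-means algorithm relies on.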
2.3 Parametrized Dissimilarities
In this subsection, we present two dissimilarities whose definitions involve parameters. Depending on the parameter values, these dissimilarities can be either symmetric or asymmetric. This property is very convenient for the purpose of this paper, since it allows us to investigate how symmetrizing and asymmetrizing the same dissimilarity influences the final results of clustering.

Chernoff Distance

Definition 5. The Chernoff distance between P and Q on a continuous measurable space (Ω, F) is defined as

$$d_{Ch}(P, Q) \overset{\mathrm{def}}{=} -\log_2 \int_\Omega p^{\alpha} q^{1-\alpha} \, d\lambda, \tag{5}$$

where 0 < α < 1. Depending on the choice of the parameter α, the Chernoff distance can be either a symmetric or an asymmetric measure. For α = 0.5 it is symmetric, and for all other values of this parameter it does not satisfy the symmetry condition. We have chosen α = 0.1 and α = 0.9 in order to obtain the asymmetric dissimilarity measure, while α = 0.5 resulted in a symmetric dissimilarity.

Lissack-Fu Distance

Definition 6. The Lissack-Fu distance between P and Q on a continuous measurable space (Ω, F) is defined as

$$d_{LF}(P, Q) \overset{\mathrm{def}}{=} \int_\Omega \frac{|p P_a - q P_b|^{\alpha}}{|p P_a + q P_b|^{\alpha - 1}} \, d\lambda, \tag{6}$$

where 0 ≤ α ≤ ∞. Changing the values of the parameters $P_a$ and $P_b$ makes it possible to obtain either a symmetric or an asymmetric dissimilarity. For $P_a = P_b$ one has a symmetric measure, and for $P_a \ne P_b$ the measure becomes asymmetric. The value of the parameter α does not affect the symmetry property of the dissimilarity. Therefore, in our experiments, we have fixed α = 0.5.
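The effect of the parameters on symmetry can be checked numerically. The sketch below is a discrete analogue of Definitions 5 and 6 for distributions given as lists of probabilities; the function names, the sample distributions, and the handling of terms whose denominator vanishes are assumptions made for illustration only.

```python
import math

def chernoff(p, q, alpha):
    """Discrete analogue of Eq. (5): -log2(sum p_i^alpha * q_i^(1 - alpha))."""
    s = sum((a ** alpha) * (b ** (1.0 - alpha)) for a, b in zip(p, q))
    return -math.log2(s) if s > 0.0 else math.inf

def lissack_fu(p, q, pa, pb, alpha):
    """Discrete analogue of Eq. (6).

    Each term |p*Pa - q*Pb|^alpha / |p*Pa + q*Pb|^(alpha - 1) is evaluated as
    |p*Pa - q*Pb|^alpha * |p*Pa + q*Pb|^(1 - alpha); terms with p*Pa + q*Pb = 0
    are skipped (an assumption of this sketch).
    """
    total = 0.0
    for a, b in zip(p, q):
        diff = abs(a * pa - b * pb)
        summ = abs(a * pa + b * pb)
        if summ == 0.0:
            continue
        total += (diff ** alpha) * (summ ** (1.0 - alpha))
    return total

# Illustrative distributions (assumed values, not taken from the experiments).
p = [0.5, 0.4, 0.1]
q = [0.2, 0.3, 0.5]
print(chernoff(p, q, 0.5), chernoff(q, p, 0.5))   # symmetric for alpha = 0.5
print(chernoff(p, q, 0.9), chernoff(q, p, 0.9))   # asymmetric for alpha = 0.9
print(lissack_fu(p, q, 1.0, 1.0, 0.5), lissack_fu(q, p, 1.0, 1.0, 0.5))
print(lissack_fu(p, q, 0.5, 1.0, 0.5), lissack_fu(q, p, 0.5, 1.0, 0.5))
```

Note that swapping the arguments of the Chernoff distance is equivalent to replacing α by 1 − α, which is why α = 0.1 and α = 0.9 give the two directions of the same asymmetric measure.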
3 Asymmetric k-Means Clustering
The asymmetric k-means algorithm starts from a random choice of k objects from the entire data space. These objects are used to form the initial clusters, each containing one object. The algorithm then consists of two alternating steps:

Step 1. Forming the clusters: The algorithm iterates over the entire data set and allocates each object to the cluster represented by the centroid nearest to this object. The nearest centroid is determined with a chosen asymmetric dissimilarity measure. Therefore, for each object in the analyzed data set, the following minimal asymmetric dissimilarity has to be found:

$$\min_i \, d_{ASYM}(FE_{new}, FE_{c_i}), \tag{7}$$
where $d_{ASYM}$ is the chosen asymmetric dissimilarity measure, $FE_{new}$ is the vector of features of a given object in the analyzed data set, and $FE_{c_i}$ is the vector of features of the i-th cluster centroid, i = 1, ..., k. This process can be presented with the following pseudocode:

    for x ∈ X do
        min ← MAX_VALUE
        for c ∈ centroids do
            if min > dASYM(x, c) then
                min ← dASYM(x, c)
                x temporarily belongs to cluster cluster(c)
            end if
        end for
    end for

After the execution of this pseudocode, each object x from the entire data set X is allocated to the cluster represented by the centroid nearest to this object. The centroids variable stores the set of all current centroids, cluster(c) denotes the cluster with centroid c, min is an auxiliary variable, and MAX_VALUE is the maximal value of the min variable.

Step 2. Finding centroids for the clusters: For each cluster, a centroid is determined on the basis of the objects belonging to this cluster. The algorithm calculates the centroids of the clusters so as to minimize a formal objective function, the mean-squared-error (MSE) distortion:

$$MSE(X_j) = \sum_{i=1}^{n_j} d_{ASYM}^2(x_i, c_j), \tag{8}$$
where $X_j$, j = 1, ..., k, is the j-th cluster; $x_i$, i = 1, ..., $n_j$, are the objects in the j-th cluster; $n_j$, j = 1, ..., k, is the number of objects in the j-th cluster; $c_j$, j = 1, ..., k, is the centroid of the j-th cluster; k is the number of clusters; and $d_{ASYM}(a, b)$ is a chosen asymmetric dissimilarity measure.

Both steps must be carried out with the same dissimilarity measure in order to guarantee the monotone property of the k-means algorithm. Steps 1 and 2 have to be repeated until the termination condition is met. The termination condition might be either reaching convergence of the iterative application of the objective function (9), or reaching the pre-defined number of cycles. After each cycle (Steps 1 and 2), the value of the following mean-squared-error objective function needs to be computed in order to track the convergence of the whole clustering process:

$$MSE(X) = \sum_{j=1}^{k} \sum_{i=1}^{n_j} d_{ASYM}^2(x_i, c_j), \tag{9}$$
where X is the analyzed set of objects, and the rest of the notation is described in (8).

A serious problem concerning both the traditional k-means algorithm (i.e., using symmetric dissimilarities) and the asymmetric k-means version proposed in this paper is that the clustering process may not converge to an optimal, or near-optimal, configuration. The algorithm can assure only local optimality, which depends on the initial locations of the objects. An exhaustive study of the asymptotic behavior of the k-means algorithm is conducted in [2].

3.1 Minimization Technique Employed for Finding Centroids of the Clusters
The minimization technique we have employed is the traditional complete search method. The variables space was, in our study, the feature space, i.e., the search was conducted in the feature space. For numerical simplicity and speed, we have limited the variables space to the points corresponding to the current members of the specific cluster. This means that the search process was carried out in the set of current members of the considered cluster. This kind of approach is sometimes referred to as the k-medoids algorithm. We justify this simplification by the negligible performance decrease in the case of clusters with a large number of objects. The objective function (8) was the criterion of the minimization process. Therefore, the minimization technique we have used can be presented with the following pseudocode:

    min ← MAX_VALUE
    sum ← 0
    for i ∈ cluster do
        for j ∈ cluster do
            sum ← sum + dASYM(i, j)
        end for
        if min > sum then
            min ← sum
            centroid ← i
        end if
        sum ← 0
    end for

After the execution of this pseudocode, the centroid variable stores the coordinates of the centroid of the given cluster. The function dASYM(i, j) is a chosen asymmetric dissimilarity measure, while min and sum are auxiliary variables. The cluster variable represents the specific cluster for which a centroid is being computed, and MAX_VALUE is the maximal value of the min variable.
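The two alternating steps and the complete search over cluster members can be put together in a short Python sketch. It is an illustration under stated assumptions rather than the implementation used in the experiments: objects are lists of probabilities, the discrete Kullback-Leibler divergence plays the role of $d_{ASYM}$, candidates are scored with the squared dissimilarities of Eq. (8) taken in the direction d(object, centroid) (the pseudocode above accumulates unsquared values of dASYM(i, j)), an emptied cluster keeps its previous centroid, and all function names are hypothetical.

```python
import math
import random

def kl(p, q):
    """Discrete Kullback-Leibler divergence (base 2), used here as d_ASYM."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0.0:
            continue
        if b == 0.0:
            return math.inf
        total += a * math.log2(a / b)
    return total

def assign(objects, centroids, d):
    """Step 1: index of the nearest centroid for each object, by d(object, centroid)."""
    return [min(range(len(centroids)), key=lambda i: d(x, centroids[i]))
            for x in objects]

def find_centroid(cluster, d):
    """Step 2: complete search over the cluster members for the minimiser of Eq. (8)."""
    return min(cluster, key=lambda c: sum(d(x, c) ** 2 for x in cluster))

def objective(objects, labels, centroids, d):
    """Eq. (9): sum of squared dissimilarities of all objects to their centroids."""
    return sum(d(x, centroids[j]) ** 2 for x, j in zip(objects, labels))

def asymmetric_kmeans(objects, k, d, max_cycles=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(objects, k)       # k randomly chosen objects
    previous = math.inf
    for _ in range(max_cycles):
        labels = assign(objects, centroids, d)
        for j in range(k):
            members = [x for x, lab in zip(objects, labels) if lab == j]
            if members:                      # assumption: keep old centroid if empty
                centroids[j] = find_centroid(members, d)
        current = objective(objects, labels, centroids, d)
        if current >= previous:              # objective (9) stopped decreasing
            break
        previous = current
    return assign(objects, centroids, d), centroids

# Illustrative data (assumed values): two groups of discrete distributions.
objects = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.75, 0.15, 0.1],
           [0.1, 0.1, 0.8], [0.1, 0.2, 0.7], [0.15, 0.1, 0.75]]
labels, centroids = asymmetric_kmeans(objects, 2, kl)
print(labels, centroids)
```

Passing a different callable as d, for example the same divergence with swapped arguments, changes the direction of asymmetry without touching the rest of the procedure.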
4 Experiments
We have tested the performance of the discussed improved k-means clustering algorithm by carrying out experiments on real data in the field of signal recognition, i.e., piano music composer clustering and human heart rhythm clustering. Human heart rhythms are represented with ECG recordings derived from the MIT-BIH ECG Databases [9]. We have employed the different symmetric, asymmetric, and parametrized dissimilarities presented in Section 2 in order to evaluate their effectiveness in cooperation with the discussed k-means algorithm. In this way, we verify the main assertion of this paper, namely that asymmetric dissimilarities are more suitable for the k-means algorithm.

4.1 Piano Music Composer Clustering
In this part of our experiments, we tested our enhancement to the k-means algorithm by forming three clusters representing three piano music composers: Johann Sebastian Bach, Ludwig van Beethoven, and Fryderyk Chopin. The numbers of music pieces belonging to each of these composers are given in Table 1. Each music piece was represented by a 20-second sound signal sampled at a frequency of 44100 Hz. The entire data set was composed of 32 sound signals. The feature extraction process was carried out according to the traditional Discrete-Fourier-Transform-based (DFT-based) method. The DFT was implemented with the fast Fourier transform (FFT) algorithm. Sampling the signals at 44100 Hz resulted in an upper boundary of 44100/2 Hz for the FFT result range.

The results of this part of our experiments are gathered in Table 1, which presents the accuracy degree of clustering with k-means cooperating with asymmetric dissimilarities and with their symmetric counterparts. The numbers 1 and 2 given with each asymmetric dissimilarity denote this dissimilarity computed in two different directions, i.e., $d_{ASYM\,1} = d_{ASYM}(p, q)$ and $d_{ASYM\,2} = d_{ASYM}(q, p)$. The asymmetric Chernoff distance was obtained with the parameter α = 0.9, while the symmetric Chernoff distance was obtained with α = 0.5. The asymmetric Lissack-Fu distance, in turn, was obtained with the parameters $P_a$ = 0.5 and $P_b$ = 1.0, while the symmetric form of this quantity was obtained with $P_a$ = 1.0 and $P_b$ = 1.0. The accuracies were calculated on the basis of the following accuracy degree:

$$a_i = \frac{x_{i,\max}}{N_i}, \tag{10}$$

where $a_i$, i = 1, 2, 3, is the accuracy degree for the i-th composer; $x_{i,\max}$, i = 1, 2, 3, is the maximal number of music pieces of the i-th composer in any of the clusters; and $N_i$, i = 1, 2, 3, is the total number of music pieces of the i-th composer. Once the accuracy degree for the i-th composer is calculated, the corresponding cluster is not considered in the calculation of the accuracy degrees for the remaining composers.
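The accuracy degree of Eq. (10), including the exclusion of an already matched cluster, can be sketched as follows. The sketch assumes that the clustering result is summarized as a table counts[i][c] holding the number of objects of class i placed in cluster c, that classes are processed in the order of their index, and that there are at least as many clusters as classes; the function names and the sample counts are hypothetical and do not come from the experiments.

```python
def accuracy_degrees(counts):
    """Accuracy degree a_i = x_imax / N_i for each class, following Eq. (10).

    counts[i][c] is the number of objects of class i allocated to cluster c.
    A cluster matched to one class is excluded for the remaining classes.
    """
    used = set()
    degrees = []
    for row in counts:
        free = [c for c in range(len(row)) if c not in used]
        best = max(free, key=lambda c: row[c])   # cluster holding most objects of this class
        used.add(best)
        degrees.append(row[best] / sum(row))
    return degrees

def average_accuracy(degrees):
    """Arithmetic average of the per-class accuracy degrees."""
    return sum(degrees) / len(degrees)

# Hypothetical counts: 3 classes (rows) by 3 clusters (columns).
counts = [[4, 1, 0],
          [1, 3, 1],
          [0, 2, 3]]
degrees = accuracy_degrees(counts)
print(degrees, average_accuracy(degrees))
```

Because matched clusters are excluded greedily, the result can depend on the order in which the classes are processed; the source does not state this order, so the index order used here is only one possible reading.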
Table 1. Accuracies of piano music composer clustering

Dissimilarity                       Bach    Beethoven   Chopin   Average Accuracy
Number of Signals                   11      12          9
Kullback-Leibler Divergence 1       0.818   0.750       0.778    0.781
Kullback-Leibler Divergence 2       0.727   0.667       0.778    0.719
Asymmetric Chernoff Distance 1      0.818   0.750       0.778    0.781
Symmetric Chernoff Distance         0.727   0.750       0.778    0.750
Asymmetric Chernoff Distance 2      0.727   0.667       0.778    0.719
Asymmetric Lissack-Fu Distance 1    0.818   0.750       0.778    0.781
Symmetric Lissack-Fu Distance       0.727   0.750       0.778    0.750
Asymmetric Lissack-Fu Distance 2    0.727   0.667       0.778    0.719
Hellinger Distance                  0.727   0.750       0.778    0.750
Total Variation Distance            0.636   0.750       0.778    0.719
Euclidean Distance                  0.818   0.583       0.556    0.656
Each row with the accuracy entries ends with the average accuracy degree, estimating the quality of each clustering approach. It is the arithmetic average of the three accuracy degrees associated with the three composers:

$$a_{average} = \frac{a_1 + a_2 + a_3}{3}. \tag{11}$$
We used this average accuracy degree as the basis of the comparison between the investigated approaches. Table 1 shows that clustering with the k-means algorithm and asymmetric dissimilarities gave better results than clustering with the symmetric dissimilarities. It is worth noting that the clustering performance strongly depends on the direction of asymmetry in the case of asymmetric measures, i.e., on whether we consider $d_{ASYM}(p, q)$ or $d_{ASYM}(q, p)$. Asymmetric dissimilarities therefore outperform their symmetric competitors if the right direction of asymmetry is chosen. Otherwise, they produce worse results. This observation is not surprising: if the k-means algorithm operates in an asymmetric manner, then the asymmetric dissimilarities should be applied in the direction consistent with the direction of asymmetry of the algorithm itself. How to determine this direction prior to clustering remains an open question, since we do not provide any procedure for finding it in this paper. It may depend on the asymmetry in the analyzed data. The simplest way to determine this direction is on the basis of the final results of clustering, i.e., by checking which direction corresponds to the higher clustering performance. However, leaving these considerations aside and assuming that the proper direction of asymmetry is chosen, the experimental results confirmed that asymmetric dissimilarities employed in the k-means algorithm are superior to the symmetric measures cooperating with this algorithm, which is the main assertion of this paper.

4.2 Human Heart Rhythm Clustering
In this part of our experiments, we investigated our algorithm by forming three clusters representing three types of human heart rhythms: normal sinus rhythm, atrial arrhythmia, and ventricular arrhythmia. This kind of clustering can be viewed as cardiac arrhythmia detection and recognition based on ECG recordings. In general, cardiac arrhythmias may be classified either by rate (tachycardias, where the heart beat is too fast, and bradycardias, where the heart beat is too slow) or by site of origin (atrial arrhythmias, which begin in the atria, and ventricular arrhythmias, which begin in the ventricles). Our clustering recognizes the normal rhythm as well as arrhythmias originating in the atria and in the ventricles. We analyzed 20-minute ECG Holter recordings sampled at a frequency of 250 Hz. The entire data set was composed of 63 ECG signals. The numbers of recordings belonging to each rhythm type are given in Table 2. The feature extraction was carried out in the same way as for the piano music composer clustering.

The results of this part of our experiments are gathered in Table 2, which is constructed in the same way as Table 1. The accuracy degrees and average accuracy degrees are also calculated in the same way as in the previous subsection (formulae (10) and (11), respectively), with the only difference that instead of composers we consider three types of human heart rhythms.

Table 2. Accuracies of human heart rhythm clustering
Dissimilarity                       Normal Rhythm   Atrial Arrhythmia   Ventricular Arrhythmia   Average Accuracy
Number of Signals                   18              23                  22
Kullback-Leibler Divergence 1       0.944           0.783               0.773                    0.825
Kullback-Leibler Divergence 2       0.944           0.826               0.636                    0.794
Asymmetric Chernoff Distance 1      0.944           0.826               0.773                    0.841
Symmetric Chernoff Distance         0.944           0.826               0.727                    0.825
Asymmetric Chernoff Distance 2      0.944           0.826               0.636                    0.794
Asymmetric Lissack-Fu Distance 1    0.944           0.826               0.773                    0.841
Symmetric Lissack-Fu Distance       0.944           0.826               0.727                    0.825
Asymmetric Lissack-Fu Distance 2    1.000           0.826               0.636                    0.810
Hellinger Distance                  0.944           0.783               0.727                    0.810
Total Variation Distance            0.944           0.826               0.682                    0.810
Euclidean Distance                  0.833           0.739               0.636                    0.730
Table 2 shows results very similar to those of the previous part of our experiments. Most interestingly, the same effect of the direction of asymmetry can be observed in the case of asymmetric dissimilarities. In one of the directions of asymmetry (we call it the “correct” direction), the asymmetric dissimilarities outperform the symmetric ones, while in the other direction (the “incorrect” direction), they provide lower clustering performance.
5 Summary
This paper presented an improvement to the k-means clustering algorithm. We proposed the application of asymmetric dissimilarities in this algorithm, as they are more consistent with the behavior of the algorithm than the most commonly employed symmetric dissimilarities, e.g., the Euclidean distance. We claim that asymmetric measures are more suitable for the k-means technique, because it evaluates the dissimilarity between two distinct kinds of entities (object vs. cluster centroid). Consequently, we wanted to assert that asymmetric dissimilarities can, in certain areas of research, be regarded as superior to their symmetric counterparts, contrary to the frequent opinion that regards them as mathematically inconvenient quantities.
References

1. Steinhaus, H.: Sur la Division des Corps Matériels en Parties. Bulletin de l’Académie Polonaise des Sciences, Cl. III 4(12), 801–804 (1956)
2. MacQueen, J.: Some Methods for Classification and Analysis of Multivariate Observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297 (1967)
3. Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: An Efficient k-Means Clustering Algorithm: Analysis and Implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7), 881–892 (2002)
4. Biau, G., Devroye, L., Lugosi, G.: On the Performance of Clustering in Hilbert Spaces. IEEE Transactions on Information Theory 54(2), 781–790 (2008)
5. Olszewski, D., Kolodziej, M., Twardy, M.: A Probabilistic Component for K-Means Algorithm and its Application to Sound Recognition. Przeglad Elektrotechniczny 86(6), 185–190 (2010)
6. Okada, A., Imaizumi, T.: Asymmetric Multidimensional Scaling of Two-Mode Three-Way Proximities. Journal of Classification 14(2), 195–224 (1997)
7. Okada, A.: An Asymmetric Cluster Analysis Study of Car Switching Data. In: Data Analysis. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Heidelberg (2000)
8. Okada, A., Imaizumi, T.: Multidimensional Scaling of Asymmetric Proximities with a Dominance Point. In: Advances in Data Analysis. Studies in Classification, Data Analysis, and Knowledge Organization, pp. 307–318. Springer, Heidelberg (2007)
9. Goldberger, A.L., Amaral, L.A.N., Glass, L., Hausdorff, J.M., Ivanov, P.C., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.K., Stanley, H.E.: PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation 101(23), e215–e220 (2000), Circulation Electronic Pages: http://circ.ahajournals.org/cgi/content/full/101/23/e215
Gravitational Clustering of the Self-Organizing Map

Nejc Ilc and Andrej Dobnikar

Faculty of Computer and Information Science, University of Ljubljana, Tržaška cesta 25, 1000 Ljubljana, Slovenia
{nejc.ilc,andrej.dobnikar}@fri.uni-lj.si

Abstract. Data clustering is the fundamental data analysis method, widely used for solving problems in the field of machine learning. Numerous clustering algorithms exist, based on various theories and approaches, one of them being the well-known Kohonen’s self-organizing map (SOM). Unfortunately, after training the SOM there is no explicitly obtained information about clusters in the underlying data, so another technique for grouping SOM units has to be applied afterwards. In this paper, a contribution towards clustering of the SOM is presented, employing principles of Gravitational Law. On the first level of the proposed algorithm, SOM is trained on the input data and prototypes are extracted. On the second level, each prototype acts as a unit-mass point in a feature space, in which presence of gravitational force is simulated, exploiting information about connectivity gained on the first level. The proposed approach is capable of discovering complex cluster shapes, not only limited to the spherical ones, and is able to automatically determine the number of clusters. Experiments with synthetic and real data are conducted to show performance of the presented method in comparison with other clustering techniques.

Keywords: clustering, self-organizing map, gravitational clustering, data analysis, two-level approach.
1 Introduction
Clustering is an unsupervised process of organizing data into natural groups or clusters, such that objects or data points assigned to the same cluster have high similarity, whereas the similarity between objects assigned to different clusters is low [1]. Clustering techniques have been widely used in the fields of data mining, feature extraction, function approximation, image segmentation, and others [2]. Kohonen’s self-organizing map (SOM) is one of the more successful neural network approaches to clustering and has been applied to a broad range of problems in the previously mentioned fields [3]. The SOM is not only a clustering method; it is also a popular data exploration and visualization tool, since it is capable of mapping a d-dimensional input space to an m-dimensional output space, where m ≪ d and usually m = 2 or m = 3. The SOM consists of a set of neurons arranged in a 2- or 3-dimensional structure, usually in a rectangular or hexagonal grid with a defined neighborhood. Through unsupervised training, the SOM folds and fits onto the input data points, preserving their density and topology. Such a trained map of neurons can be used as a powerful visualization surface, convenient for the detection of interesting regions or clusters. The number of neurons in the SOM is usually much greater than the number of clusters in the underlying data. Hence, the main problem is to find a meaningful grouping of neurons and, as a consequence, to obtain a good insight into the structure of the data.

In the past, different attempts were made towards the clustering of the SOM. In [4], the SOM is clustered using two methods: k-means and a hierarchical agglomerative clustering algorithm. In both cases, a significant reduction of running time is shown compared to direct clustering of the data. However, clustering quality is not improved, as this was not the purpose of the research. The opposite is the case in [5], where superior clustering accuracy is achieved using maps with a huge number of neurons. Consequently, the increased time complexity of the algorithm has to be taken into account. Another interesting approach, which is able to automatically determine the number of clusters, is proposed in [6]. It employs a recursive flooding algorithm for the detection of clusters in the SOM. However, the results of the experiments are not convincing: in a comparison, simple k-means clustering of the SOM performs better on average.

This paper presents a new method for clustering the SOM using a gravitational approach, which assumes that every point in the data set can be viewed as a mass particle. If a gravitational force between points exists, they begin to move towards each other with respect to mass and distance, thus producing clusters. This nature-inspired notion was first used in the algorithm proposed by Wright [7] and recently extended in [8]. In our proposed algorithm, which is called GSOM, the basic idea from the latter is used and integrated with the SOM, considering the connections between neurons. The goal of our research is to develop an efficient clustering method capable of dealing with arbitrarily shaped clusters, where the number of clusters has to be determined automatically, without user interaction. At the moment, GSOM can handle only numeric data due to the limitations of the implemented SOM algorithm.

The rest of the paper is organized as follows. In Section 2, the proposed algorithm GSOM is described. Section 3 presents the performance of GSOM in comparison with three other clustering algorithms over nine synthetic and real data sets. Results are presented along with a discussion. Finally, the conclusion is drawn in Section 4.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 11–20, 2011. © Springer-Verlag Berlin Heidelberg 2011
2 Proposed Algorithm GSOM
The clustering algorithm GSOM is based on the two-level approach depicted in Fig. 1. First, a set of prototypes is created using the SOM as a vector quantization method. Each data point belongs to its closest prototype, called the best matching unit (BMU). Data points with a common BMU, acting as their representative, form a first-level cluster. There are several times more prototypes than the expected number of clusters. On the next level, prototypes are treated as movable objects in a feature space, where a force of gravity moves them towards each other. When two prototypes are close enough, they merge into a single prototype with mass unchanged, for the reason explained later in this section.

Fig. 1. Two-level scheme of GSOM. a) Input data set. b) SOM is trained and BMUs are identified (black circles). Interpolating units (empty diamonds) are eliminated together with connections. c) BMUs are interpreted as mass points and moved around under the influence of gravitational force. Merging occurs when two points are close enough. d) As a result, the final clustering is obtained; different markers are used for different clusters.

The main benefit of using the SOM on the first level of the proposed algorithm is to obtain topologically ordered representatives of the data. Prototypes are connected with each other in a grid, and the neighbors of each of them are known. We use this valuable and often omitted information to bound the influence of the gravitational field to a prototype’s close neighbors and therefore to stabilize and enhance the gravitational clustering process on the second level. Another advantage of the SOM is a reduction of noise. The prototypes are local averages of the data and are therefore less sensitive to random variations. Finally, a SOM with a properly chosen number of neurons reduces the computational complexity of clustering the data, especially when the number of input points is huge, as shown in [4].

2.1 SOM Algorithm
The SOM is a regular two-dimensional a × b grid of M = a · b neurons. Each neuron is represented by a prototype vector $m_i = [m_{i1}, \ldots, m_{id}]$, where d is the dimension of the input space. The neurons are connected to the adjacent ones by a neighborhood relation. Each neuron, except the ones on the border of the map, has four or six direct neighbors, depending on whether a rectangular or a hexagonal grid structure is chosen, respectively.

Before the training, linear initialization of the SOM is made in the subspace spanned by the two eigenvectors with the greatest eigenvalues computed from the original data. For initialization and training of the SOM, the SOM Toolbox for MATLAB¹ was used. In our case the SOM is trained in batch mode, where the whole data set is presented to the SOM before the adjustments are made to the prototype vectors. In each epoch, the data set is partitioned according to the Voronoi regions of the neurons. Each data point $x_j$ belongs to the neuron to which it is the closest. After this, the prototype vector of neuron i is updated as

$$m_i(t + 1) = \frac{\sum_{j=1}^{N} h_{ic(j)}(t) \, x_j}{\sum_{j=1}^{N} h_{ic(j)}(t)}, \tag{1}$$

where c(j) is the BMU of data point $x_j$ and N is the number of points in the data set. The new value of the i-th prototype vector $m_i$ is computed as a weighted average of all data points, where the weight of each data point is the value of the neighborhood kernel function $h_{ic(j)}(t)$ centered on its BMU c(j). We used a Gaussian neighborhood kernel function with a width defined by the parameter σ, which decreases monotonically in time. The initial value of σ is $\sigma_0 = \max\{1, \max\{a, b\}/8\}$. The values a and b are chosen such that the ratio a/b is approximately equal to the square root of the ratio between the two largest eigenvalues of the data in the input space. The SOM is trained in two phases: a rough phase with the number of epochs $l_r = \max\{1, 10 \cdot M/N\}$ and a fine-tuning phase with the number of epochs $l_f = \max\{1, 40 \cdot M/N\}$. Above, $M = S \cdot 5 \cdot \sqrt{N}$, where S is a scale factor set to 1 by default. The values of $\sigma_0$, $l_r$, $l_f$ and M are heuristically determined as proposed by the authors of the SOM Toolbox.

As a result of the first level of the algorithm, we obtain prototypes that represent the original data. Interpolating prototypes, which are not the BMU of any data point, are eliminated together with the connections to their neighbors. This proves very beneficial, as it widens the gap between distant regions of map units. Therefore, only BMUs are taken onto the second level of GSOM.

2.2 Gravitational Clustering
The BMUs identified on the first level of the algorithm are now interpreted as particles in the d-dimensional space with mass equal to unity. During the iterations, each particle is moved around according to a simplified version of the law of gravitation using Newton’s second law of motion, as proposed in [8]. The new position of point x influenced by the gravity of point y is

$$x(t + 1) = x(t) + \frac{G}{\|d\|^2} \cdot \frac{d}{\|d\|}, \tag{2}$$

where d = y(t) − x(t) is the vector pointing from x to y, so that x is attracted towards y, and G is the parameter of gravity, which is decreased by the factor ΔG at each iteration, following the rule G = (1 − ΔG) · G. When two points are moved close enough, i.e., when ||d|| is lower than the parameter α, they are merged into a single point with mass equal to unity. This principle ensures that clusters with greater density do not affect smaller or less dense ones. The experiments presented in Section 3 show that such an approach is beneficial.

¹ SOM Toolbox for MATLAB is available under the GNU General Public License at: http://www.cis.hut.fi/somtoolbox/.

The number of points decreases during the iterations, provided that an appropriate G is chosen. At each iteration of the algorithm, every point x in the remaining set of points, denoted by P, is considered once. Then we have to choose another point y and move both of them according to Eq. 2. As both points are actually BMUs taken from the SOM, point y can be selected in two ways: either from the neighbors of x, if any of them exist, or as a random point from the set P, not equal to the point x. With probability 1 − p, one of the existing neighbors of x is randomly chosen, and with probability p a random point from P is selected, where p is a parameter of the algorithm. When p is small, the movement of a point is influenced more by its closest neighbors, and when p is large, the information about locality is less important.

The algorithm stops when G is reduced to a value at which the movements of all remaining points are below a particular threshold value. Alternative stopping criteria are reaching a predefined maximum number of iterations or having only two points remaining in the set P. The last criterion implicitly means that we want to split the data into at least two groups, which is a reasonable assumption. The points remaining in the set P are the final cluster representatives. Each representative may contain one or more BMUs and therefore all data points they cover. The number of discovered clusters depends on the features of the data set and on the values of the input parameters. Therefore, the GSOM determines the number of clusters automatically, without predefining it. The essential step is the selection of the parameters, which will be considered in the next section.
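The second level of the algorithm described above can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not stated in the source: points are lists of coordinates, the SOM neighborhood is passed as a dictionary of index sets, the attraction is taken to act from x towards y, the step length G/||d||² is capped at ||d||/2 so that two points never jump past each other, and the default parameter values and all names are hypothetical.

```python
import math
import random

def gravitational_clusters(points, neighbors, G=1e-4, delta_G=0.01, alpha=0.01,
                           p=0.1, max_iter=500, min_move=1e-9, seed=0):
    """Group unit-mass points (BMUs) by simulated gravity.

    points:    list of coordinate lists, one per BMU
    neighbors: dict mapping a point index to the indices of its SOM-grid neighbors
    Returns a list of sets, each holding the BMU indices merged into one cluster.
    """
    rng = random.Random(seed)
    pos = {i: list(x) for i, x in enumerate(points)}
    nbr = {i: set(neighbors.get(i, ())) for i in pos}
    members = {i: {i} for i in pos}
    for _ in range(max_iter):
        if len(pos) <= 2:                        # stop: only two points remain
            break
        largest_move = 0.0
        for i in list(pos):
            if i not in pos or len(pos) <= 2:    # i was merged away, or nothing left to do
                continue
            local = [m for m in nbr[i] if m in pos]
            if local and rng.random() >= p:      # with probability 1 - p: a neighbor
                j = rng.choice(local)
            else:                                # otherwise: any other remaining point
                j = rng.choice([m for m in pos if m != i])
            d = [b - a for a, b in zip(pos[i], pos[j])]   # vector from x toward y
            norm = math.sqrt(sum(c * c for c in d))
            if norm >= alpha:
                # Step length G / ||d||^2 as in Eq. (2), capped at ||d|| / 2 (assumption).
                move = min(G / norm ** 2, norm / 2.0)
                pos[i] = [a + move * c / norm for a, c in zip(pos[i], d)]
                pos[j] = [b - move * c / norm for b, c in zip(pos[j], d)]
                largest_move = max(largest_move, move)
                norm -= 2.0 * move               # separation left after both moved
            if norm < alpha:                     # close enough: merge j into i, mass stays 1
                members[i] |= members.pop(j)
                for m in nbr.pop(j):
                    if m != i and m in nbr:
                        nbr[m].discard(j)
                        nbr[m].add(i)
                        nbr[i].add(m)
                nbr[i].discard(j)
                del pos[j]
        if largest_move < min_move:              # stop: all movements below the threshold
            break
        G *= 1.0 - delta_G                       # G = (1 - delta_G) * G
    return list(members.values())

# Illustrative input: two tight groups whose members are grid neighbors.
points = [[0.0, 0.0], [0.05, 0.0], [0.0, 0.05],
          [1.0, 1.0], [0.95, 1.0], [1.0, 0.95]]
neighbors = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
clusters = gravitational_clusters(points, neighbors, p=0.0)
print(clusters)
```

With p = 0 only grid neighbors attract each other while any remain, so each group collapses onto a single representative and the procedure stops once two points are left.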
3 Experiments and Results
Experiments were conducted over the synthetic data sets Giant, Hepta, Ring, Wave, Moon, and Flag and the real data sets Iris, Wine, and LetterABC. The performance of the proposed clustering algorithm is assessed in comparison with three selected algorithms: the Expectation Maximization algorithm using a mixture of Gaussians (EM GMM) [9], the Cauchy-Schwarz divergence clustering algorithm (CS) [10], and the clustering of the SOM with the k-means algorithm (SOMkM) [4]. EM GMM was chosen as a baseline method because of its popularity and efficiency, although it assumes a hyper-spherical shape of clusters. The CS algorithm is more advanced in the sense of discovering complex cluster shapes. In addition, the SOMkM method has been included as a representative of the algorithms that perform clustering of the SOM. Table 1 summarizes the properties of the data sets, and their plots are presented in Fig. 2, which also shows the best clustering results obtained by the GSOM algorithm. Note that the data sets Hepta, Iris, Wine and LetterABC are plotted using the PCA (principal component analysis) projection due to the high dimensionality of the data. Each data set is linearly scaled to fit in the range [0, 1] before clustering is carried out.
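The linear scaling to the range [0, 1] mentioned above can be sketched as a min-max transformation. The sketch assumes that every attribute (column) is scaled independently and that a constant attribute is mapped to 0; the source does not specify these details, and the function name is hypothetical.

```python
def scale_to_unit_range(data):
    """Linearly rescale each attribute (column) of data to the range [0, 1]."""
    columns = list(zip(*data))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [[(v - lo) / span if span > 0 else 0.0   # constant column -> 0 (assumption)
             for v, lo, span in zip(row, lows, spans)]
            for row in data]

# Illustrative input with two attributes on different scales.
scaled = scale_to_unit_range([[1, 10], [3, 20], [2, 30]])
print(scaled)
```

Scaling of this kind keeps attributes with large numeric ranges from dominating the distance computations in both levels of the algorithm.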
N. Ilc and A. Dobnikar
Fig. 2. Data sets used in experiments. The best clustering results of GSOM are displayed using different shapes or colors of markers. PCA projection is used to visualize Hepta, Iris, Wine, and LetterABC data.
3.1 Data Sets
A short description of the data sets used is given as follows:
a) Data set Giant consists of 862 2-D points and has two clusters: one small spherical cluster on the right side with 10 points and one huge spherical cluster with 852 points on the left side. The much greater density of the leftmost cluster, compared to the other one, is the difficulty here, leading algorithms to split the giant instead of finding the dwarf.
b) Hepta is a data set with 212 points, which form seven clusters of spherical shape. Each cluster contains 30 data points, except for the middle one, which contains two more points. Hepta is a part of the Fundamental Clustering Problem Suite, available at http://www.uni-marburg.de/fb12/datenbionik/.
c) Data set Ring consists of 800 2-D points forming two concentric rings, each containing 400 points. Non-linear separability and sophisticated connectivity are presented here to challenge the methods.
Gravitational Clustering of the SOM
d) Data set Wave is generated to measure the algorithms’ performance on highly irregular, longitudinal and linearly non-separable clusters. The 2-D data consists of 148 points in the upper and 145 points in the lower wavy curve.
e) Data set Moon is another problem domain with linearly non-separable clusters. Here, four clusters are defined, containing 104, 150, 150 and 110 2-D points, from the topmost to the lowermost cluster respectively.
f) Data set Flag consists of 640 points that form three clusters. The spherical cluster in the middle contains 100 2-D points; the cluster above and the cluster beneath contain 270 points each.
g) Iris data set [11] has been widely used in classification tasks. It has 150 points of four dimensions, divided into three classes of an iris plant with 50 points each. The first class is linearly separable from the other two. The second and the third class overlap and are linearly non-separable.
h) Wine data set [11] has 178 13-D points with three known classes of wines derived from three different cultivars. The numbers of data points in the classes are 59, 71 and 48 respectively.
i) LetterABC data set is based on the Letter data set from [11], containing only the data for identification of the letters A, B and C. There are 1719 data points in total with 16 numerical attributes.
3.2 Parameters Setting
As can be seen from Section 2, six parameters need to be set for the GSOM algorithm to work. Fortunately, it turns out that default values, or values selected with heuristics, can be used in the majority of cases. Extensive experiments on the influence of the parameters were conducted, including SOM sizes with four scale factor values S = {0.5, 0.75, 1, 2}, two shapes of SOM grid {rectangular, hexagonal}, five values of G = {4 · 10⁻⁴, 6 · 10⁻⁴, 8 · 10⁻⁴, 9 · 10⁻⁴, 1 · 10⁻³}, five values of ΔG = {0.03, 0.04, 0.045, 0.05, 0.06}, five values of p = {0, 0.1, 0.5, 0.9, 1}, and five values of α = {0.001, 0.005, 0.01, 0.05, 0.1}. For each data set, a total of 4 · 2 · 5 · 5 · 5 · 5 = 5000 configurations of GSOM parameters are considered. Every clustering result is then evaluated with an external measure of quality called clustering error (CE) [12], defined as the percentage of wrongly clustered data points. To calculate CE, the clusters found are matched to the desired clusters so that the total intersection between the result of the clustering method and the desired clustering is maximized. The analysis of the results, summarized only briefly here, shows that the following parameter values should be taken as defaults: SOM size with S = 1, rectangular SOM grid, G = 0.0008, ΔG = 0.045, α = 0.01, and p = 0.1. In addition, parameter α has the lowest impact on the quality of clustering; it is followed, in increasing order of impact, by SOM grid, p, ΔG, S, and G. Table 2 displays the best parameter values, which give the minimal CE, for each data set. The parameters of the other methods were set as follows. The maximum number of iterations for EM GMM was set to 500 in order to assure convergence. The parameters of the CS algorithm were set in accordance with the authors’ suggestions in [10] and [13]. Concerning the parameters of the SOMkM algorithm, a benchmark test of the parameters SOM size and SOM grid, similar to those described for
Table 1. Data sets used for performance measurements. The number of clusters is a man-given ground truth.

Data set     Data points   Dim.   Clusters
Giant        862           2      2
Hepta        212           3      7
Ring         800           2      2
Wave         293           2      2
Moon         514           2      4
Flag         640           2      3
Iris         150           4      3
Wine         178           13     3
LetterABC    1719          16     3
Table 2. The best values of the parameters for the GSOM algorithm. SOM size is the size of the 2-D grid with the scale factor S given in brackets, SOM grid can be rectangular (rect) or hexagonal (hexa), G is the initial gravitational constant, ΔG the reduction factor of G, α the merging distance and p the probability of choosing a random point instead of a neighbor.

Data set     SOM size            SOM grid   G        ΔG      α      p
Giant        13×11 (S = 1)       rect       0.0008   0.045   0.01   0.1
Hepta        9×8 (S = 1)         rect       0.0008   0.060   0.01   0.1
Ring         11×10 (S = 0.75)    rect       0.0008   0.045   0.01   0.1
Wave         14×12 (S = 2)       rect       0.0008   0.045   0.01   0.1
Moon         20×10 (S = 2)       rect       0.0008   0.045   0.01   0.0
Flag         14×9 (S = 1)        rect       0.0008   0.045   0.01   0.1
Iris         12×5 (S = 2)        rect       0.0008   0.045   0.01   0.1
Wine         7×5 (S = 0.5)       rect       0.0008   0.030   0.01   0.1
LetterABC    12×9 (S = 0.5)      rect       0.0010   0.030   0.01   0.1
GSOM was performed, and the values that give the minimal CE were chosen. All three algorithms require the number of clusters as an input parameter. We set it to the values shown in Table 1.
3.3 Evaluation of Results
Clustering of the nine data sets is performed by the proposed GSOM and three other algorithms: EM GMM, CS, and SOMkM. The results are evaluated with respect to the desired clustering, i.e. the ground truth, and the clustering error is computed: its minimum, maximum and mean value over 100 runs of each algorithm. The average running time is measured on an Intel Core2 Duo 2.1 GHz processor with 3 GB of memory and MATLAB version R2007b. The results of the experiments are collected in Table 3, and the best ones of GSOM are visualized in Fig. 2. Considering the minimal and mean clustering error on the data sets Giant, Hepta, Wave, Moon, and Flag, GSOM outperforms the other methods. CS is the only method that achieves a perfect result on the Ring data set. It is followed by GSOM, which is able to discover the inner circle, while the outer one is partitioned into three clusters. When clustering the Iris data set, the lowest minimal errors are obtained with EM GMM and GSOM, though CS achieves the lowest mean error. EM GMM is also the most successful method in clustering the Wine and the LetterABC data. The latter is the hardest problem for the proposed GSOM algorithm, which has the highest error rate among all compared methods on this data set.
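The clustering error described in Section 3.2 can be sketched as below. The brute-force matching over all one-to-one assignments of clusters to classes is an assumption made for this illustration (it is only practical for a handful of clusters); the measure in [12] may be computed differently in practice.

```python
from itertools import permutations

def clustering_error(y_true, y_pred):
    """Fraction of points outside the best one-to-one cluster/class matching."""
    classes = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    # overlap[c][t] = number of points in cluster c whose true class is t
    overlap = {c: {t: 0 for t in classes} for c in clusters}
    for t, c in zip(y_true, y_pred):
        overlap[c][t] += 1
    # Pad the shorter side with None so every one-to-one matching can be
    # enumerated; a cluster matched to None contributes no correct points.
    k = max(len(classes), len(clusters))
    padded_classes = classes + [None] * (k - len(classes))
    padded_clusters = clusters + [None] * (k - len(clusters))
    best = 0
    for perm in permutations(padded_classes):
        matched = sum(
            overlap[c][t]
            for c, t in zip(padded_clusters, perm)
            if c is not None and t is not None
        )
        best = max(best, matched)
    return 1.0 - best / len(y_true)
```

For example, `clustering_error([0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0, 0])` matches five of the six points and therefore returns 1/6.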
Table 3. Performance of the GSOM algorithm compared to EM GMM, CS and SOMkM. Clustering error (min / max, mean ± standard deviation) and the average running time (s) are measured for every data set.

Data set     Measure      EM GMM          CS              SOMkM           GSOM
Giant        min / max    0.000 / 0.017   0.219 / 0.497   0.352 / 0.458   0.000 / 0.000
             mean ± std   0.007 ± 0.002   0.404 ± 0.062   0.457 ± 0.011   0.000 ± 0.000
             time (s)     0.054           78.694          0.092           0.315
Hepta        min / max    0.000 / 0.557   0.000 / 0.269   0.000 / 0.542   0.000 / 0.142
             mean ± std   0.254 ± 0.121   0.057 ± 0.062   0.227 ± 0.124   0.003 ± 0.02
             time (s)     0.042           0.900           0.032           0.116
Ring         min / max    0.418 / 0.500   0.000 / 0.000   0.466 / 0.500   0.288 / 0.395
             mean ± std   0.491 ± 0.022   0.000 ± 0.000   0.493 ± 0.010   0.349 ± 0.025
             time (s)     0.397           48.109          0.032           0.204
Wave         min / max    0.280 / 0.491   0.130 / 0.403   0.126 / 0.495   0.000 / 0.495
             mean ± std   0.448 ± 0.069   0.237 ± 0.093   0.393 ± 0.107   0.173 ± 0.131
             time (s)     0.031           3.180           0.045           0.318
Moon         min / max    0.307 / 0.541   0.000 / 0.465   0.288 / 0.521   0.000 / 0.292
             mean ± std   0.421 ± 0.058   0.284 ± 0.165   0.451 ± 0.062   0.048 ± 0.103
             time (s)     0.206           11.011          0.030           0.386
Flag         min / max    0.000 / 0.641   0.000 / 0.252   0.000 / 0.361   0.000 / 0.000
             mean ± std   0.114 ± 0.192   0.003 ± 0.025   0.118 ± 0.163   0.000 ± 0.000
             time (s)     0.060           23.636          0.033           0.215
Iris         min / max    0.033 / 0.613   0.040 / 0.173   0.047 / 0.333   0.033 / 0.333
             mean ± std   0.169 ± 0.165   0.072 ± 0.029   0.087 ± 0.055   0.260 ± 0.096
             time (s)     0.019           0.447           0.024           0.125
Wine         min / max    0.011 / 0.494   0.056 / 0.427   0.051 / 0.056   0.034 / 0.601
             mean ± std   0.268 ± 0.130   0.139 ± 0.055   0.053 ± 0.003   0.253 ± 0.113
             time (s)     0.032           1.071           0.130           0.091
LetterABC    min / max    0.068 / 0.601   0.180 / 0.453   0.180 / 0.472   0.361 / 0.571
             mean ± std   0.294 ± 0.114   0.301 ± 0.096   0.318 ± 0.049   0.498 ± 0.074
             time (s)     0.216           918.997         0.067           0.346
It is important to stress that the execution times of the GSOM algorithm are shorter than those of CS by a factor of 100, or even 1000 in the case of the LetterABC data set, while the error rates are quite comparable in general. EM GMM and SOMkM are approximately 10 times faster than GSOM. Except for the data sets Ring and LetterABC, GSOM correctly finds the expected number of clusters.
4 Conclusion
A novel approach to clustering Kohonen’s SOM is presented in this paper, utilizing gravitational clustering in a two-level scheme. According to the results of the experiments, the advantages of the presented method, GSOM, are as follows. First, GSOM is able to detect and successfully cluster data of complex shapes with linearly non-separable regions. Second, the proposed algorithm
determines the number of clusters automatically. Finally, employing the SOM on the first level of the algorithm greatly decreases the overall execution time and thus enables the processing of large data sets, which will also be the subject of our further research. Furthermore, data preprocessing methods have to be studied in order to set the values of the GSOM input parameters according to the features of a given data set instead of using heuristics.
References
1. Dunham, M.H.: Data Mining: Introductory and Advanced Topics. Prentice Hall, Englewood Cliffs (2003)
2. Witten, I.H., Frank, E.: Data mining: practical machine learning tools and techniques. Elsevier, Amsterdam (2005)
3. Kohonen, T.: Self-organizing maps. Springer, Heidelberg (2001)
4. Vesanto, J., Alhoniemi, E.: Clustering of the Self-Organizing Map. IEEE Trans. on Neural Networks 11(3), 586–600 (2000)
5. Ultsch, A.: Emergence in Self Organizing Feature Maps. In: 6th International Workshop on Self-Organizing Maps (2007)
6. Brugger, D., Bogdan, M., Rosenstiel, W.: Automatic Cluster Detection in Kohonen’s SOM. IEEE Trans. on Neural Networks 19(3), 442–459 (2008)
7. Wright, W.E.: Gravitational clustering. Pattern Recognition 9, 151–166 (1977)
8. Gomez, J., Dasgupta, D., Nasraoui, O.: A new gravitational clustering algorithm. In: 3rd SIAM International Conference on Data Mining (2003)
9. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006)
10. Jenssen, R., Principe, J.C., Eltoft, T.: Cauchy-Schwarz pdf Divergence Measure for non-Parametric Clustering. In: IEEE Norway Section International Symposium on Signal Processing (2003)
11. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html
12. Meila, M., Heckerman, D.: An Experimental Comparison of Model-Based Clustering Methods. Machine Learning 42, 9–29 (2001)
13. Jenssen, R., Principe, J.C., Erdogmus, D., Eltoft, T.: The Cauchy-Schwarz Divergence and Parzen Windowing: Connections to Graph Theory and Mercer Kernels. Journal of the Franklin Institute 343(6), 614–629 (2006)
A General Method for Visualizing and Explaining Black-Box Regression Models
Erik Štrumbelj and Igor Kononenko
Faculty of Computer and Information Science, University of Ljubljana, Tržaška 25, 1000 Ljubljana, Slovenia
{erik.strumbelj,igor.kononenko}@fri.uni-lj.si
Abstract. We propose a method for explaining regression models and their predictions for individual instances. The method successfully reveals how individual features influence the model and can be used with any type of regression model in a uniform way. We used different types of models and data sets to demonstrate that the method is a useful tool for explaining, comparing, and identifying errors in regression models. Keywords: Neural networks, SVM, prediction, transparency.
1 Introduction
Explaining prediction models and their predictions is an integral part of machine learning. The purpose of such methods is to make models more informative, easier to understand, and easier to use. These benefits are especially welcome when using non-transparent prediction models, such as artificial neural networks and SVM. Some of the most popular learning algorithms (naive Bayes, decision trees) owe part of their popularity to their ability to produce models which are inherently easy to interpret. For others, model-specific explanation and visualization methods have been developed [3,5,6]. There also exist general methods that can be applied to any model. The latter are the focus of this paper. Before discussing general explanation methods, we start with a simple example. Figure 1 is an explanation for an instance from the artificial data set testA. Instances from this data set describe a situation involving a student in consultation with a professor about the student’s final mark. The teacher can immediately pass the student or may opt to test the student with additional questions, in which case it comes down to the student’s knowledge whether the student will pass. The model’s task is to predict the student’s chances of success. The binary feature teacher describes the teacher’s action. The feature student describes the student’s knowledge and has 6 possible, equally spaced levels, where 0 means certain failure, 1 means a 20% chance, ..., and 5 means certain success. In testA all combinations of values of the two features are equally probable. The explanation in Figure 1 is consistent with our intuition and helps us understand the model’s prediction. Observe how the explanation is given in the form of magnitudes and directions of features’ contributions. Assigning a contribution
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 21–30, 2011. © Springer-Verlag Berlin Heidelberg 2011
Fig. 1. The decision tree makes a dire prediction (0.12) for this poorly prepared student (student = 0) who will be tested (teacher = 1). The explanation suggests that both features have an approximately equal contribution. Both are negative, speaking against the student’s chances.
Fig. 2. A general explanation reveals that both features are approximately equally important (grey dots). Studying increases the student’s chances. Not being tested is beneficial while being tested has an opposite effect.
(score, rank, etc.) is a common approach and is used in most of the previously mentioned model-specific methods and in all of the general methods that follow. By using a general method, machine learning and data mining practitioners can avoid using a different model-specific explanation method for each model, which also simplifies comparison. Furthermore, in a practical setting it is very desirable, especially from the end-user’s perspective, that the explanation method need not be replaced if the underlying prediction models change. To achieve such generality, methods must avoid anything model-specific, essentially treating every model as a black box and limiting all interaction to changing the inputs (feature values) and observing the outputs. Clearly, going through all possible combinations of input values is infeasible, so each method is forced into some sort of tradeoff between its time complexity and the complexity of what it can extract from a model. Some existing methods, such as [7] and [4], use the “one feature at a time” approach. A feature’s contribution for a particular instance is defined as the average change in prediction when the feature’s value is permuted. While this reduces the time complexity, in some cases it does not result in a change that reveals the true importance of a feature. Observe how the value of the expression 1 ∨ 1 does not change if we change either of the 1’s to 0. Both must be changed at the same time to achieve a change. A recently published paper introduces FIRM, a method for computing the importance of features for a given model [9]. For each feature, the method observes the variance of the conditional expected output of the model across all possible values of that feature (conditional on the given value of the feature). However, observe how for two uniformly distributed binary variables E[b1 XOR b2 | b1 = 1] = E[b1 XOR b2 | b1 = 0] = 0.5. The conditional expected outputs will be the same and the variance will be 0. A clearly important variable will be assigned 0 importance.
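The two counterexamples can be checked with a few lines of code. This is a minimal sketch of the arithmetic only; it does not reproduce the implementations of the methods in [4], [7] or [9], and all names are illustrative.

```python
from statistics import pvariance

def f_or(b1, b2):
    return int(b1 or b2)

def f_xor(b1, b2):
    return b1 ^ b2

# "One feature at a time": change a single input of 1 OR 1 to 0.
base = f_or(1, 1)
single_flips = [f_or(0, 1) - base, f_or(1, 0) - base]
joint_flip = f_or(0, 0) - base

def conditional_expectation(f, fixed_value):
    """E[f(b1, b2) | b1 = fixed_value] for a uniformly distributed b2."""
    return sum(f(fixed_value, b2) for b2 in (0, 1)) / 2

# Variance of the conditional expected output across the values of b1.
cond = [conditional_expectation(f_xor, v) for v in (0, 1)]
variance_importance = pvariance(cond)

print(single_flips, joint_flip)   # [0, 0] -1
print(cond, variance_importance)  # [0.5, 0.5] 0.0
```

Changing one input at a time leaves the OR unchanged, and the variance of the conditional expectations of the XOR is zero, even though both inputs clearly matter in each case.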
A method that solves the problems mentioned in the previous paragraph was recently developed for classification models [8]. The authors’ basic idea is to observe changes across all subsets of features (for example, also observing how the value of 1 ∨ 1 changes if we change both values at the same time). The exponential time complexity is resolved by an approximation algorithm. However, unresolved issues remain. First, the method is limited to classification models and cannot be used to explain a regression model. Second, it can only be used to explain a particular instance (see Figure 1), whereas users would benefit from a global overview of how features contribute (see Figure 2). Third, the proposed approximation algorithm is based on the very strict assumption that all combinations of feature values are equiprobable. Successfully dealing with the first two issues and loosening the assumption in the third are the main contributions of this paper. The remainder of the paper is divided into 3 sections. In Section 2 we adapt the explanation method for use with regression models and introduce improvements. Section 3 describes a series of experiments on artificial data sets, followed by an experiment on a real-world data set. With Section 4 we conclude the paper and give some ideas for further work.
2 Explaining Regression Models’ Predictions
Let A = A1 × A2 × ... × An be our feature space, where each feature Ai is a set of values. Let p be the probability mass function defined on the sample space A. Let f : A → ℝ be our regression model. No other assumptions are made about f. Let S = {A1, ..., An}. The influence of a certain subset of features Q ⊆ S in a given instance x ∈ A is defined as

$$\Delta(Q)(x) = E[f \mid \text{values of features in } Q \text{ for } x] - E[f]. \tag{1}$$

In other words, the contribution of a subset of feature values in a particular instance is the change in expectation caused by observing those feature values. Suppose we have Δ(Q)(x) for every Q ⊆ S. How do we combine these values to form contributions of individual feature values? In [8] the authors propose using the well-known game-theoretic solution, the Shapley value, to define ϕi(x), the contribution of the i-th feature for instance x:

$$\varphi_i(x) = \sum_{Q \subseteq S \setminus \{i\}} \frac{|Q|!\,(|S| - |Q| - 1)!}{|S|!} \left( \Delta(Q \cup \{i\})(x) - \Delta(Q)(x) \right). \tag{2}$$

Eq. 2 has desirable properties. The feature contributions are implicitly normalized (they sum up to the initial difference Δ(S)(x)), which makes them easier to interpret. If a feature does not have any impact on the prediction, it will be assigned a 0 contribution. Furthermore, features with a symmetrical impact will be assigned equal contributions. The work described so far in this section is credited to [8], and only minor modifications were necessary to apply the method to a regression setting (in our case f is a regression model’s output, instead of a classification model’s probabilistic prediction for a given class value).
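For a very small feature space, Eq. 1 and Eq. 2 can be evaluated exactly by enumeration. The sketch below is only an illustration: the function names, the toy OR model and the uniform p are assumptions, and the conditional expectation presumes that the conditioning event has non-zero probability.

```python
from itertools import combinations, product
from math import factorial

def delta(f, domains, p, x, Q):
    """Eq. 1: E[f | features in Q take their values from x] - E[f]."""
    space = list(product(*domains))
    expected = sum(p(y) * f(y) for y in space)
    match = [y for y in space if all(y[i] == x[i] for i in Q)]
    mass = sum(p(y) for y in match)  # assumed to be non-zero
    conditional = sum(p(y) * f(y) for y in match) / mass
    return conditional - expected

def shapley(f, domains, p, x):
    """Eq. 2: exact contributions, enumerating every subset Q without i."""
    n = len(domains)
    phi = []
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        value = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for Q in combinations(rest, size):
                value += weight * (delta(f, domains, p, x, Q + (i,))
                                   - delta(f, domains, p, x, Q))
        phi.append(value)
    return phi

# Toy model: logical OR of two binary features with a uniform p.
domains = [(0, 1), (0, 1)]
f = lambda y: float(y[0] or y[1])
p = lambda y: 0.25
phi = shapley(f, domains, p, (1, 1))
```

For the instance (1, 1) both contributions come out as 0.125, and they sum to Δ(S)(x) = 1 − 0.75 = 0.25, which illustrates the normalization and symmetry properties mentioned above.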
2.1 Approximation Algorithm
Eq. 2 reflects any influence the feature might have on the prediction. However, in practice it is often impossible to calculate the Δ-terms due to the time complexity. Even if we could, we would still face the exponential time complexity of computing ϕi(x). In [8] this is resolved by assuming that p(x) = 1/|A| for all x ∈ A. For any given feature space this assumption limits the choice of p to a single possibility. The distribution of values plays an important part in how people intuitively explain events. Recall the teacher/student scenario. The concept that students are more likely to pass if they study or are not tested is universal (that is, such a model would perform well at any university with a similar concept, regardless of the distribution of feature values). Our intuitive explanation, however, depends heavily on the distribution of feature values. For example, a student who does not study and is tested will fail. If this teacher tests students most of the time, we would say that it is mostly the student’s own fault for not studying. On the other hand, if the teacher almost never tests a student, most would say it was “bad luck” (that is, being tested is a much more important contributor than the amount of study). This example emphasizes the importance of providing more flexibility with respect to the choice of p, while still retaining an efficient explanation algorithm. To loosen the restriction, we assume that p is such that individual features are mutually independent. Eq. 1 then transforms into

$$\Delta(Q)(x) = \sum_{y \in A} p(y) \left( f(\tau(x, y, Q)) - f(y) \right). \tag{3}$$

Note that τ(x, y, W) = (z1, z2, ..., zn), where zi = xi iff i ∈ W and zi = yi otherwise. We use the alternative formulation of the Shapley value (equivalent to Eq. 2)

$$\varphi_i(x) = \frac{1}{n!} \sum_{O \in \pi(n)} \left( \Delta(Pr^i(O) \cup \{i\})(x) - \Delta(Pr^i(O))(x) \right), \tag{4}$$

where π(n) is the set of all permutations of n elements and Pr^i(O) is the set of all features which precede the i-th feature in permutation O ∈ π(n). By combining Eq. 3 and Eq. 4, we get

$$\varphi_i(x) = \frac{1}{n!} \sum_{O \in \pi(n)} \sum_{y \in A} p(y) \cdot \left( f(\tau(x, y, Pr^i(O) \cup \{i\})) - f(\tau(x, y, Pr^i(O))) \right), \tag{5}$$
which facilitates the use of random sampling and an efficient approximation algorithm (see Algorithm 1). Note that “at random” refers to drawing each feature’s value at random, according to the distribution of that feature’s values (usually, by sampling from a data set). Note also that, because values are drawn with replacement, features with finite and infinite domains are treated identically. Therefore, the algorithm can be applied to both nominal and numeric features. Observe the same model’s prediction for the same instance, but from data set testB where the teacher tests the students the vast majority of the time
Algorithm 1. Approximating ϕi(x), the importance of the i-th feature’s value for instance x and model f. Take m samples.
  ϕi(x) ← 0
  for k = 1 to m do
    select (at random) permutation O ∈ π(n) and instance y ∈ A
    x1 ← features preceding i in O and feature i take their values from x; features succeeding i in O take their values from y
    x2 ← features preceding i in O take their values from x; feature i and features succeeding i in O take their values from y
    ϕi(x) ← ϕi(x) + f(x1) − f(x2)
  end for
  ϕi(x) ← ϕi(x) / m
Algorithm 2. Approximating ψi,j, the global importance of the i-th feature’s value j for model f. Take m samples.
  ψi,j ← 0
  for k = 1 to m do
    select (at random) instance y ∈ A
    x1 ← set i-th feature to j, take other values from y
    ψi,j ← ψi,j + f(x1) − f(y)
  end for
  ψi,j ← ψi,j / m
(Figure 3) and compare to Figure 1. The explanation now depends on the context, and the proposed explanation method provides us with explanations which are in accordance with our own intuitive explanation. Figures 1 and 3 show how individual feature values influence the model’s prediction for a given instance. For a global overview of how a feature contributes, we could observe the contributions across several instances. Instead, we provide the same information within a single visualization. We define the global contribution of the i-th feature’s j-th value as the expected value of that feature’s contribution (see Eq. 5) for an instance where its value is j:

$$\psi_{i,j} = \sum_{x \in A,\, x[i] = j} p(x)\,\varphi_i(x) = \sum_{x \in A} p(x)\,\varphi_i(x') = \frac{1}{n!} \sum_{O \in \pi(n)} \sum_{x \in A} p(x) \left( f(x') - f(x) \right) = \sum_{x \in A} p(x) \left( f(x') - f(x) \right), \tag{6}$$

where x' is x with the i-th feature’s value set to j. Eq. 6 can be approximated using Algorithm 2.
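Algorithms 1 and 2 translate almost line by line into code. The following sketch is one possible reading, not the authors' Java/Weka implementation: the function names are hypothetical, `data` is assumed to be a list of instances, and each feature of y is drawn independently from `data`, which mirrors the independence assumption on p.

```python
import random

def instance_contribution(f, x, i, data, m, rng=random):
    """Algorithm 1: Monte Carlo estimate of phi_i(x)."""
    n = len(x)
    total = 0.0
    for _ in range(m):
        order = list(range(n))
        rng.shuffle(order)                        # random permutation O
        y = [rng.choice(data)[k] for k in range(n)]
        preceding = set(order[:order.index(i)])   # features preceding i in O
        x1 = [x[k] if (k in preceding or k == i) else y[k] for k in range(n)]
        x2 = [x[k] if k in preceding else y[k] for k in range(n)]
        total += f(x1) - f(x2)
    return total / m

def global_contribution(f, i, j, data, m, rng=random):
    """Algorithm 2: Monte Carlo estimate of psi_{i,j}."""
    n = len(data[0])
    total = 0.0
    for _ in range(m):
        y = [rng.choice(data)[k] for k in range(n)]
        x1 = list(y)
        x1[i] = j                                 # set the i-th feature to j
        total += f(x1) - f(y)
    return total / m
```

As a usage example, take the logical OR of two uniformly distributed binary features as f. The estimate of ϕ for either feature at the instance (1, 1) should approach 0.125, and the estimate of ψ for setting the first feature to 1 should approach 1 − 0.75 = 0.25.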
Fig. 3. If it is likely that the teacher will test the student then studying hard (or lack of) becomes much more important
Fig. 4. KNN1 does not perform well, but the features have a strong influence on its predictions. We can conclude it overfits.
Fig. 5. M5P successfully models dDisj and correctly predicts R = 1. The visualization shows that a single feature is responsible for the prediction, while the other two have the opposite effect.
Fig. 6. The neural network successfully models dXorBin and correctly predicts this instance. The explanation reveals that the first three features are important and all three contribute towards 1.
Figure 2 is a visualization of the global importance of features for our illustrative data set testA. Each grey/black point pair is obtained by running Algorithm 2. The mean of the ψi,j samples (black points) reveals the magnitude and direction of the value’s average influence. The standard deviation of the ψi,j samples (grey points) is also included for each value to reveal its global importance. For an instance explanation, we repeat Algorithm 1 for each feature. To ensure, with a certain probability, that the approximated contribution is within a certain distance of the actual contribution, we require a constant number of samples. Therefore, for a given error, the number of samples m needed to generate the explanation for a single feature does not increase with the number of features. The same applies to global visualizations, although the constant is larger because we repeat the process for each feature value we plot. The total running time for one explanation is a constant × the number of features n × the model’s prediction time complexity T(f(x)). The method’s time complexity is therefore O(n · T(f(x))). For most regression algorithms T(f(x)) is O(n), which implies quadratic time complexity. Our purpose is to show that the method is a well-founded and useful tool, which can be used to generate explanations in real-time
Fig. 7. SVM provides the best fit for dPoly. Subsequently, the contributions closely match the actual concepts.
Fig. 8. The visualization shows us that MP learned some but not all of the concepts behind the dPoly data set.
(order of seconds) for data sets with up to a few dozen features (already shown in [8]). A more rigorous analysis of the limits of the method wrt the number of features it can handle for a given type of model is delegated to further work.
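The text does not state how the constant number of samples is obtained. One common way to get such a constant, shown here purely as an assumption, is the normal approximation for a sample mean: m ≥ (z · σ / e)², where σ is the standard deviation of a single sampled difference, e the tolerated error and z the two-sided normal quantile for the desired confidence. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def required_samples(sigma, max_error, confidence=0.99):
    """Samples m so that the mean of m draws is within max_error of its
    expectation with roughly the given confidence (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided quantile
    return ceil((z * sigma / max_error) ** 2)
```

Note that the number of features n does not appear in the bound; it depends only on the spread of the sampled differences and on the required precision, which is consistent with the claim that m does not grow with n.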
3 Experimental Verification
We have shown that the method is theoretically well founded and has several desirable properties. But how well does it translate into practice? The time complexity was discussed at the end of Section 2.1. Due to length limits, we omit an in-depth analysis of running times in favor of showing more examples. We tested the method using six different regression algorithms: linear regression (LR), a Support Vector Machine for regression (SVM), a multi-layer perceptron with a single hidden layer (MP), k-nearest neighbors (k = 1 and k = 11), a regression tree (M5P), and pace regression (PR). The method was implemented in Java using Weka’s learning algorithm classes [1]. Default parameters were used, with the exception of SVM, where a 2nd degree polynomial kernel was used. A variety of models (in terms of performance and type) is desirable, as we can verify whether the explanations reveal why they performed well or poorly. Artificial data allow us to test whether explanations generated for good models are close to those generated for the optimal model (and vice versa). All feature values lie between 0.00 and 100.00, R is the target variable, each data set has 5 features, and those that are not explicitly mentioned have no influence on R. Note that 1000 training and 1000 test samples were generated for each data set. Data sets: dLinear (R = A1 + 2A2 + 3A3), dRedund (R = 2A1 − 2A2; A3 always has the same value as A2 to create a redundant feature), dLocLin (features A3 and A4 are binary and divide the problem space into 4 locally linear subproblems: R = 5A1 + A2, if A3 = 0 ∧ A4 = 0; R = A1 − 4A2, if A3 = 0 ∧ A4 = 1; R =
Table 1. RRMSE and distances from the explanation for an optimal model (in parentheses). The correlation coefficients between the two are included for each data set.

Model   dLinear        dLocLin         dRedund        dTrig         dPoly         dDisj         dXor          dXorBin       dRand
LR      0.00 (3.09)    0.49 (112.72)   0.00 (2.33)    0.83 (0.78)   0.98 (4.06)   0.85 (0.17)   1.00 (0.35)   1.00 (0.29)   1.00 (1.82)
MP      0.01 (3.10)    0.05 (13.75)    0.02 (3.01)    0.33 (0.20)   0.88 (3.25)   0.72 (0.13)   0.57 (0.17)   0.00 (0.05)   0.99 (1.72)
SVM     0.01 (3.08)    0.13 (24.24)    0.01 (2.78)    0.50 (0.33)   0.13 (0.67)   1.05 (0.33)   0.81 (0.26)   1.60 (0.81)   1.00 (3.19)
M5P     0.24 (24.12)   0.08 (20.38)    0.12 (7.20)    0.18 (0.13)   0.30 (1.03)   0.34 (0.03)   0.30 (0.06)   0.00 (0.04)   1.00 (3.40)
KNN1    0.34 (19.80)   0.11 (25.53)    0.16 (17.21)   0.59 (0.35)   0.66 (2.52)   0.75 (0.10)   0.75 (0.23)   0.00 (0.14)   1.00 (14.87)
KNN10   0.24 (22.12)   0.11 (28.73)    0.13 (11.23)   0.52 (0.43)   0.60 (2.24)   0.61 (0.12)   0.60 (0.21)   0.26 (0.16)   1.01 (5.79)
PR      0.00 (2.97)    0.50 (114.52)   0.00 (3.25)    0.79 (0.73)   0.97 (4.05)   0.74 (0.16)   1.00 (0.35)   1.00 (0.29)   1.00 (1.89)
coeff   0.942          0.998           0.927          0.958         0.991         0.911         0.992         0.913         NA
2A1 + 8A2, if A3 = 1 ∧ A4 = 0; R = −2A1 − 3A2, if A3 = 1 ∧ A4 = 1), dTrig (R = sin(2πA1/100) + cos(2πA2/100)), dPoly (R = 2((A1 − 50)/25)² − 3((A2 − 50)/25)² − (A3 − 50)/25), dDisj (R = 1 if (A1 > 50) ∨ (A2 > 40) ∨ (A3 > 60); otherwise R = 0), dXor (an XOR problem, R = (A1 > 50) XOR (A2 > 50) XOR (A3 > 50)), dXorBin (similar to dXor, but all five features are binary, R = A1 XOR A2 XOR A3), dRand (R is chosen at random). First, we investigated whether the generated contributions reflect what the model learns. We evaluated the models with the relative root mean squared error (RRMSE). For a distance measure¹ we used the Euclidean distance between the vector (ϕ1, ..., ϕn) and the vector generated when using optimal predictions instead of f. Table 1 shows the results for the described experiment. Some models perform better and some data sets are more difficult. Regardless, explanation quality and model performance are highly correlated. Correlation is not applicable to dRand, where all models should have an RRMSE of 1 (any deviations are due to noise). However, some models overfit, which results in explanations away from optimal. For example, KNN1 is likely to overfit. Figure 4 reveals that feature A1 has a substantial influence on the KNN1 model, despite being useless for predicting R. The results confirm that the explanations reflect, at least in an abstract sense, what the models have learnt. We continue by observing some examples and verifying whether the explanations are useful from a user’s perspective. We start with instances from dDisj and dXorBin. Figures 5 and 6 are explanations for M5P on dDisj and MP on dXorBin, respectively. In the introduction we pointed out that these two concepts are representative of what existing general methods are unable to handle correctly. The visualizations show that the proposed method reveals the important features and their contributions. Now we proceed to global visualizations². The best model for dPoly was SVM.
The explanation (Figure 7) confirms that it fits the data well. The worst were the
¹ That is, to describe how much the explanations generated for a given model differ from those generated for an optimal model.
² We left some irrelevant features out of the visualizations, to conserve space.
Fig. 9. LR is most influenced by cement, water, and age. Concrete strength increases with age and amount of cement and decreases with the amount of water.
Fig. 10. Similar to LR, cement, water, and age are the most important for the neural network model. However, MP fits the non-linear relationships better.
Fig. 11. For this particular prediction age contributes positively. The amount of water and cement have a negative contribution. Construction experts agree with the explanation and elaborate that the mixture suffers from a high water-to-cement ratio. Least important features were removed.
linear models, which cannot fit the polynomial. The MP model is somewhere in between and Figure 8 reveals why. The model learned only a part of the concept, missing the relevance of feature A2. We conclude the section with a more realistic example of what data mining practitioners encounter in practice. The concrete data set has 9 numeric features: concrete mixture components (in kg/m³) and age (in days), and one target feature, the compressive strength of the mixture (in MPa). The data were obtained from the UCI repository, where they were made available by Prof. I-Cheng Yeh [2]. Modeling the compressive strength is a highly non-linear problem [2]. Using LR and MP we achieved mean squared errors of 109 and 55, respectively, while predicting with the mean value results in a mean squared error of 279 (we used 10-fold cross-validation). The minimum, maximum, mean, and standard deviation of the compressive strength class variable are 2.3, 82.6, 35.8, and 16.706, respectively.
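The error figures quoted above can be put on a common scale. The following is a minimal sketch, assuming that the relative root mean squared error is the square root of a model's MSE divided by the MSE of the mean predictor; this form is an assumption, since the exact RRMSE definition is not restated in this passage. The variable names are illustrative only.

```python
import math

# Values quoted in the text (10-fold cross-validation on the concrete data set).
mse_model = {"LR": 109.0, "MP": 55.0}
mse_mean_predictor = 279.0
target_std = 16.706  # standard deviation of compressive strength, in MPa

# Assumed definition: RRMSE = sqrt(MSE_model / MSE_mean_predictor).
rrmse = {name: math.sqrt(mse / mse_mean_predictor) for name, mse in mse_model.items()}

# Sanity check: the MSE of the mean predictor should be close to the target variance.
target_variance = target_std ** 2

for name, value in rrmse.items():
    print(f"{name}: RRMSE = {value:.3f}")
print(f"target variance = {target_variance:.1f}")
```

Under this assumption both models score well below 1 (the value expected for the mean predictor), and the squared standard deviation agrees with the reported mean-predictor MSE of 279.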
Figures 9 and 10 are visualizations for LR and MP. These are used to reveal the overall importance of individual features and their contribution to the model's predictions. When interested in a specific prediction, we observe the corresponding instance explanation. For example, Figure 11 is an instance explanation for MP's prediction for a particular concrete mixture. MP's prediction is close to the actual concrete compressive strength, while LR overestimates the compressive strength for this instance and predicts 60 MPa. The explanation reveals which features contribute towards or against compressive strength.
4 Conclusion
The proposed explanation method is simple to implement and can be applied to any regression model. It can explain both the model and its predictions. Results across different regression models and data sets confirmed that the method's explanations reflect what the models learn, even in cases where existing general explanation methods would fail. The examples presented throughout the paper illustrate that the method is a useful tool for visualizing models, comparing them, and identifying potential errors. With the emphasis on the theoretical properties and the method's usefulness, less attention was given to measuring and optimizing running times. We leave this to further work, together with an in-depth analysis of running times across different types of models.
References
1. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: An update. SIGKDD Explorations 11(1), 1–3 (2009)
2. Yeh, I.-C.: Modeling of strength of high performance concrete using artificial neural networks. Cement and Concrete Research 28(12), 1797–1808 (1998)
3. Jakulin, A., Možina, M., Demšar, J., Bratko, I., Zupan, B.: Nomograms for visualizing support vector machines. In: KDD 2005: ACM SIGKDD, pp. 108–117 (2005)
4. Lemaire, V., Féraud, R., Voisine, N.: Contact personalization using a score understanding method. In: IJCNN 2008 (2008)
5. Možina, M., Demšar, J., Kattan, M., Zupan, B.: Nomograms for visualization of naive Bayesian classifier. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) PKDD 2004. LNCS (LNAI), vol. 3202, pp. 337–348. Springer, Heidelberg (2004)
6. Poulet, F.: SVM and graphical algorithms: A cooperative approach. In: 4th IEEE ICDM, pp. 499–502 (2004)
7. Robnik-Šikonja, M., Kononenko, I.: Explaining classifications for individual instances. IEEE TKDE 20, 589–600 (2008)
8. Štrumbelj, E., Kononenko, I.: An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research 11, 1–18 (2010)
9. Zien, A., Krämer, N., Sonnenburg, S., Rätsch, G.: The feature importance ranking measure. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.) ECML PKDD 2009. LNCS, vol. 5782, pp. 694–709. Springer, Heidelberg (2009)
An Experimental Study on Electrical Signature Identification of Non-Intrusive Load Monitoring (NILM) Systems Marisa B. Figueiredo, Ana de Almeida, and Bernardete Ribeiro CISUC, Department of Informatics Engineering, University of Coimbra, Polo II, P-3030-290 Coimbra, Portugal {mbfig,amaria,bribeiro}@dei.uc.pt
Abstract. Electrical load disambiguation for end-use recognition in the residential sector has become an area of study in its own right. Several works have shown that individual loads can be detected (and separated) from sampling of the power at a single point (e.g. the electrical service entrance for the house) using a non-intrusive load monitoring (NILM) approach. This work presents the development of an algorithm for electrical feature extraction and pattern recognition, capable of determining the individual consumption of each device from the aggregate electric signal of the home. Namely, the idea consists of analyzing the electrical signal and identifying the unique patterns that occur whenever a device is turned on or off by applying signal processing techniques. We further describe our technique for distinguishing loads by matching different signal parameters (step-changes in active and reactive powers and power factor) to known patterns. Computational experiments show the effectiveness of the proposed approach.
Keywords: feature extraction and classification, k-nearest neighbors, non-intrusive load monitoring, steady-state signatures, support vector machines.
1 Introduction
“Your TV set has just been switched on.” This may very well be an SMS or email message received on your mobile phone in the near future. For energy monitoring, health care or home automation, concepts like Smart Grids or in-Home Activity Tracking are a recent and important trend. In that context, accurate and inconspicuous identification and monitoring of the consumption of electrical appliances is required. Currently, the available solutions for load consumption monitoring are smart meters and individual meters. The former supply aggregated consumption information without identifying which devices are on. To overcome this limitation, it would be sufficient to use an individual meter for each appliance in the house. However, this would turn out to be an expensive solution for a household.
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 31–40, 2011. © Springer-Verlag Berlin Heidelberg 2011
A non-intrusive load monitoring (NILM) system fulfills all the requirements imposed by the Smart Grids and in-Home Activity Tracking challenges at virtually no cost. NILM is a viable solution for monitoring individual electrical loads: a single device is used to monitor the electrical system and to identify the electric load related to each appliance, without increasing the marginal cost of electricity or needing extra sub-measurements. Nevertheless, its full potential can only be achieved with the low-cost sensing devices available today. The central goal of a NILM system is to identify which appliances are switched on at a given moment in time. The signals from the aggregate consumption of an electrical network are acquired and electrical features are extracted, in order to identify which devices are switched on. Each appliance has a particular electrical signature which must be recognized in order to perform an accurate identification. This paper presents a study of this distinctive electrical characterization. The proposed signature is based on the analysis and recognition of steady-states occurring in the active and reactive power signals and the power factor measurements. To evaluate this approach, data from a set of appliances were collected and classified using a Support Vector Machine (SVM) method and the K-Nearest Neighbors (K-NN) technique. The results of the computational experiments indicate that an accurate identification of the devices can, in fact, be accomplished. This paper is organized as follows: the next section presents a brief overview of the related literature. Section 3 describes the concepts behind the NILM system and the electrical signature problem. It then describes the developments associated with step-change analysis in an electrical signal and the features that can be used as distinctive marks, introducing a result that enables an algorithm for steady-state recognition.
Finally, Section 4 describes the experimental setup, in which the new feature extraction algorithm was followed by the SVM and K-NN classification algorithms, and reports the classification results. Conclusions and future work are addressed in Section 5.
2 Related Literature
To identify the devices switched on at a certain moment in time, a non-intrusive load monitoring system uses only the voltage and current signals of the aggregated electrical consumption, sampled at a single point. The concept was independently introduced by Hart [1] (then working at the Electric Power Research Institute) and by Sultanem (Electricité de France) [2]. Over the last decades, due to pressing environmental and economic issues, the interest in this area has increased, and it has been the focus of PhD theses such as [3]. In 1996, the first NILM system was commercialized by the company Enetics, Inc. The main steps in a non-intrusive load monitoring system are: a) the acquisition of electrical signals, b) the extraction of the important events and/or characteristics and c) the production of a classifier of electrical events (see Figure 1). To perform the identification, the definition of an electrical device ID is needed. Therefore, the electrical signatures are the basis of any NILM system [1]. These are defined as a set of parameters that can be measured from the total load. For
[Figure: block diagram of the NILM system. Blocks: Data Acquisition (sensing meter), Signal Analysis and Feature Extraction, Load Classification, Energy Estimation and Activity Inference. Labels: NILM electrical signals, sampled electrical signals, feature vectors, appliance identification; appliance signature study: load parameters, SVM and 5-NN.]
Fig. 1. A NILM high-level system and approach for the device signature study
a NILM system, these parameters usually result either from signal steady-states or from sensing transients. A steady-state signature is deduced from the difference between two steady-states in a signal, each consisting of a stable set of consecutive samples whose values are within a given threshold. The basic steady-state signature identifies the turning on and off of an electrical device connected to the network. Steady-state detection is much less demanding than the capture and analysis of transients. Other advantages are that turn-off events can be recognized and that, when two appliances are on at the same time, the sum of their signatures can be analyzed. Hence, steady-state signatures were used by Hart for the prototype presented in [1]. Since then, steady-state signatures have been used by several authors, mainly for residential load monitoring systems. In [4,5,6] discrete changes in the active and reactive power are analyzed, while [7] uses only the active power. Nevertheless, some limitations can be pointed out, such as the impossibility of distinguishing two different appliances with the same steady-state signature. The low sampling rate can also be considered a disadvantage: sequences of loads turned on within a period shorter than the sampling interval cannot be identified. To overcome these limitations, transient signatures, which result from the noise in the electrical signal caused by switching an appliance on or off, can be used. Yet transient identification requires a high sampling rate. Since both steady-state and transient signatures have their own limitations, studying a joint ID that considers both is interesting. This was considered very recently by Chang et al. in [8]. The following section describes our approach.
3 Steady-States (StS) Recognition: Proposed Approach
Electrical signatures are the main component of a NILM system. Individual load identification usually relies on transient and steady-state (StS) signatures. Due to the high sampling frequency needed for the former, residential NILM systems typically use the latter. However, one of the drawbacks of StS is that distinct appliances can present very similar signatures. In fact, the step-changes in the active power alone provide little information, which may lead to
an incorrect identification. This paper studies the incorporation of further signal information in order to enrich the electric profile of each appliance, namely from the reactive power signal and power factor measurements. The first step in the definition of a StS is the recognition of a stable value sequence in the sampled signal. In [9] the authors presented a method for the identification of a steady-state signature based on ratios between rectangular areas defined by the successive state values. The method allows for the identification of a complete steady-state, i.e., when the StS begins and when it ends. The approach is based on the difference between the rectangular area produced by aggregating a new sample and the one already defined by the previous values in the stable state. However, this approach can be simplified by keeping only the extreme values already in the stable state and testing the new sample value against them. This improvement is described in the following. The new result was implemented in order to extract features from power signals: the active power, reactive power and power factor signals.

3.1 A Rule for Steady-States Recognition
A sequence of consecutive samples is regarded as a stable state if the difference between any two samples of the sequence does not exceed a given tolerance value. The minimum number of consecutive samples needed to identify a stable state depends on the sampling frequency: when this is low, a small number of samples is enough; otherwise a larger number is needed. For instance, with a sampling frequency of 1 Hz, the minimum number of samples can be defined as three, which is the one used in [1], where other methods for steady-state recognition are proposed (namely, filtering, differentiating and peak detection).

Consider a sequence of n consecutive sampling values, $Y = \{y_i,\ i = 1, \dots, n\}$, already identified as a steady-state. By definition, $|y_i - y_j| \le \varepsilon$, $\forall i, j = 1, \dots, n$ with $i \ne j$, where $\varepsilon > 0$ is the defined tolerance. Let $y_M = \max\{y_i\}$ and $y_m = \min\{y_i\}$, $i = 1, \dots, n$, be the maximum and minimum values of Y, respectively, and let $y_r$ ($r = n + 1$) be the next sample value. Next we prove that $y_r$ maintains the stable behavior of Y only for a limited range of values.

Theorem 1. Under the conditions above, the n + 1 consecutive values form a steady-state iff $y_M - \varepsilon \le y_r \le y_m + \varepsilon$, i.e., $|y_i - y_j| \le \varepsilon$, $\forall i, j = 1, \dots, n + 1$.

Proof. In fact, if $y_m \le y_r \le y_M$, then $|y_i - y_r| \le |y_m - y_M| \le \varepsilon$ for all $i = 1, \dots, n$. Consider now that $y_M < y_r \le y_m + \varepsilon$. For any $y_i \in [y_m, y_M]$, $i = 1, \dots, n$, we have $|y_i - y_r| \le |y_m - y_r| \le |y_m - (y_m + \varepsilon)| = \varepsilon$. Thus, the sequence of the n + 1 values, $y_i$, $i = 1, \dots, n + 1$, forms a steady-state with a new maximum value: $y_M = y_{n+1} = y_r$.
[Figure: number line with the marks min − ε, max − ε, min, max, min + ε and max + ε; y_r ∈ Y for values between max − ε and min + ε, y_r ∉ Y outside that range.]
Fig. 2. Range of acceptable values for inserting yr in a previous identified StS
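The acceptance range of Theorem 1 (and Fig. 2) lends itself to a one-pass detector that stores only the running minimum and maximum of the current state. The following is a minimal sketch, not the authors' implementation: the function names, the `min_len` parameter, and the use of the state mean as its representative value are assumptions made for illustration, and the signal values are hypothetical.

```python
def find_steady_states(samples, tol, min_len=3):
    """Return (start, end, mean) for each steady run; `end` is inclusive."""
    states = []
    if not samples:
        return states
    start = 0
    lo = hi = samples[0]  # running minimum and maximum of the current run
    for r in range(1, len(samples) + 1):
        # Theorem 1: y_r extends the run iff  max - tol <= y_r <= min + tol.
        if r < len(samples) and hi - tol <= samples[r] <= lo + tol:
            lo = min(lo, samples[r])
            hi = max(hi, samples[r])
            continue
        # The run ended at sample r - 1; keep it only if it is long enough.
        if r - start >= min_len:
            run = samples[start:r]
            states.append((start, r - 1, sum(run) / len(run)))
        if r < len(samples):
            start = r
            lo = hi = samples[r]
    return states


def step_changes(states):
    """Differences between consecutive steady-state levels (assumed: means)."""
    return [b[2] - a[2] for a, b in zip(states, states[1:])]


# Hypothetical active-power samples (W): idle, switch-on transient, running, idle.
signal = [2.0, 2.1, 1.9, 2.0, 60.0, 120.2, 120.0, 119.8, 120.1, 2.0, 2.1, 2.0]
tol = 1.0
states = find_steady_states(signal, tol)
steps = step_changes(states)
print(states)
print(steps)
```

Each new sample costs two comparisons, in contrast to checking it against every earlier sample of the state; a positive step then corresponds to a switch-on and a negative one to a switch-off.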
If we assume that $y_M - \varepsilon \le y_r < y_m$, then, using similar reasoning, we prove that $y_r = y_{n+1}$ maintains the value stability of the state, and the steady sequence $y_i$, $i = 1, \dots, n + 1$, has a new minimum value: $y_m = y_{n+1} = y_r$. In all other cases, that is, $y_r < y_M - \varepsilon$ or $y_r > y_m + \varepsilon$, $y_r$ does not belong to the steady-state Y since it falls outside the tolerance. Let us consider $y_r < y_M - \varepsilon$. Then (Figure 2), $y_r < y_M - \varepsilon \le y_m \le y_i \le y_M$. Hence, $|y_r - y_M| > |y_M - \varepsilon - y_M| = \varepsilon$. The remaining case can be proved similarly.

In conclusion, a consecutive sample point $y_r$ belongs to the immediately preceding steady-state if $y_M - \varepsilon \le y_r \le y_m + \varepsilon$, where $y_m$ and $y_M$ are the minimum and the maximum values in the state. Otherwise, the previous sample is considered the end of the steady-state. When all the samples of the signal have been tested for steady-state identification, the method ends by computing the differences between consecutive states, and a feature vector is built.

3.2 Defining a Signature
As mentioned, a signature composed only of the changes in the active power provides little information for accurate appliance recognition. The active power (also known as real power) represents the power that is being consumed by the appliances. However, two other electrical parameters can also be used: the apparent power and the reactive power. In a simple alternating current circuit, current and voltage are sinusoidal waves that, depending on the load in the circuit, may or may not be in phase. For a resistive load the two curves are in phase, and multiplying their values at each instant produces a positive result. This indicates that only real power is transferred in the circuit. In the case of a purely reactive load, current and voltage are ninety degrees out of phase, which means that only reactive energy exists. In practice, resistive, inductive and capacitive loads will occur, so both real and reactive power will exist. Finally, the product of the root-mean-square voltage and current values represents the apparent power. The real, the reactive and the apparent powers are measured in watts (W), volt-amperes reactive (VAR) and volt-amperes (VA), respectively. See an example for a 20” LCD in Figure 3. The relation between the three parameters is given by $S^2 = P^2 + Q^2$, where S, P and Q represent the apparent, the active and the reactive powers, respectively. The apparent and the real powers are also connected by the power factor. The latter constitutes an efficiency measure of a power distribution system and is computed as the ratio between real power and apparent power in a circuit. It varies between 0, for a purely reactive load, and 1, for a purely resistive load.
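The relations above are enough to derive the apparent power, the magnitude of the reactive power and the power factor from RMS voltage, RMS current and active power. The following is a minimal sketch under that reading; the function names and the numeric values are hypothetical, and the sign of Q (inductive versus capacitive load) cannot be recovered from these quantities alone.

```python
import math


def apparent_power(v_rms, i_rms):
    """S in VA: product of RMS voltage and RMS current."""
    return v_rms * i_rms


def reactive_power(p, s):
    """|Q| in VAR from S^2 = P^2 + Q^2 (clamped against rounding noise)."""
    return math.sqrt(max(s * s - p * p, 0.0))


def power_factor(p, s):
    """Ratio of real to apparent power; 0 is returned when no power flows."""
    return p / s if s > 0 else 0.0


# Hypothetical reading: 230 V, 0.5 A, 92 W of active power.
s = apparent_power(230.0, 0.5)   # 115 VA
q = reactive_power(92.0, s)      # 69 VAR
pf = power_factor(92.0, s)       # 0.8
print(s, q, pf)
```

Conversely, when the meter delivers the power factor directly, S follows from P divided by the power factor, and Q follows from the same identity.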
Fig. 3. Active, reactive power and power factor for a LCD of 20”
4 Computational Experiments

4.1 Data Collection and Feature Extraction
The data, namely the active power, voltage, current and power factor signals, were acquired using a sensing meter prototype provided by ISA-Intelligent Sensing Anywhere [10]. However, this prototype has a severe limitation for monitoring several parameters: only one parameter value can be supplied at each point in time. This implies a delay between the values of different parameter types. Another shortcoming is that measurement errors can occasionally occur, in which case the expected value is not delivered. To evaluate the effectiveness of the composed signatures, data from several electrical appliances were acquired, with a delay of 100 milliseconds between the samples of the different parameters. The parameter data types are: active power, current, voltage and power factor. Therefore, the interval between consecutive samples of the same data type was 400 milliseconds. The data for each appliance were collected in four steps: a) for 10 to 15 seconds, signal samples are acquired without the appliance being plugged into the socket; b) the device is plugged in and samples are collected for 15 seconds; c) the appliance is switched on and runs for a period of 1 minute¹; and d) the appliance is switched off, after which a 15-second sampling period follows. For each of the appliances the process was repeated fifty times. The devices chosen for the experiments were: a microwave, a coffee machine, a toaster, an incandescent lamp and two LCDs (from the same manufacturer but different models). In order to proceed with StS identification, Theorem 1 from Section 3 was implemented, yielding a recognition algorithm for processing the collected signals. For each of the appliances it is possible to identify three steady-states: a stable signal before the device is switched on, another StS corresponding to the appliance's operation phase, and a last one occurring after switching off. In fact, one LCD in particular presented four different states: it was possible to identify the steady-state related to the standby mode. For each of the four measured parameters, the difference between the identified steady-states was calculated such that a positive/negative value was associated with switching on/off, respectively.

¹ For the coffee machine, the running time is less than a minute, corresponding to the time needed for an espresso.

4.2 Feature Classification Methods and Multi-evaluation Metrics
To assess the performance of the composed signature, the features for the six-class problem associated with the switch-on events were normalized. Classification was performed using Support Vector Machines (SVM) and the 5-Nearest Neighbors (5-NN) method. The SVM was developed for solving binary classification problems; nevertheless, two main approaches to solving a multi-class problem can be found in the related literature: one-against-all and one-against-one [11]. In the first technique, a binary problem is defined by setting each class against the remaining ones. This implies that m binary classifiers are applied (m > 2 is the number of different classes). In the second, m(m−1)/2 binary classifiers are employed, one for each pair of classes. For a given sample, a vote is carried out among the classifiers and the class obtaining the maximum number of votes is assigned to it. This last approach is supplied in LIBSVM [12], a package available to implement SVM classification. A similar package is SVMLight [13], which instead uses the multi-class formulation described in [14] and the algorithm based on Structural SVMs [15] to perform multi-class classification. To classify the composed electrical signatures, the one-against-all tactic was implemented using the SVM and 5-NN methods. For the SVM, the linear kernel and the radial basis function (RBF) kernel with scaling factor σ = 1 were used. For the multi-class classification, the available SVMLight implementation was chosen. A 3-fold cross-validation was applied to the data set in order to evaluate the test performance. The results are reported in Table 1. To assess the test performance, accuracy, macro-average and micro-average were used.
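The one-against-all decomposition described above can be sketched with a nearest-neighbour scorer. This is an illustration only: the helper names are hypothetical, the toy feature vectors are invented, and combining the binary problems by taking the class with the highest positive-neighbour fraction is an assumption (the paper evaluates each binary problem separately).

```python
import math


def knn_binary_score(train_x, train_is_pos, x, k=5):
    """Fraction of the k nearest training points labelled positive."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    nearest = order[:k]
    return sum(train_is_pos[i] for i in nearest) / len(nearest)


def one_against_all_predict(train_x, train_y, x, k=5):
    """Solve one binary problem per class (class vs. rest), keep the best score."""
    classes = sorted(set(train_y))
    scores = {
        c: knn_binary_score(train_x, [y == c for y in train_y], x, k)
        for c in classes
    }
    # Ties are broken in favour of the first class in sorted order.
    return max(classes, key=lambda c: scores[c])


def n_binary_classifiers(m):
    """Number of binary classifiers each decomposition needs for m classes."""
    return {"one-against-all": m, "one-against-one": m * (m - 1) // 2}


# Toy, hypothetical normalized feature vectors for two appliance classes.
train_x = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
train_y = ["a", "a", "a", "b", "b", "b"]
pred_a = one_against_all_predict(train_x, train_y, (0.5, 0.5), k=3)
pred_b = one_against_all_predict(train_x, train_y, (10.5, 10.5), k=3)
counts = n_binary_classifiers(6)
print(pred_a, pred_b, counts)
```

For the six appliances considered here, one-against-all needs six binary problems, whereas one-against-one would need fifteen.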
For the latter two, the F-measure performance is calculated in two different ways: a) the mean value of the F-scores computed for each binary problem (macro-average); b) the global F-measure, calculated from a global confusion matrix obtained by summing the confusion matrices of all the binary problems (micro-average). The F-score is a measure of a test's accuracy which combines the recall (R) and the precision (P) values of the test. The general formula is $F_\beta = \frac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}$. In this paper β = 1 is used, i.e., $F_1$ is the harmonic mean of the precision and recall. In order to evaluate a binary decision task, we first define a contingency matrix representing the possible outcomes of the classification, namely the True Positives (TP, positive examples classified as positive), the True Negatives (TN, negative examples classified as negative), the False Positives (FP, negative examples classified as positive) and the False Negatives (FN, positive examples classified as negative). The recall is defined as $TP/(TP + FN)$ and the precision as $TP/(TP + FP)$.
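The two averaging schemes just described differ only in where the confusion matrices are combined. The following is a minimal sketch with hypothetical counts; the function names are illustrative, and returning `None` when precision or recall is undefined (and propagating it to the macro-average) is an assumption made to mirror the "n.d." entries of Table 1.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-score from confusion counts; None when precision or recall is undefined."""
    if tp + fp == 0 or tp + fn == 0:
        return None
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return None
    return (1 + beta ** 2) * precision * recall / denom


def macro_micro_f(confusions, beta=1.0):
    """confusions: one (TP, FP, FN, TN) tuple per binary problem."""
    scores = [f_beta(tp, fp, fn, beta) for tp, fp, fn, _tn in confusions]
    # Macro-average: mean of the per-problem F-scores (undefined if any is).
    macro = None if None in scores else sum(scores) / len(scores)
    # Micro-average: F-score of the summed confusion matrix.
    tp_sum = sum(c[0] for c in confusions)
    fp_sum = sum(c[1] for c in confusions)
    fn_sum = sum(c[2] for c in confusions)
    micro = f_beta(tp_sum, fp_sum, fn_sum, beta)
    return macro, micro


# Hypothetical confusion counts (TP, FP, FN, TN) for two binary problems.
confusions = [(10, 30, 0, 260), (50, 0, 0, 250)]
macro, micro = macro_micro_f(confusions)
print(macro, micro)
```

In this toy case the two averages differ because the micro-average pools the counts before computing precision and recall, whereas the macro-average gives each binary problem equal weight.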
Table 1. The mean accuracies and F-score values for the tests performed using one-against-all SVM (linear and RBF kernels) and one-against-all 5-NN

                   SVM one-against-all                                       5-NN one-against-all
                   Linear                       RBF
                   F1 (%)        Acc. (%)       F1 (%)        Acc. (%)       F1 (%)        Acc. (%)
Incandescent bulb  95.2 ± 1.7    98.1 ± 0.0     97.1 ± 2.8    98.5 ± 0.1     99.4 ± 1.1    99.7 ± 0.6
Lcd 22             n.d.          83.2 ± 0.1     49.1 ± 21.2   89.6 ± 1.2     99.4 ± 0.0    99.0 ± 0.0
Lcd 32             96.0 ± 0.0    99.3 ± 0.5     95.9 ± 1.9    98.4 ± 0.4     96.0 ± 4.6    97.9 ± 0.8
Microwave          96.7 ± 1.6    98.7 ± 0.3     97.9 ± 3.6    99.4 ± 0.5     96.8 ± 5.6    98.1 ± 1.0
Toaster            97.2 ± 2.8    99.2 ± 0.3     98.2 ± 3.2    99.8 ± 0.4     100.0 ± 0.0   100.0 ± 0.0
Coffee Machine     n.d.          99.2 ± 0.3     86.9 ± 4.9    99.8 ± 0.9     98.0 ± 1.2    100.0 ± 0.0
Average            n.d.          96.3 ± 6.4     87.5 ± 19.3   97.5 ± 4.1     98.3 ± 1.6    99.1 ± 0.9
Micro-average      76.8                         90.0                         98.9

4.3 Evaluation Results
For each of the one-against-all tests (SVM and 5-NN), the F-scores and mean accuracies are given in Table 1. For a global evaluation, the macro-averages (mean values of the F-scores), micro-averages and mean accuracies are also presented. As can be observed, the one-against-all approaches are quite effective. On average, the accuracy is around 96% for the linear SVM, 97% for the RBF SVM and 99% for the 5-NN. In contrast, the multi-class method presents very low accuracy values: around 40%. Micro and macro averages are measures only applied to binary problems. Therefore, in order to compare the SVMLight multi-class test results, micro-averages were computed for the obtained classifications. For that, the results of the six one-against-all binary problems were derived from the multi-class classification results, together with the respective confusion matrices and F-measures. With respect to the accuracy, the results for the linear and RBF kernels were 40.57 ± 8.55% and 40.57 ± 8.55%, respectively. The micro-average values, computed as previously described, were 40.67% for both kernels. Notice that accuracy values provide only global information: high accuracy is not necessarily related to a precise identification of true positives. In fact, micro-averages supply specific information about the samples classified as positive, where the multi-class classification scored badly. This may result from the fact that, in the test data sets of the binary problems, the number of samples that belong to the class under test is smaller than the number of remaining ones. Therefore, the number of TN will probably be greater than the number of TP. Indeed, cases where no sample is labelled as TP may occur, and then the F-score cannot be defined. In our case, the accuracy of the multi-class SVM is low, as is the respective micro-average, while for the remaining tests both performance metrics are high.
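The gap between accuracy and F-score in an unbalanced one-against-all problem can be made concrete with a degenerate classifier that labels every sample as negative. The following is a small sketch, assuming six equally sized classes with fifty samples each (one switch-on sample per repetition, which is an assumption rather than a figure stated in the paper); the function names are illustrative.

```python
n_classes = 6
per_class = 50  # assumed: one switch-on sample per repetition and appliance


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); treated as undefined when there are no TP."""
    if tp == 0:
        return None
    return 2 * tp / (2 * tp + fp + fn)


# A classifier that never predicts the positive class.
tp, fp = 0, 0
fn = per_class
tn = (n_classes - 1) * per_class

acc = accuracy(tp, fp, fn, tn)
f1 = f1_score(tp, fp, fn)
print(f"accuracy = {acc:.1%}, F1 = {f1}")
```

Under these assumptions the accuracy is five sixths, about 83%, while the F-score is undefined. This is at least consistent with the Lcd 22 row of Table 1 for the linear SVM (83.2% accuracy, F1 n.d.), although the paper does not state that this is what happened.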
For both one-against-all methods, the good performance indicates that the composed signature can be an accurate description of each of the appliances in the database. Nevertheless, these findings may be related to the fact that the number of electrical appliances used is still very small. Moreover, all of these appliances have distinct loads, with the exception of the LCD devices. Taking a closer look at the multi-class classification tests, the incandescent bulb was misclassified as the toaster, and the LCDs and the coffee machine as the microwave. To overcome this limitation, other methods for multi-class classification can be studied, such as neural networks or a hybrid approach, or more features can be added to the signature, for instance, information related to the transient signals.
5 Conclusions and Future Work
The project for the deployment of Smart Energy Grids requires automatic solutions for the identification of electrical appliances. Implementing in-home activity modeling and recognition relies on cheap and inconspicuous recognition systems. The most suitable solution for both problems is a NILM system. Moreover, such a scheme can also be used as a household electrical management system. To implement a NILM system based on the sampling of power at a single point, feature extraction techniques and classification methods are needed to detect and to separate individual loads. This can only be accomplished through the definition of an effective electrical signature. This work begins by presenting an approach to identify the step-changes of the electrical signal. This strategy was applied to the analysis of the signals acquired for a given data set of appliances, in order to extract features for the definition of an electrical ID for each device. The features use the step-changes in active and reactive powers and power factor. In order to evaluate the proposed approach, we used SVM and 5-NN one-against-all classification tests as well as SVM multi-class classification tests. The results show that the simplest methods are able to accurately tackle the recognition issue. This work constitutes an experimental test case study for a composed steady-state signature. Future work will acquire more steady-state IDs in order to increase the data set and perform more ambitious tests. The incorporation of the transient pattern associated with each appliance in the signature is under study. The first problem to overcome is the very high sampling frequency required to obtain transients, which is not easy to achieve unless a specific sensing device is developed. Another research question that needs to be answered addresses the consumption variations of a device operating in its intermediate state.
The proper association of this variation with the respective device can bring added value to the analysis of the information provided by a NILM system.
Acknowledgments. The authors would like to thank ISA for the collaboration and the iTeam project for the support grant.
References
1. Hart, G.W.: Nonintrusive appliance load monitoring. Proc. of the IEEE 80, 1870–1891 (1992)
2. Sultanem, F.: Using appliance signatures for monitoring residential loads at meter panel level. IEEE Transactions on Power Delivery 6, 1380–1385 (1991)
3. Leeb, S.B.: A conjoint pattern recognition approach to nonintrusive load monitoring. PhD thesis, Massachusetts Institute of Technology (1993)
4. Cole, A., Albicki, A.: Data extraction for effective non-intrusive identification of residential power loads. In: Instrumentation and Measurement Technology Conf., IMTC 1998. Conf. Proc. IEEE, vol. 2, pp. 812–815 (1998)
5. Cole, A., Albicki, A.: Algorithm for non intrusive identification of residential appliances. In: Proc. of the 1998 IEEE Intl. Symposium on Circuits and Systems, ISCAS 1998, vol. 3, pp. 338–341 (1998)
6. Berges, M., Goldman, E., Matthews, H.S., Soibelman, L.: Learning systems for electric consumption of buildings. In: ASCE Intl. Workshop on Computing in Civil Engineering, Austin, Texas (2009)
7. Bijker, A., Xia, X., Zhang, J.: Active power residential non-intrusive appliance load monitoring system. In: AFRICON 2009, pp. 1–6 (2009)
8. Chang, H.H., Lin, C.L., Lee, J.K.: Load identification in nonintrusive load monitoring using steady-state and turn-on transient energy algorithms. In: 2010 14th Intl. Conf. on Computer Supported Cooperative Work in Design, pp. 27–32 (2010)
9. Figueiredo, M., de Almeida, A., Ribeiro, B., Martins, A.: Extracting features from an electrical signal of a non-intrusive load monitoring system. In: Fyfe, C., Tino, P., Charles, D., Garcia-Osorio, C., Yin, H. (eds.) IDEAL 2010. LNCS, vol. 6283, pp. 210–217. Springer, Heidelberg (2010)
10. ISA Intelligent Sensing Anywhere, S.: Isa intelligent sensing anywhere (2009), http://www.isasensing.com/ [Online; accessed 18-October-2010]
11. Fauvel, M., Chanussot, J., Benediktsson, J.: Evaluation of kernels for multiclass classification of hyperspectral remote sensing data. In: Proc. of IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, vol. 2, pp. 813–816. IEEE, Los Alamitos (2006)
12. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001), Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
13. Joachims, T.: Making large-scale SVM learning practical. In: Schölkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge (1999)
14. Crammer, K., Singer, Y., Cristianini, N., Shawe-Taylor, J., Williamson, B.: On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research 2, 265–292 (2001)
15. Tsochantaridis, I., Hofmann, T., Joachims, T., Altun, Y.: Support vector machine learning for interdependent and structured output spaces. In: ICML 2004: Proc. of the twenty-first Intl. Conf. on Machine learning, p. 104. ACM, New York (2004)
Evaluation of a Resource Allocating Network with Long Term Memory Using GPU
Bernardete Ribeiro1,2, Ricardo Quintas2, and Noel Lopes2
1 Department of Informatics Engineering, University of Coimbra, Portugal
2 CISUC - Center for Informatics and Systems of University of Coimbra, Portugal
Abstract. Incremental learning has recently received broad attention in many applications of pattern recognition and data mining. In many real-world incremental learning situations a fast response to changing data is necessary, so a parallel implementation on fast processing units can have a great impact on many applications. Current research on incremental learning methods employs a modified version of a resource allocating network (RAN), which is one variation of a radial basis function network (RBFN). This paper evaluates the impact of a Graphics Processing Unit (GPU) based implementation of a RAN incorporating Long Term Memory (LTM) [4]. The incremental learning algorithm is compared with the batch RBF approach in terms of accuracy and computational cost, both in sequential and GPU implementations. The UCI machine learning benchmark datasets and a real-world problem of multimedia forgery detection were considered in the experiments. The preliminary evaluation shows that although the creation of the model is faster with the RBF algorithm, the RAN-LTM can be useful in environments that need fast-changing models and involve high-dimensional data.
Keywords: Incremental Learning, GPU Computing.
1 Introduction
The amount of data available on the Internet appears to be growing exponentially with time. In addition, the complexity of data created by non-stationary underlying processes poses many challenges in the machine learning area. To extract relevant information humans need help from methods based on incremental learning, where neural networks can be optimal in many application domains. The most promising strategy for incremental learning is the memory-based learning approach, where almost all training samples are stored in memory and then used in each learning step [5]. In many incremental learning problems this strategy is useless because the number of training samples is not known in advance. To overcome this limitation, a Resource Allocating Network with Long-Term Memory (RAN-LTM) has been proposed in [4]. In RAN-LTM, not only training data but also memory items stored in the long-term memory are trained. In some of the tasks involved, computation can be rather intensive and time consuming.

With the release of friendly frameworks to program Graphics Processing Units (GPUs), many applications which needed high processing power found new ways to speed up their execution. One field that has greatly benefited from this technical progress is machine learning. CUDA (Compute Unified Device Architecture) and its C-like language interface enabled parallel implementations of neural network algorithms, easing computation that is heavily data-dependent. In this study, we compare two different learning strategies, batch and incremental learning, exploiting the high-performance SIMD architecture of GPU computing. For testing we run the experiments using the UCI machine learning repository and high-dimensional data from a real-world problem of audio steganalysis. In this problem the aim is to detect hidden messages which are embedded in audio WAV files. While traditional methods commonly build a static steganalysis model unable to adapt to new behavior patterns, adaptive detection models with self-learning ability dynamically update to new, changing data. The results have shown that the GPU-based RAN-LTM has lessened the computational costs for audio forgery detection.

The paper is organized as follows. Section 2 describes the incremental learning with long term memory algorithm (RAN-LTM), and briefly presents the tailored kernels needed for GPU computing. In Section 3 we introduce the experimental setup. The results are discussed and analyzed in the same section, taking into account each algorithm on both platforms, CPU and GPU. Finally, Section 4 summarizes the conclusions and points out lines for future work.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 41–50, 2011. © Springer-Verlag Berlin Heidelberg 2011
2 Incremental Learning
Incremental learning is an important technique, especially in today's environments where a fast response to changing data is necessary. One algorithm that follows the incremental learning model and uses RBF units in its hidden layer is the Resource Allocating Network (RAN) [5]. The network learns by allocating new units and adjusting the parameters of existing units. If the network performs poorly on a presented pattern, then a new unit is allocated that corrects its response. Thus, the units in this network respond to only a local region of the space of input values. One variation of this algorithm has been investigated [6,4] to prevent catastrophic interference [2], which occurs when new training disrupts existing memory. A Long-Term Memory (LTM) is added, which has proven to perform well in incremental learning environments. The RAN-LTM algorithm automatically allocates RBF units in the hidden layer on an online basis. The long-term memory also stores samples from the training data, so that the weights can be updated without losing generalization capabilities. The samples stored in LTM are called memory items, and each corresponds to an input-output pair of the training data [6].
[Figure: input layer (x1 to x4), hidden layer and output layer (z1, z2), connected to a Long-Term Memory through "Retrieve & Learn" and "Generate & Store" links.]
Fig. 1. RAN-LTM network architecture with (I = 4, J = 5, K = 2)
2.1 Resource Allocating Network with Long Term Memory
We follow the notation given in [6]. Let us denote the input vector $x = \{x_1, x_2, \dots, x_I\}^T$, the vector of RBF outputs $y = \{y_1, y_2, \dots, y_J\}^T$ and the network output vector $z = \{z_1, z_2, \dots, z_K\}^T$, respectively, for I inputs, J RBF outputs, and K network outputs. The RAN-LTM outputs are computed as follows:

$$y_j = \exp\left(-\frac{\|x - c_j\|^2}{\sigma_j^2}\right) \quad (j = 1, \dots, J) \qquad (1)$$

$$z_k = \sum_{j=1}^{J} w_{kj} y_j + b_k \quad (k = 1, \dots, K) \qquad (2)$$

where $c_j = \{c_{j1}, \dots, c_{jI}\}^T$ and $\sigma_j$ are, respectively, the center and the width of the jth RBF, $w_{kj}$ is the connection weight from the jth unit to the kth output and $b_k$ is the bias of output k. The items in the LTM (see Figure 1) correspond to representative input-output pairs selected from the training data. The procedure retrieves these pairs when learning a new training sample in order to suppress catastrophic interference.

The training of the RAN-LTM network is divided into two phases: (i) the allocation of RBFs and (ii) the calculation of the weights. The calculation of the weights $W = \{w_{jk}\}$ is similar to that of the standard RBFN, except that instead of the complete target training vector t, only the targets T of the training samples stored in the LTM and the target d of the sample being trained are used. Therefore, to minimize the errors one needs to solve $\Phi W = Z$, where Z is the matrix whose column vectors correspond to the target of the sample being trained and the targets of the M stored memory items. In order to solve for W, Singular Value Decomposition (SVD) is used.

To calculate the width we use the same heuristic rule as in [6]. Initially a maximum value $\sigma_{max}$ is set using the training data x and targets d:

$$\sigma_{max} = \mathrm{median}_i \left\{ \min_j \left( \|x_i - x_j\| \right) \right\} \quad \text{for } d_i = d_j \qquad (3)$$

Subsequently, the widths are updated whenever a new RBF unit J is added, and adjusted as follows:

$$\sigma_J = \min \left\{ \min_j \left( \|c_J - c_j\| \right), \sigma_{max} \right\} \qquad (4)$$

$$\sigma_j = \min \left( \|c_J - c_j\|, \sigma_j \right) \quad (j = 1, \dots, J - 1) \qquad (5)$$
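As an illustration of Eqs. (1), (2), (4) and (5), the following minimal Python sketch evaluates the network for one input and updates the widths after a new unit has been allocated. The function names and the list-based data layout are assumptions made for this example only; the paper's implementation is written in CUDA and is not reproduced here.

```python
import math

def rbf_outputs(x, centers, widths):
    """Eq. (1): y_j = exp(-||x - c_j||^2 / sigma_j^2) for every hidden unit j."""
    return [
        math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / (s ** 2))
        for c, s in zip(centers, widths)
    ]

def network_outputs(y, weights, biases):
    """Eq. (2): z_k = sum_j w_kj * y_j + b_k, where weights[k][j] holds w_kj."""
    return [sum(w * yj for w, yj in zip(row, y)) + b
            for row, b in zip(weights, biases)]

def update_widths(centers, widths, sigma_max):
    """Eqs. (4)-(5). The last entry of `centers` is the newly allocated unit J;
    `widths` holds the widths of the J - 1 older units. Returns J widths."""
    new_center = centers[-1]
    dists = [math.dist(new_center, c) for c in centers[:-1]]
    # Eq. (4): width of the new unit, capped by sigma_max.
    sigma_new = min(min(dists), sigma_max) if dists else sigma_max
    # Eq. (5): older units never become wider than their distance to the new one.
    shrunk = [min(d, s) for d, s in zip(dists, widths)]
    return shrunk + [sigma_new]
```

For example, `update_widths([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]], [10.0, 2.0], 4.0)` shrinks the first width to 1.0, leaves the second at 2.0, and gives the new unit a width of 1.0.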
Algorithm 1 describes the method used to implement a Resource Allocating Network with Long Term Memory (RAN-LTM) [5].

Algorithm 1. RAN-LTM
for all x_i ∈ X do
    z ← Σ_j w_j ϕ_j(||x_i − c_j||) + b
    k ← argmin_j D(c_j, x_i)
    E ← d − z
    if E > ε and ||x_i − c_k|| > γ then
        Allocate new unit: c_new ← x_i, w_new ← E
        Update widths
        Store memory item: I_new ← x_i, T_new ← d
        Increment number of memory items: M ← M + 1
    else
        Update weights using memory items
        z ← Σ_j w_j ϕ_j(||x_i − c_j||) + b
        k ← argmin_j D(c_j, x_i)
        E ← d − z
        if E > ε then
            Allocate new unit: c_new ← x_i, w_new ← E
            Update widths
            Store memory item: I_new ← x_i, T_new ← d
            Increment number of memory items: M ← M + 1
        end if
    end if
end for
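The control flow of Algorithm 1 can be sketched in plain Python as follows. This is a hedged, scalar-output illustration, not the authors' code: the function name `ran_ltm_train`, the use of |E| in the threshold test, the small width floor, and above all the weight update (a few LMS gradient steps over the memory items, standing in for the SVD-based solve described in the text) are assumptions of this sketch.

```python
import math

def ran_ltm_train(samples, targets, eps, gamma, sigma_max, lms_steps=50, lr=0.1):
    """Scalar-output sketch of Algorithm 1 (assumed simplifications, see text)."""
    MIN_WIDTH = 1e-6  # assumed guard against zero widths, not from the paper
    centers, widths, weights, bias = [], [], [], 0.0
    memory = []  # long-term memory: (input, target) pairs

    def phi(x):
        # RBF activations of all hidden units for input x (Eq. 1).
        return [math.exp(-math.dist(x, c) ** 2 / s ** 2)
                for c, s in zip(centers, widths)]

    def predict(x):
        return sum(w * a for w, a in zip(weights, phi(x))) + bias

    def allocate(x, d, err):
        dists = [math.dist(x, c) for c in centers]
        # Eq. (5): shrink older widths; Eq. (4): width of the new unit.
        widths[:] = [max(min(dist, s), MIN_WIDTH) for dist, s in zip(dists, widths)]
        new_width = min(min(dists), sigma_max) if dists else sigma_max
        widths.append(max(new_width, MIN_WIDTH))
        centers.append(list(x))
        weights.append(err)          # w_new <- E
        memory.append((list(x), d))  # store memory item

    def update_weights(x, d):
        # Stand-in for the SVD solve: LMS steps over memory items plus (x, d).
        nonlocal bias
        batch = memory + [(x, d)]
        for _ in range(lms_steps):
            for xm, dm in batch:
                acts = phi(xm)
                err = dm - (sum(w * a for w, a in zip(weights, acts)) + bias)
                for j, a in enumerate(acts):
                    weights[j] += lr * err * a
                bias += lr * err

    for x, d in zip(samples, targets):
        err = d - predict(x)
        nearest = min((math.dist(x, c) for c in centers), default=math.inf)
        if abs(err) > eps and nearest > gamma:
            allocate(x, d, err)
        else:
            update_weights(x, d)
            err = d - predict(x)
            if abs(err) > eps:
                allocate(x, d, err)
    return centers, widths, weights, bias, memory
```

With the three one-dimensional samples `[[0.0], [1.0], [0.0]]` and targets `[1.0, -1.0, 1.0]`, the first two samples each allocate a unit, while the third is already well approximated and only triggers a weight update.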
2.2 Parallel Implementation
One of the most important (and basic) units in a CUDA program is the thread; threads are executed in kernels. To parallelize the RBFN and RAN-LTM algorithms, the main task was to define and implement the kernels for each algorithm.

RBFN Network Kernels
1. KernelActivationMatrix. Calculates the activation between the samples and the hidden units. The threads are organized as in the standard matrix multiplication, one thread for each element of the matrix.
2. Weights calculation. Creates the pseudoinverse of the activation matrix using CULATools SVD, and performs multiplications to obtain the final matrix with CUBLAS routines.
3. Adjust Widths. Calculates distances between all centers with one thread for each element of the matrix, then applies the RNeighbours algorithm with one thread for each row and stores the result in an array as new width values.

KMeans. The implementation of KMeans on CUDA is depicted in Figure 2. The following kernels perform the necessary computations for finding the centers.
1. KernelEuclidianDistance. Calculates the Euclidean distance between two matrices. The threads are organized as in the standard matrix multiplication, one thread for each element of the matrix.
2. KernelCenterAttribution. Creates N threads, where N is the number of samples, one thread for each row. The index of the minimum value in the matrix row corresponds to the nearest center.
3. KernelPrepareCenterCopy. Finds the assigned points for each center, with one thread for each center.
4. KernelCopyCenters. Averages all points attributed to a center and replaces old centers.
5. KernelReduce. Compares the assignment array with the previous iteration. Uses a reduction pattern.

2.3 RAN-LTM Kernels
The kernels for the RAN-LTM algorithm have been carefully customized for supporting GPU implementation.
1. KernelCalculateNetworkActivation. Computes the activation of one center to a given sample, one thread per center.
2. KernelSumActivations. Sums up the result of all center activations.
3. KernelFindNearestCenter. Finds the nearest center to a given sample.
4. KernelCalculateError. Calculates the error between the target and the result of the sum of the network activation.
5. KernelUpdateWidths. Updates the widths, one thread for each unit.

The kernel for calculating the weights is similar to the one implemented for the RBFN algorithm. The RAN-LTM algorithm is able to use several parallel constructs; however, there is a huge amount of data transfer from the host to the GPU card for each sample presented. Another issue is that both the error and the minimum distance to the centers must be passed to the host, in order to decide how the algorithm proceeds.
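The paper does not list the kernel source. Assuming that KernelSumActivations and KernelFindNearestCenter follow the same reduction pattern that is mentioned for KernelReduce, the idea can be emulated sequentially in Python as below. The function `tree_reduce` is a hypothetical name for this sketch; each iteration of the inner loop stands for the work one GPU thread would do in a pass.

```python
def tree_reduce(values, combine):
    """Pairwise (tree) reduction over a non-empty sequence.

    Every pass halves the active range, mirroring how the threads of a block
    would combine partial results; log2(n) passes are needed for n values.
    """
    data = list(values)
    n = len(data)
    while n > 1:
        half = (n + 1) // 2
        for i in range(n // 2):  # each i would be handled by one thread
            data[i] = combine(data[i], data[i + half])
        n = half
    return data[0]
```

Summing activations then reads `tree_reduce(activations, lambda a, b: a + b)`, and the nearest center can be found by reducing `(distance, index)` pairs with `min`.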
[Figure: KMeans flow on the GPU. A distance matrix (number of samples × number of centers) is computed from the training data (samples × features) and the centers; each sample is assigned to the center at minimum distance; each center is replaced by the average of its assigned samples; the new assignments are compared with the old assignments via reduction, and the algorithm stops if they are equal.]
Fig. 2. KMeans on GPU
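The KMeans steps of Fig. 2 can be sketched sequentially in Python as follows. The mapping of each step to a kernel name in the comments follows the kernel list of Section 2.2; the function itself, its signature and the handling of empty clusters are assumptions of this sketch rather than the authors' implementation.

```python
import math

def kmeans(samples, centers, max_iter=100):
    """Lloyd-style k-means following the kernel sequence of Section 2.2."""
    centers = [list(c) for c in centers]
    assignment = None
    for _ in range(max_iter):
        # KernelEuclidianDistance: one entry per (sample, center) pair.
        dist = [[math.dist(s, c) for c in centers] for s in samples]
        # KernelCenterAttribution: index of the minimum value in each row.
        new_assignment = [row.index(min(row)) for row in dist]
        # KernelReduce: stop when the assignments equal the previous ones.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # KernelPrepareCenterCopy + KernelCopyCenters: average assigned points.
        for k in range(len(centers)):
            members = [s for s, a in zip(samples, assignment) if a == k]
            if members:  # assumed policy: keep the old center if it has no samples
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment
```

On the GPU the distance matrix and the row-wise minimum are the naturally parallel parts (one thread per matrix element or per row), which is where the sequential loops above would be replaced by kernels.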
3 Experimental Results and Discussion
The datasets, the hardware platforms, and the performance metrics are described first, followed by the discussion of the results.

3.1 Experimental Setup
We ran the algorithms with the UCI [1] machine learning benchmarks downloaded from http://archive.ics.uci.edu/ml/. The datasets in Table 1 were chosen for comparison with the results in [6] with respect to accuracy, while also ensuring the correctness and performance of our implementations of the algorithms.

Table 1. Configuration parameters for RBFN and RAN-LTM models in UCI data

               UCI data                    RBFN                       RAN-LTM
Dataset        Samples  Features  Class    Network Size  RNeighbors   Accuracy  Distance
Satellite      6435     36        6        150           4            0.35      5
BreastCancer   569      30        2        25            2            0.5       5
Vehicle        846      18        4        30            2            0.4       5
Vowel-context  990      13        11       35            3            0.3       3
CMC            1473     9         3        10            2            0.5       10
Iris           150      4         3        6             1            0.4       5
Regarding the case study of audio steganalysis, which aims to detect and recover hidden messages from tampered media, the datasets have been arranged as follows [3]. The original medium (cover) has been imperceptibly modified to embed encrypted messages by using a shared key, and the receiver can extract and decrypt messages from the modified carriers (steganogram). In [3] feature extraction is performed and audio steganograms are created by several signal processing techniques. A total number of 58 features were extracted and stored in 5 files, one for the cover class and the remaining 4 for the stego class. The data set contains 6000 mono, 44.1-kHz, 16-bit, uncompressed PCM-coded WAV audio signal files, covering different types such as digital speech, on-line broadcast in different languages (for instance, English, Chinese, Japanese, Korean, and Spanish), and music (jazz, rock, blues). Each audio file has a duration of 19 s. The stego-audio signal datasets have been built by hiding different messages in the audio signals. For hiding data several algorithms (tools Hide4PGP V4.0, Invisible Secrets, LSB matching and Steghide) were used. These datasets are summarized in Table 2.

Table 2. WAV audio signal datasets: (cover, class2) and (stego, class1)

Filename         ID  Samples  Hiding Algorithm   Class
cover6000mono    -   4390     -                  2
hide4pgp25mono   1   6000     Hide4PGP V4.0      1
invislbe50stero  2   4886     Invisible Secrets  1
lsbmatching50    3   6000     LSB matching       1
steghide1005     4   1003     Steghide           1
steghide993      5   993      Steghide           1

For each algorithm two platforms, CPU and GPU, were used. The testing setup consisted of two GPUs (each with 14 streaming multiprocessors (SM)), the NVIDIA GeForce GTX470 (448 cores, processor clock 1215 MHz) and the NVIDIA GeForce 9800GT (112 cores, processor clock 1500 MHz), and an Intel Core 2 Duo E8400 processor running at 3.0 GHz. The tests were done using the Ubuntu 9.04 operating system with the CUDA Toolkit 3.1 and CULATools 2.0 libraries. The performance metrics were calculated in terms of (i) classification performance (accuracy and F-measure) and (ii) processing times, given by the speedups attained.

3.2 Results and Discussion
For statistical significance we ran the algorithms 30 times and report the mean and the standard deviation of the results. All datasets were scaled with z-score normalization. Normalization is an important data transformation, since it prevents the attributes with initially larger ranges from outweighing the attributes with initially smaller ranges. Table 3 shows the final classification accuracies of RBFN, RAN-LTM and RAN-LTM Tabuchi [6] on the benchmarks tested. In Table 4 the processing times for both the batch learning with RBFN and the incremental learning with RAN-LTM are presented for the UCI benchmarks.

Table 3. Final classification accuracy and F-measure for RBFN, RAN-LTM and RAN-LTM Tabuchi's models. The best accuracy is written in bold.

               RBFN                  RAN-LTM               RAN-LTM Tabuchi [6]
Dataset        Accuracy  F-measure   Accuracy  F-measure   Accuracy
Satellite      97        90          92        76          89.5
BreastCancer   96        96          94        94          96.2
Vehicle        86        71          82        65          76.3
Vowel-context  95        73          90        44          92
CMC            65        51          59        40          48.1
Iris           89        84          94        91          na

Table 4. Performance for both learning models (batch and incremental) on UCI data. Processing time (s), with the standard deviation in parentheses.

               RBFN Centers                             RBFN Weights                             RAN-LTM
Dataset        CPU           9800GT       GTX470        CPU           9800GT       GTX470        CPU              9800GT           GTX470
Satellite      17.35 (2.60)  4.68 (0.53)  0.67 (0.09)   19.45 (2.90)  9.57 (0.04)  8.37 (0.05)   656.98 (117.12)  884.05 (101.34)  803.95 (81.99)
Breastcancer   0.37 (0.01)   0.07 (0.01)  0.02 (0.00)   0.05 (0.00)   0.05 (0.00)  0.04 (0.00)   2.19 (0.26)      6.93 (0.58)      5.36 (0.34)
Vehicle        0.39 (0.03)   0.10 (0.01)  0.03 (0.00)   0.11 (0.00)   0.09 (0.00)  0.09 (0.00)   59.10 (7.22)     95.39 (7.68)     88.73 (6.43)
Vowel-context  0.39 (0.02)   0.11 (0.01)  0.03 (0.00)   0.17 (0.01)   0.16 (0.01)  0.14 (0.01)   13.25 (1.23)     57.75 (3.98)     52.63 (3.54)
CMC            0.34 (0.02)   0.15 (0.03)  0.05 (0.01)   0.03 (0.00)   0.13 (0.01)  0.13 (0.01)   167.99 (31.58)   205.07 (16.85)   178.84 (23.51)
Iris           0.27 (0.02)   0.02 (0.00)  0.01 (0.00)   0.00 (0.02)   0.01 (0.00)  0.01 (0.00)   0.81 (0.03)      1.10 (0.08)      0.90 (0.07)

We observe that for both tasks in RBFN, namely, finding the centers and adjusting the network weights, the GPU takes advantage over the CPU by 44% on the 9800GT device and by around 56% on the GTX470. Notice that these improvements in processing time are averaged over those two tasks. Meanwhile, for the RAN-LTM the GPU times are slightly worse than the CPU times, since these data sets are too small. In Figure 3 we can see that in both algorithms, for smaller network sizes, the CPU presents better results. However, as the network size increases, the GPU starts to get an edge over the CPU, until it finally surpasses it performance-wise. Likewise, comparing the RBFN and RAN-LTM we can observe that in case of an update to the model, the RBFN algorithm would have to rebuild the whole network, taking much more time than the RAN-LTM. Using the GPU for a network size of 100, the RBFN would take approximately 4 seconds, while the RAN-LTM would take a fraction of this time, about 0.045 seconds. Moreover, the classifier performance is competitive for the cases tested.

We present the performance of the RAN-LTM in a real-world application of audio steganalysis. Steganography is the art of concealed writing, where information can be hidden in unsuspected sources, like images, video and audio. We applied our algorithm to the detection of hidden messages in audio files.
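The z-score normalization applied to all datasets in this section can be sketched as follows. The function name is hypothetical, and two details are assumptions of this sketch because the paper does not state them: the population standard deviation is used, and constant features are mapped to zero.

```python
import statistics

def zscore_columns(rows):
    """Scale each feature (column) to zero mean and unit standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]  # assumed: population std
    return [
        # Constant features (std == 0) are mapped to 0.0 to avoid dividing by zero.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]
```

For instance, the rows `[1.0, 100.0]` and `[3.0, 300.0]` both become `[-1.0, -1.0]` and `[1.0, 1.0]`, so the second feature no longer dominates distance computations merely because of its larger range.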
[Figure: processing time (s) versus network size. (a) RBFN, network size increase; series: RBFN KMs CPU, RBFN KMimp CPU, RBFN KMs GTX470, RBFN KMNimp GTX470, RBFN KMs 9800GT, RBFN KMimp 9800GT. (b) RAN-LTM over time; series: RAN-LTM CPU, RAN-LTM GTX470, RAN-LTM 9800GT.]
Fig. 3. Processing times for different network sizes

[Figure: processing time (s) on CPU, GTX470 and 9800GT for the WAV datasets hide4pgp25mono, invislbe50stero, lsbmatching50, steghide1005 and steghide993, each also with "pca 10" and "pca 20" variants.]
Fig. 4. Processing times for RAN-LTM in WAV data files
The results showed accuracies competitive with other algorithms [3], while attaining speedups of up to 15× with the CUDA implementation, as illustrated in Figure 4. The advantage is that the approach may be useful in rapidly changing environments.
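For reference, the speedup figures quoted in this section are ratios of CPU time to GPU time. The following small sketch (the function name is an assumption of this example) applies that definition to two Satellite entries from Table 4.

```python
def speedup(cpu_seconds, gpu_seconds):
    """Speedup of the GPU over the CPU; values below 1 mean the GPU is slower."""
    return cpu_seconds / gpu_seconds

# Satellite, RBFN centers: CPU 17.35 s versus GTX470 0.67 s (Table 4).
rbfn_centers = speedup(17.35, 0.67)
# Satellite, RAN-LTM: CPU 656.98 s versus GTX470 803.95 s (Table 4).
ran_ltm = speedup(656.98, 803.95)
```

The first ratio is well above 1, whereas the second is below 1, which matches the observation that RAN-LTM does not benefit from the GPU on the small UCI benchmarks.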
4 Conclusions and Future Work
We have implemented both the batch (RBFN) and the incremental learning with long term memory (RAN-LTM) algorithms on the GPU. By exploiting the multi-thread capability of multi-core processors, our approach has been tested with data sets from the UCI benchmarks and with a real-world data set for multimedia forgery detection. The GPU-based RBFN batch algorithm performed better for the smaller benchmark datasets, while for the larger (and more difficult) audio steganalysis data, the RAN-LTM parallel version yields higher speedups than the sequential counterparts. The performances (for all cases tested) were statistically competitive with the results in the literature. Although the creation of the model is faster with the RBFN algorithm, the RAN-LTM can be useful in environments that need fast-changing models and involve high-dimensional data. Future work will optimize the implementation towards better GPU support.
References
1. Asuncion, A., Newman, D.: UCI machine learning repository (2007), http://www.ics.uci.edu/~mlearn/MLRepository.html
2. Carpenter, G.A., Grossberg, S.: The ART of adaptive pattern recognition by a self-organizing neural network. IEEE Computer 21, 77–88 (1988)
3. Liu, Q., Sung, A.H., Qiao, M.: Temporal derivative-based spectrum and mel-cepstrum audio steganalysis. IEEE Transactions on Information Security 4(3), 359–368 (2009)
4. Okamoto, K., Ozawa, S., Abe, S.: A fast incremental learning algorithm of RBF networks with long-term memory. In: IJCNN 2003: Proc. of the International Joint Conference on Neural Networks, vol. 1, pp. 102–107. IEEE Computer Society, Los Alamitos (2003)
5. Platt, J.: A resource-allocating network for function interpolation. Neural Computation 3(2), 213–225 (1991)
6. Tabuchi, T., Ozawa, S., Roy, A.: An autonomous learning algorithm of resource allocating network. In: Corchado, E., Yin, H. (eds.) IDEAL 2009. LNCS, vol. 5788, pp. 134–141. Springer, Heidelberg (2009)
Gabor Descriptors for Aerial Image Classification
Vladimir Risojević, Snježana Momić, and Zdenka Babić
Faculty of Electrical Engineering, University of Banja Luka, Bosnia and Herzegovina
s
[email protected], {vlado,zdenka}@etfbl.net
Abstract. The amount of remotely sensed imagery that has become available by far surpasses the possibility of manual analysis. One of the most important tasks in the analysis of remotely sensed images is land use classification. This task can be recast as semantic classification of remotely sensed images. In this paper we evaluate classifiers for semantic classification of aerial images. The evaluated classifiers are based on Gabor and Gist descriptors, which have long been established in image classification tasks. We use support vector machines and propose a kernel well suited for use with Gabor descriptors. These simple classifiers achieve a correct classification rate of about 90% on two datasets. It follows from these results that, in aerial image classification, simple classifiers give results comparable to more complex approaches, and the pursuit of more advanced solutions should continue with this in mind.
Keywords: Aerial image classification, Gabor filters, Gist descriptor.
1 Introduction
There is a constantly increasing number of instruments for remote sensing of the Earth. Consequently, many databases of remotely sensed data are being flooded with data. At the moment, images dominate these databases, both in variety and quantity. Remote sensing imaging of the Earth is done by a variety of airborne and space-borne imagers in various spectral bands, ranging from the visible spectrum to microwave [8]. There are many applications of remote sensing imaging, both military and civilian. Civilian applications include land use planning, weather forecasting, studying long-term climate changes, crop monitoring, studying deforestation, city planning, and many others. These applications require the development of effective means for acquisition, processing, transmission, storage, retrieval, and analysis of images.

One of the key problems in aerial image analysis is the problem of semantic classification. This problem is closely related to the task of land use monitoring, which is necessary for control of environmental quality as well as for maintaining and improving living conditions and standards. The holy grail of automatic land use classification is pixel-level semantic segmentation of remotely sensed images. The result of a pixel-level segmentation is a thematic map in which each pixel is assigned a predefined label from a finite set. However, remote sensing images are often multispectral and of high resolution, which makes their detailed semantic segmentation an excessively computationally demanding task. This is the reason why some researchers decided to classify image blocks instead of individual pixels. We also adopt this approach and evaluate, on the task of aerial image classification, classifiers based on state-of-the-art image descriptors and support vector machines, which have shown good results in image classification tasks.

The contribution of this paper is in the evaluation of Gabor and Gist descriptors for the task of aerial image classification. For the classifier based on Gabor descriptors we propose a kernel based on the distance function proposed for Gabor descriptors. In the experiments we show that the classifier based on Gabor descriptors yields similar or better performance compared to the Gist descriptor based classifier, despite the lower dimensionality of the former. We also show that these simple classifiers yield classification performance which is better than or comparable with that of some more complicated classifiers using more features.

The paper is organized as follows. In Section 2 we briefly review previous related work. Image representation and classifier are described in Section 3, and experimental results are given in Section 4. In Section 5 we conclude and give ideas for future research.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 51–60, 2011. © Springer-Verlag Berlin Heidelberg 2011
2 Related Work
There has been a long history of using computer vision techniques for classification of aerial and satellite images. We briefly review here some of the methods that are relevant to our work. Ma and Manjunath [3] use Gabor descriptors for representing aerial images. Their work is centered around efficient content-based retrieval from a database of aerial images, and they did not try to automatically classify images into semantic categories. Parulekar et al. [7] classify satellite images into four semantic categories in order to enable fast and accurate browsing of the image database. Fauquer et al. [2] classify aerial images based on color, texture and structure features. The authors tested their algorithm on a dataset of 1040 aerial images from 8 categories. In a more recent work [6], Ozdemir and Aksoy use a bag-of-words model and frequent subgraph mining to construct higher level features for satellite image classification. The algorithm is tested on a dataset of 585 images classified into 8 semantic categories. Our work is in a similar vein, but rather than trying to construct semantic features for image classification we focus on low level features and aerial images. Despite the wide use of the Gist descriptor [5] in general-purpose image classification, to the best of our knowledge there are not many examples of aerial image classification using the Gist descriptor. An exception is the work on tree detection by Yang et al. [10], where Gist is used for clustering of images prior to the detection phase.
3 Image Representation and Classifier
In this paper we evaluate two image descriptors, both based on Gabor filters. There is a long tradition of using Gabor descriptors in computer vision and image processing, dating back to Daugman [1], who noted the similarity between low level processing in biological vision and Gabor filter banks. Subsequently, Gabor descriptors have been used for various tasks including texture segmentation, image recognition, iris recognition, registration, and motion tracking. In the context of image classification the most notable are their uses for texture classification and retrieval, pioneered by Manjunath and Ma [4], and, more recently, for scene classification using the Gist descriptor, as proposed by Oliva and Torralba [5].

3.1 Gabor Descriptor
The Gabor descriptor for an image is computed by passing the image through a bank of Gabor filters. A Gabor filter is a linear band-pass filter whose impulse response is defined as a Gaussian function modulated with a complex sinusoid,

$$g(x, y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\left[-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right) + 2\pi j \Omega x\right], \qquad (1)$$

where $\Omega$ is the frequency of the Gabor function, and $\sigma_x$ and $\sigma_y$ determine its bandwidth. Gabor showed that these functions are optimal in the sense of minimizing the joint two-dimensional uncertainty in space and frequency [1]. Impulse responses of the filters in a Gabor filter bank are dilated (scaled) and rotated versions of the function (1). Filters in a Gabor filter bank can be considered as edge detectors with tunable orientation and scale, so that information on texture can be derived from statistics of the outputs of those filters [4]. We can consider (1) as a mother Gabor wavelet, and the functions obtained by its dilations and rotations are Gabor wavelets. For a given image $I(x, y)$, $(x, y) \in \Psi$ ($\Psi$ is the set of image points), the output of a Gabor filter bank is actually the Gabor wavelet transformation of that image, which can be written as

$$W_{mn}(x, y) = \int_{\Psi} I(x_1, y_1)\, g^{*}_{mn}(x - x_1, y - y_1)\, dx_1\, dy_1, \qquad (2)$$

where $g_{mn}(x, y)$ are Gabor wavelets at scale m and orientation n, obtained from (1), and the asterisk denotes complex conjugation. Assuming that image regions have homogeneous texture, means $\mu_{mn}$ and standard deviations $\sigma_{mn}$ of the transform coefficients are used to represent the texture of the region:

$$\mu_{mn} = \iint_{\Psi} |W_{mn}(x, y)|\, dx\, dy, \qquad (3)$$

$$\sigma_{mn} = \sqrt{\iint_{\Psi} \left(|W_{mn}(x, y)| - \mu_{mn}\right)^2 dx\, dy}. \qquad (4)$$

The Gabor descriptor is now formed as a vector of means and standard deviations of filter responses

$$x = \left[\mu_{00}\ \sigma_{00}\ \mu_{01}\ \sigma_{01}\ \cdots\ \mu_{(S-1)(K-1)}\ \sigma_{(S-1)(K-1)}\right], \qquad (5)$$

where S is the total number of scales, and K is the total number of orientations. These values are typically set heuristically, through cross-validation. In [4] a distance metric based on the weighted $L_1$-norm is proposed for computing the dissimilarity between textures:

$$d(x_i, x_j) = \sum_{m} \sum_{n} d_{mn}(x_i, x_j), \qquad (6)$$

where

$$d_{mn}(x_i, x_j) = \left|\frac{\mu_{mn}^{(i)} - \mu_{mn}^{(j)}}{\alpha(\mu_{mn})}\right| + \left|\frac{\sigma_{mn}^{(i)} - \sigma_{mn}^{(j)}}{\alpha(\sigma_{mn})}\right|, \qquad (7)$$

and $\alpha(\mu_{mn})$ and $\alpha(\sigma_{mn})$ are the standard deviations of the respective features over the entire database.

3.2 Gist Descriptor
Oliva and Torralba proposed the Gist descriptor [5] to represent the spatial envelope of the scene. The spatial envelope is a set of holistic scene properties which can be used for inferring the semantic category of the scene, without the need for recognition of the objects in the scene. The Gist descriptor of an image is computed by first filtering the image with a bank of Gabor filters, and then averaging the responses of the filters in each block of a 4 × 4 non-overlapping grid. Comparing this descriptor to the Gabor descriptor, we see that the Gist descriptor is essentially a spatial layout of textures. Note that here the standard deviations of the distribution of filter responses are not used. Despite its simplicity, this descriptor shows very good results in natural scene classification tasks.

3.3 Classifier
As a classifier we use a support vector machine (SVM). Since distances between Gabor descriptors are computed using (6), we construct a kernel function starting from this metric as

$$K(x_i, x_j) = \exp\left[-d(x_i, x_j)\right], \qquad (8)$$

where $d(x_i, x_j)$ is given by (6). This kernel function is essentially based on the weighted $L_1$-norm, and it satisfies the Mercer condition [9]. For the Gist descriptor we follow the approach in [5] and use an SVM with a radial basis function kernel. We construct a multi-class classifier using N (corresponding to the number of categories) one-vs-all SVMs and selecting the class with the maximal SVM output.
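The distance of Eqs. (6)-(7) and the kernel of Eq. (8) can be sketched in a few lines of Python. Because the descriptor of Eq. (5) interleaves the means and standard deviations, the double sum over scales and orientations reduces to one sum over the descriptor components. The function names are hypothetical, the population standard deviation is an assumed choice for α, and every feature is assumed to have a nonzero deviation over the database.

```python
import math
import statistics

def feature_scales(descriptors):
    """alpha(.) in Eq. (7): standard deviation of each feature over the database."""
    return [statistics.pstdev(col) for col in zip(*descriptors)]

def gabor_distance(xi, xj, alpha):
    """Eqs. (6)-(7): weighted L1 distance between two Gabor descriptors."""
    return sum(abs(a - b) / s for a, b, s in zip(xi, xj, alpha))

def gabor_kernel(xi, xj, alpha):
    """Eq. (8): K(xi, xj) = exp(-d(xi, xj))."""
    return math.exp(-gabor_distance(xi, xj, alpha))
```

A Gram matrix built with `gabor_kernel` over the training descriptors could then be passed to any SVM implementation that accepts precomputed kernels; which library was used in the paper is not stated.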
Gabor Descriptors for Aerial Image Classification
4
Datasets and Experimental Results
We tested the described image representations and classifier on two datasets, both consisting of aerial images. The first is our in-house dataset and contains images of a part of Banja Luka, Bosnia and Herzegovina. The second contains images used previously for aerial image classification [2], and we include it here for comparison purposes.
4.1
In-House Dataset
For the evaluation of the classifiers we used a 4500×6000 pixel multispectral (RGB) aerial image of a part of Banja Luka, Bosnia and Herzegovina. The image contains a variety of structures, both man-made, such as buildings, factories, and warehouses, and natural, such as fields, trees, and rivers. We partitioned this image into 128×128 pixel tiles and used a total of 606 images in our experiments. We manually classified all images into 6 categories, namely: houses, cemetery, industry, field, river, and trees. Examples of images from each class are shown in Fig. 1. It should be noted that the distribution of images over these categories is highly uneven, as can be seen from the bar graph in Fig. 2. In our experiments we used half of the images for training and the other half for testing. We computed Gabor descriptors at 8 scales and 8 orientations for all images in the dataset. We also tried other combinations of numbers of scales and orientations and chose the one with the best performance. Gabor descriptors, as proposed in [4], are computed for grayscale images. Since our images are multispectral, we compute the Gabor descriptor for each of the 3 spectral bands of an image and concatenate the obtained vectors, which yields 3 × 8 × 8 × 2 = 384-dimensional descriptors. For comparison purposes we also compute Gabor descriptors for grayscale (panchromatic) versions of the images, which are 8 × 8 × 2 = 128-dimensional. As for Gist descriptors, we obtained the best results with the default setup, i.e., a filter bank at 4 scales and 8 orientations. For this descriptor we also compute a grayscale variant, which is 4 × 8 × 16 = 512-dimensional, and a color variant, which is 3 × 4 × 8 × 16 = 1536-dimensional. For testing our classifiers we used 10-fold cross-validation, each time with a different random partition of the dataset, and averaged the results. Average classification accuracies over all categories are given in Table 1.
In the table, Gabor (full) denotes Gabor descriptor as given in (5), while Gabor (mean) denotes descriptor obtained using only means of filter-bank responses.
Table 1. Comparison of the classification accuracies for the in-house dataset
Descriptor      Panchromatic (grayscale) (%)    Multispectral (RGB) (%)
Gabor (full)    84.5                            88.0
Gabor (mean)    80.7                            84.5
Gist            79.5                            89.3
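The way the Gabor (full), Gabor (mean), and multispectral descriptors are assembled, and the dimensionalities quoted above, can be checked with a small sketch. It assumes the filter-bank response magnitudes are already available (the filtering step itself is not shown); the function names and the dummy responses are hypothetical, and the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def gabor_descriptor(responses, use_std=True):
    """Stack the mean (and optionally the std) of each filter response, as in (5).

    `responses` holds one flat list of response magnitudes per
    (scale, orientation) filter for a single image band.
    """
    descriptor = []
    for magnitudes in responses:
        descriptor.append(mean(magnitudes))
        if use_std:
            # Assumption: population standard deviation of the response.
            descriptor.append(pstdev(magnitudes))
    return descriptor


def multispectral_descriptor(band_responses, use_std=True):
    """Concatenate the per-band descriptors of a multispectral image."""
    descriptor = []
    for responses in band_responses:
        descriptor.extend(gabor_descriptor(responses, use_std))
    return descriptor


# Dummy responses: 8 scales x 8 orientations, 4 "pixels" per response.
SCALES, ORIENTATIONS = 8, 8
dummy_band = [[0.0, 1.0, 2.0, 3.0] for _ in range(SCALES * ORIENTATIONS)]

gray_full = gabor_descriptor(dummy_band)                 # Gabor (full), grayscale
gray_mean = gabor_descriptor(dummy_band, use_std=False)  # Gabor (mean), grayscale
rgb_full = multispectral_descriptor([dummy_band] * 3)    # Gabor (full), RGB
```

With these settings the three descriptors have 128, 64, and 384 components, respectively.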
V. Risojević, S. Momić, and Z. Babić
Fig. 1. Samples of images from all classes. From left to right, column-wise: houses, cemetery, industry, field, river, trees. (Best viewed in color.)
Fig. 2. Per category distribution of images in the in-house dataset
We see that the Gist descriptor computed over all spectral bands of an RGB image has the best performance, at the cost of high descriptor dimensionality. It is worth noting that the much simpler Gabor descriptor, with 4 times lower dimensionality, yields similar performance. Even more interesting is the fact that for grayscale (panchromatic) images the Gabor descriptor outperforms Gist. These results indicate that the classifiers benefit from information from multiple spectral bands. When grayscale images are considered, the standard deviations of the Gabor filter bank responses provide richer information about the texture of the image, hence the better performance of the Gabor descriptor. The importance of this information can be seen from the drop in performance when only the means of the Gabor filter bank responses are used. Another conclusion is that the spatial layout of filter bank responses does not have a beneficial influence on the performance of an aerial image classifier, in contrast to the case of general scenes [5].
Fig. 3. Confusion matrix for the in-house dataset using Gabor (RGB) descriptor
Fig. 4. Confusion matrix for the in-house dataset using Gist (RGB) descriptor
The confusion matrix for the Gabor descriptor is given in Fig. 3. We note that confusions mainly arise between categories that can be difficult to distinguish even for humans. The most notable examples are houses versus cemetery, because both contain rectangular structures with strong oriented edges, and river versus field, because both have a homogeneous, smooth texture without pronounced edges. It is also important to note that there are few confusions between natural (river, trees, field) and man-made (houses, cemetery, industry) categories. The confusion matrix for the Gist descriptor is given in Fig. 4. The observations made for the confusion matrix of the Gabor descriptor are also valid here.
Table 2. Comparison of the classification accuracies for Window on the UK dataset
Method                             Accuracy (%)
SVM with Gabor descriptor (RGB)    90.8
SVM with Gist descriptor (RGB)     87.1
Algorithm from [2]                 89.4
SVM with features from [2]         92.3
Fig. 5. Confusion matrix for Window on the UK dataset using Gabor descriptor
4.2
Window on the UK Dataset
For our second experiment we chose the Window on the UK dataset, which was also used in [2]. This dataset consists of 1040 aerial images of 64 × 64 pixels, which are manually classified into the following 8 categories: building, road, river, field, grass, tree, boat, vehicle. There are 130 images per category, so the distribution of images over categories in this dataset is uniform, in contrast to our in-house dataset. The authors of [2] also proposed a split into training and test sets of 520 images each. For images from this dataset we computed the Gabor descriptor at 8 scales and 8 orientations, as well as the Gist descriptor, and then trained a multi-class classifier as described previously. Table 2 compares the classification accuracies for this dataset. Again, the Gabor and Gist descriptors give comparable performance, this time with some advantage on the side of the Gabor descriptor. This supports our previous findings about the descriptive power of these two descriptors. Moreover, the performance of our classifier with Gabor descriptors is better than that of the algorithm proposed in [2], and only slightly worse than that of the SVM classifier trained with features from [2].
The confusion matrix for Gabor descriptor is shown in Fig. 5. We can see that common misclassifications again occur in cases that can also potentially confuse human subjects, such as building versus vehicle and field versus grass. It is important to note that, in this case too, misclassifications rarely occur between natural and man-made categories.
5
Conclusion
In this paper we evaluate two image descriptors, namely the Gabor and Gist descriptors, and show that classifiers based on these descriptors achieve results comparable to or better than those of more complex approaches. Both descriptors have previously shown good results in texture and image classification tasks. As a classifier we use an SVM with a standard radial basis function kernel, as well as with a kernel constructed from a metric function proposed for comparing Gabor descriptors. We show that, for multispectral images, lower-dimensional Gabor descriptors show similar or better performance than Gist, while, for panchromatic images, Gabor descriptors outperform Gist. This is mainly due to the fact that spatial layout is not a strong cue for the semantic classification of aerial images, since their texture regions are rather spatially homogeneous. Also, Gabor descriptors use the standard deviations of filter bank responses, and the richer representation that they provide is another reason for their better performance. Despite its simplicity, the classifier based on Gabor descriptors and SVMs with the weighted L1-norm kernel achieves better performance than more complex classifiers trained with color, texture and structural descriptors. This finding calls for a more thorough investigation of descriptors used for aerial image classification, since it is possible that state-of-the-art descriptors from other application areas do not perform better than simpler descriptors on the task at hand. Comparing the results of this paper with the literature, we also note that using multiple features does not guarantee better results. Therefore, another important research area stemming from these results is feature combination. This question needs more elaborate studies that will show which features are needed to adequately represent aerial images, and how they should be combined.
Also, the whole community would benefit from more manually annotated ground truth datasets which are publicly available so that the algorithms from various groups can be compared.
References
1. Daugman, J.G.: Complete discrete 2-D Gabor transforms by neural networks for image analysis and compression. IEEE Transactions on Acoustics, Speech and Signal Processing 36(7), 1169–1179 (1988)
2. Fauqueur, J., Kingsbury, N.G., Anderson, R.: Semantic discriminant mapping for classification and browsing of remote sensing textures and objects. In: Proceedings of IEEE International Conference on Image Processing (ICIP 2005), pp. 846–849 (2005)
3. Ma, W.Y., Manjunath, B.S.: A texture thesaurus for browsing large aerial photographs. Journal of the American Society for Information Science 49(7), 633–648 (1998)
4. Manjunath, B.S., Ma, W.Y.: Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(8), 837–842 (1996)
5. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision 42(3), 145–175 (2001)
6. Ozdemir, B., Aksoy, S.: Image classification using subgraph histogram representation. In: Proceedings of 20th IAPR International Conference on Pattern Recognition, Istanbul, Turkey (2010)
7. Parulekar, A., Datta, R., Li, J., Wang, J.Z.: Large-scale satellite image browsing using automatic semantic categorization and content-based retrieval. In: IEEE International Workshop on Semantic Knowledge in Computer Vision, in Conjunction with IEEE International Conference on Computer Vision, Beijing, China, pp. 1873–1880 (2005)
8. Ramapriyan, H.K.: Satellite imagery in earth science applications. In: Castelli, V., Bergman, L.D. (eds.) Image Databases, pp. 35–82. John Wiley & Sons, Inc., Chichester (2002)
9. Vapnik, V.: Statistical Learning Theory. John Wiley, Chichester (1998)
10. Yang, L., Wu, X., Praun, E., Ma, X.: Tree detection from aerial imagery. In: Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS 2009, New York, NY, USA, pp. 131–137 (2009)
Text Representation in Multi-label Classification: Two New Input Representations
Rodrigo Alfaro (1,2) and Héctor Allende (1,3)
1 Universidad Técnica Federico Santa María, Chile
2 Pontificia Universidad Católica de Valparaíso, Chile
3 Universidad Adolfo Ibáñez, Chile
Abstract. Automatic text classification is the task of assigning unseen documents to a predefined set of classes. Text representation for classification purposes has traditionally been approached using a vector space model due to its simplicity and good performance. On the other hand, multi-label automatic text classification has typically been addressed either by transforming the problem under study so that binary techniques can be applied or by adapting binary algorithms to work with multiple labels. In this paper we present two new representations for text documents based on label-dependent term-weighting for multi-label classification. We focus on modifying the input. Performance was tested with a well-known dataset and compared to alternative techniques. Experimental results based on Hamming loss analysis show an improvement over alternative approaches.
Keywords: Multi-label text classification, text modelling, problem transformation.
1
Introduction
Large amounts of text documents available in digital format on the web contain useful information for a wide variety of purposes. The amount of digital text is expected to increase significantly in the near future; thus, the need for the development of data analysis solutions becomes urgent. Text classification (or categorisation) is defined as the assignment of a Boolean value to each pair $\langle d_j, c_i \rangle \in D \times C$, where $D$ is the domain of documents and $C = \{c_1, \ldots, c_{|C|}\}$ is the set of predefined labels [12]. Binary classification (BC) is the simplest and most widely studied case. In BC, a document is classified into one of two mutually exclusive classes. BC can be extended to solve multi-class problems. Moreover, if a document is categorised with either one label or multiple labels at once, it is called a single-label or multi-label problem, respectively [12]. Tsoumakas and Katakis [14] present a formal description of multi-label methods. In [14], $L = \{\lambda_j : j = 1 \ldots l\}$, where $\lambda_j$ corresponds to the j-th label, is the finite set of labels in a multi-label learning task, and $D = \{f(x_i; Y_i);\ i = 1 \ldots m\}$ denotes a set of multi-label training data, where $x_i$ is the feature vector and $Y_i \subseteq L$ is the set of labels of the i-th example. Methods for solving this problem are grouped into two types, namely, problem transformation and algorithm adaptation. The first type of method is algorithm-independent; it transforms the multi-label learning task into one or more single-label classification tasks. Thus, this type of method can be implemented using efficient binary algorithms. The most common problem transformation method (PT4) learns $|L|$ binary classifiers $H_l : X \to \{l, \neg l\}$, one for each different label $l$ in $L$. PT4 transforms the original data set into $|L|$ data sets $D_l$, $l = 1 \ldots |L|$. Each $D_l$ labels every example in $D$ with $l$ if $l$ is contained in the example, and with $\neg l$ otherwise. PT4 yields the same solution for both the single-label and multi-class problems using a binary classifier. For the classification of a new instance $x$, this method generates a set of labels as the union of the labels generated by the $|L|$ classifiers: $H_{PT4}(x) = \bigcup_{l \in L} \{l\} : H_l(x) = l$. The second type of method extends specific learning algorithms to handle multi-label data directly. These extensions are achieved by adjustments such as modifications to classical formulations from statistics or information theory. The pre-processing of documents for better representation can also be grouped in this type. Multi-label classification is an important problem for real applications, as can be observed in many domains, such as functional genomics, text categorisation, music mining and image classification. The purpose of this paper is to present a new representation for documents based on label-dependent term-weighting. Lan et al. [6] propose the tf-rf representation for two-class, single-label classification problems. Our representation is a generalisation of tf-rf applied to multi-label classification problems. This paper is organised as follows. In section 2, we briefly introduce multi-label text classification. In section 3, we analyse text representation. Our proposal for two new methods of representation is illustrated in section 4. In section 5, we compare the performance of our proposal with other algorithms. The last section is devoted to concluding remarks.
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 61–70, 2011. © Springer-Verlag Berlin Heidelberg 2011
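The PT4 transformation described in the introduction can be illustrated with a short sketch. The function names, the toy documents, and the keyword-based stand-in classifiers are all hypothetical; in practice each binary classifier would be trained on its own data set $D_l$.

```python
def pt4_datasets(examples, labels):
    """Build one binary data set per label from multi-label examples.

    `examples` is a list of (features, label_set) pairs; in each binary
    data set the target is True when the label is in the example's label set.
    """
    return {
        label: [(x, label in y) for x, y in examples]
        for label in labels
    }


def pt4_predict(classifiers, x):
    """Union of the labels whose binary classifier accepts x."""
    return {label for label, accepts in classifiers.items() if accepts(x)}


# Toy multi-label data (made-up documents and labels).
examples = [
    ("wheat harvest report", {"grain", "wheat"}),
    ("new import tariff", {"trade"}),
]
binary_sets = pt4_datasets(examples, ["grain", "wheat", "trade"])

# Stand-in binary classifiers: simple keyword rules instead of trained models.
classifiers = {
    "grain": lambda text: "wheat" in text or "corn" in text,
    "wheat": lambda text: "wheat" in text,
    "trade": lambda text: "tariff" in text,
}
predicted = pt4_predict(classifiers, "wheat tariff dispute")
```

Here `predicted` contains every label whose classifier fired, so a single document can receive several labels at once.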
2
Multi-label Text Classification
The automatic classification of multi-label text has not been thoroughly addressed in the existing literature. Although many multi-label datasets are available, most techniques for automatic text classification treat them only as single-label datasets. One of the first approaches developed was BoosTexter, an algorithm based on boosting for the multi-label case [11]. This algorithm adjusts the weights of training examples and their labels in the training phase; labels that are hard (easy) to predict correctly get incrementally higher (lower) weights. Among the proposals presented in [14], problem transformation is the most widely used. However, the automatic classification of multi-label text has also been addressed by algorithms that directly capture the characteristics of the multi-label problem. Zhang and Zhou, for example, solved the multi-label problem using Backpropagation for Multi-label Learning (Bp-MLL), which uses artificial neural networks with multiple outputs. Bp-MLL is derived from backpropagation by employing a novel error function that captures the characteristics of multi-label learning [16]. Regardless of the approach taken to the multi-label problem and the algorithms that solve it, according to Joachims [4], any text classification task has complexities due to the high-dimensional feature space, a heterogeneous use of terms, and a high level of redundancy. Multi-label problems have additional complexities, including a large number of tags per document. These characteristics of a multi-label problem require different methods of evaluation than those used in traditional single-label problems.
3
Problem Representation
The performance of a reasoning system depends heavily on problem representation. The same task may be easy or difficult, depending on the way it is described [3]. The explicit representation of relevant information enhances machine performance. Also, a more complex representation may work better with simpler algorithms. Document representation has a high impact on the task of classification [5]. Some elements used for representing documents include N-grams, single words, phrases, or logical terms and statements. The vector space model is one of the most widely used models for ad-hoc information retrieval, mainly because of its conceptual simplicity and the appeal of its underlying metaphor of using spatial proximity for semantic proximity [9]. Space representation can be conceived as a kernel representation. Kernel methods are an approach for solving machine learning problems. Joachims was among the first authors to use kernel-based methods to categorise text [4]. Cristianini et al. utilised the kernel-based approach for representing the vector space model and latent semantic indexing [2]. Similarly, Tsivtsivadze et al. established a mapping of input data into a feature space by means of a kernel function and then used learning algorithms to discover relationships in that space [13]. In the vector space model (VSM), the contents of a document are represented by a vector in the term space d = {w1 ; . . . ; wk }, where k is the size of the term (or feature) set. Terms may be measured at several levels, such as syllables, words, phrases, or any other semantic and/or syntactic unit used to identify the content of a text. Different terms have different importance within a text, and thus, the relevance indicator wi (usually between 0 and 1) represents how much the term ti contributes to the semantics of the document d. To weight terms in the vector space model, the frequency of occurrence of a word in the document can be used as the term weight.
However, there are more effective methods for term-weighting. The basic information used to derive term weights is term frequency, document frequency, or sometimes collection frequency. There are different mappings of text to input space across different text classification approaches. Leopold and Kindermann, for example, combine mappings with different kernel functions in support vector machines [8]. According to Lan et al. [7], two important decisions for choosing a representation based on VSM are the following. First, what should constitute a term? For example, should it be a subword, word, multi-word or meaning? Second, how should a term be weighted? Term-weighting can be a binary function or term frequency-inverse document frequency (tf-idf), developed by Salton and Buckley [10], or can use feature selection metrics such as χ², information gain (IG), or gain ratio (GR). Term-weighting methods improve the effectiveness of text classification by assigning appropriate weights to terms. Although text classification has been studied for several decades, term-weighting methods for text classification are usually borrowed from the traditional information retrieval (IR) field, including, for example, the Boolean model, tf-idf, and its variants. Table 1 shows the variables that we will consider in a term-weighting method for multi-label problems.
Table 1. Variables utilized in term-weighting in a multi-label problem for a term t with |L| labels
               t               ¬t
label_1        a_{t,λ_1}       d_{t,λ_1}
label_λj       a_{t,λ_j}       d_{t,λ_j}
label_|L|      a_{t,|L|}       d_{t,|L|}
Here $a_{t,\lambda_j}$ is the number of documents in the class $\lambda_j$ containing the term t, and $d_{t,\lambda_j}$ is the number of documents in the class $\lambda_j$ that do not contain the term t.
3.1
Bag-of-Words Representation (tf-idf)
The most widely used document representation for text classification is tf-idf [12], where, for a two-class problem (where label_1 is class+ and label_2 is class−), each component of the vector is computed as:
$$\text{tf-idf}_{td} = f_{t,d} \log_{10} \frac{N}{N_t}, \tag{1}$$
where $f_{t,d}$ is the frequency of term t in the document d, $N = a_{t,\lambda_1} + d_{t,\lambda_1} + a_{t,\lambda_2} + d_{t,\lambda_2}$ is the number of documents, and $N_t = a_{t,\lambda_1} + a_{t,\lambda_2}$ is the number of documents containing the term t.
3.2
Relevance Frequency Representation (tf-rf)
Lan et al. [7] recently proposed tf-rf as an improved VSM representation for two-class, single-label problems (where label_1 is class+ and label_2 is class−):
$$\text{tf-rf}_{td} = f_{t,d} \log_2\left(2 + \frac{a_{t,\lambda_1}}{\max(1, a_{t,\lambda_2})}\right), \tag{2}$$
where $f_{t,d}$ is the frequency of term t in the document d, $a_{t,\lambda_1}$ is the number of documents in the positive class containing the term t, and $a_{t,\lambda_2}$ is the number of documents in the negative class containing the term t. The function $\max(1, a_{t,\lambda_2})$ in the denominator ensures that $\text{tf-rf}_{td}$ remains defined even if $a_{t,\lambda_2}$ is zero. According to [7], using this representation on different single-label data sets improves the performance of two-class classifiers. For multi-class problems, [7] used a one-versus-all method. Note that the tf-rf representation is intended for single-label problems and does not consider the frequency of the term in each of the other classes separately. That is, it only considers the appearance of the term in the class under evaluation (positive) versus all the other classes taken together (negative).
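Equations (1) and (2) translate directly into code. The following is a minimal sketch with hypothetical function and parameter names; the counts in the example are made up for illustration.

```python
import math


def tf_idf(f_td, n_docs, n_docs_with_term):
    """Equation (1): term frequency times log10(N / N_t)."""
    return f_td * math.log10(n_docs / n_docs_with_term)


def tf_rf(f_td, a_pos, a_neg):
    """Equation (2): term frequency times log2(2 + a_pos / max(1, a_neg)).

    max(1, a_neg) keeps the ratio defined when no negative document
    contains the term.
    """
    return f_td * math.log2(2 + a_pos / max(1, a_neg))


# A term occurring 3 times in a document, in 10 of 1000 documents overall.
weight_idf = tf_idf(3, 1000, 10)

# The same kind of term seen in 6 positive documents and 0 or 3 negative ones.
weight_rf_rare_in_negative = tf_rf(1, 6, 0)
weight_rf_common_in_negative = tf_rf(1, 6, 3)
```

The tf-rf weight drops as the term becomes more common in the negative class, whereas tf-idf ignores class membership altogether.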
4
Our Proposal for a New Representation of Multi-label Datasets
On the one hand, tf-idf as a representation of documents considers only the frequency of terms in the document (tf) and the frequency of terms in all documents (idf), disregarding the class or label to which the documents belong. On the other hand, tf-rf also considers the frequency of terms in the document (tf) and the frequency of terms in all documents of the class under evaluation (rf). That is, in tf-rf, each document is represented by a different vector when assessing whether it belongs to a particular class. From a theoretical point of view, our extension of the tf-rf representation changes the representation of a document according to the label under evaluation, thereby achieving larger differences between documents belonging to different labels and thus harnessing the performance of binary classifiers. In this way, important information about the frequency in the other classes is used, especially when the frequency of the term shows sharp variations, as in the example in Table 2.
Table 2. Example of frequency of a term for each label
Label       1    2    3    4    5    6    7    8    9
Frequency   53   76   87   66   62   27   25   28   26
We propose the use of a centrality function, the μ-Relevance Frequency of a Label (tf-μrfl), over the frequency of a term for each label. It is derived from the term frequency and the relevance frequency of a given label; as such, it constitutes a new representation based on tf-rf for a multi-label problem:
$$\text{tf-}\mu\text{rfl}_{tdl} = f_{t,d} \log_2\left(2 + \frac{a_{t,l}}{\mu(a_{t,\lambda_{j/l}})}\right), \tag{3}$$
where $\mu(a_{t,\lambda_{j/l}})$ is a function over the set $a_{t,\lambda_{j/l}} = \{a_{t,\lambda_1}, \ldots, a_{t,\lambda_{l-1}}, a_{t,\lambda_{l+1}}, \ldots, a_{t,|L|}\}$. We will consider $\mu(a_{t,\lambda_{j/l}}) = \max(1, \mathrm{mean}(a_{t,\lambda_{j/l}}))$ for the tf-rfl representation and $\mu(a_{t,\lambda_{j/l}}) = \max(1, \mathrm{median}(a_{t,\lambda_{j/l}}))$ for the tf-rrfl representation. Both functions are centrality measures: the mean is a classical metric and the median is a robust one.
4.1
Relevance Frequency of a Label
Relevance frequency of a label, tf-rfl, is derived from the μ-Relevance Frequency of a Label, tf-μrfl; as such, it constitutes a new representation for a multi-label problem:
$$\text{tf-rfl}_{tdl} = f_{t,d} \log_2\left(2 + \frac{a_{t,l}}{\max(1, \mathrm{mean}(a_{t,\lambda_{j/l}}))}\right). \tag{4}$$
In equation (4), the term $\mathrm{mean}(a_{t,\lambda_{j/l}})$ is the average, taken over the labels other than l, of the number of documents containing the term t.
4.2
Robust Relevance Frequency of a Label
Robust relevance frequency of a label, tf-rrfl, is also derived from the μ-Relevance Frequency of a Label, tf-μrfl; as such, this is the second new representation for a multi-label problem:
$$\text{tf-rrfl}_{tdl} = f_{t,d} \log_2\left(2 + \frac{a_{t,l}}{\max(1, \mathrm{median}(a_{t,\lambda_{j/l}}))}\right). \tag{5}$$
The use of the median should yield more robust results in datasets containing large differences between the frequency of occurrence of a term in a given set of labels and in the other label sets under evaluation.
4.3
Classification Method
The proposed term-weighting methods include information on the frequency of occurrence of a term t in each set of documents labelled other than the label under evaluation. It is expected that the ratios $a_{t,l} / \mathrm{mean}(a_{t,\lambda_{j/l}})$ and $a_{t,l} / \mathrm{median}(a_{t,\lambda_{j/l}})$ will be higher if the term t appears more frequently in documents with label l than in documents with the other labels $\lambda_{j/l}$, and lower if the term t is more frequent in documents with labels other than l. Our proposal is based on the tf-rfl and tf-rrfl representations and an ensemble of binary SVMs. It transforms the multi-label problem into a PT4 form [14]; then, for each document d, the tf-rfl and tf-rrfl representations are derived for each label $\lambda_j$ and classified using |L| binary classifiers.
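As an illustration of equations (4) and (5), the sketch below computes the two label-dependent weights for the term frequencies listed in Table 2, with the term frequency $f_{t,d}$ set to 1. The function names are hypothetical and the code is only one possible reading of the formulas.

```python
import math
from statistics import mean, median


def relevance_frequency(counts, label, centre):
    """log2(2 + a_{t,l} / max(1, centre(counts for the other labels)))."""
    others = [a for other, a in counts.items() if other != label]
    return math.log2(2 + counts[label] / max(1, centre(others)))


def tf_rfl(f_td, counts, label):
    """Equation (4): the centrality function is the mean."""
    return f_td * relevance_frequency(counts, label, mean)


def tf_rrfl(f_td, counts, label):
    """Equation (5): the centrality function is the median."""
    return f_td * relevance_frequency(counts, label, median)


# Document counts a_{t,label} for one term, taken from Table 2.
table2 = dict(zip(range(1, 10), [53, 76, 87, 66, 62, 27, 25, 28, 26]))
rfl = {label: tf_rfl(1, table2, label) for label in table2}
rrfl = {label: tf_rrfl(1, table2, label) for label in table2}
```

For these counts both weights are larger for every one of labels 1 to 5 than for any of labels 6 to 9, which matches the behaviour the representations are designed to produce.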
5
Experiments
The evaluation of the proposed tf-rfl and tf-rrfl representations was carried out using the Reuters-21578 Distribution 1.09. The Reuters-21578 data set consists of 21,578 Reuters newswire documents that appeared in 1987, less than half of which have human-assigned topic labels. The data set and the validation mechanism used are the same as in [16]; that is, the subsets of the k classes with the largest number of articles are selected for k = 3, ..., 9, resulting in seven different data sets denoted First3, First4, ..., First9. Also, in this test 3-fold cross-validation is run ten times on each data set. Our classification method reports the average values over the ten runs. Table 3 shows the data set characteristics.
Table 3. Characteristics of the pre-processed data set. Note that PMC denotes the percentage of documents belonging to more than one class and ANL denotes the average number of labels for each document.
Data Set   Number of Classes   Number of Documents   Vocabulary Size   PMC     ANL
First3     3                   7,258                 529               0.74%   1.0074
First4     4                   8,078                 598               1.39%   1.0140
First5     5                   8,655                 651               1.98%   1.0207
First6     6                   8,817                 663               3.43%   1.0352
First7     7                   9,021                 677               3.62%   1.0375
First8     8                   9,158                 683               3.81%   1.0396
First9     9                   9,190                 686               4.49%   1.0480
First, we must transform the problem into a PT4 form, dividing the data into k input data sets for k = 3, ..., 9 binary classifiers, whereby each machine classifies one label against the others. Four representations were constructed from the data set, namely, the classical tf-idf and tf-rf representations and our proposed tf-rfl and tf-rrfl representations. An ensemble of binary SVM classifiers was used. Each machine employed a linear kernel; the parameters were optimised by maximising the classification margin between each pair of classes. The ensemble was implemented with LibSVM [1], where each machine worked with random sampling. Two-thirds of the examples were used for training, and one-third was used for testing. Note that all tf-idf representations are the same, regardless of the label under evaluation, while the tf-rf, tf-rfl and tf-rrfl representations are different for each label. Multi-label classification methods require different performance metrics than those used in traditional single-label classification methods. These measures can be grouped into bipartitions and rankings [15]. Since our method is not based on ranking, as in [11] and [16], the evaluation of the results in this research was performed using Hamming loss, which considers bipartitions and evaluates how many times an instance-label pair was misclassified. This measure of error is defined as:
$$\mathrm{hloss}(h) = \frac{1}{d} \sum_{i=1}^{d} \frac{1}{|L|}\, |h(x_i)\, \Delta\, Y_i|, \tag{6}$$
where $d$ is the number of documents, $h(x_i)$ is the set of labels predicted by the classifier for document $x_i$, $Y_i$ is the set of original labels of that document, and $\Delta$ is the symmetric difference between the two sets. Performance is better when hloss(h) is near 0.
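Equation (6) can be computed directly from predicted and true label sets. The sketch below uses hypothetical names and made-up labels; Python's `^` operator on sets gives the symmetric difference.

```python
def hamming_loss(predicted, actual, n_labels):
    """Equation (6): average size of the symmetric difference, scaled by 1/|L|.

    `predicted` and `actual` are equally long lists of label sets, one per
    document, and `n_labels` is |L|.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    mismatches = sum(len(p ^ y) for p, y in zip(predicted, actual))
    return mismatches / (len(actual) * n_labels)


# Three documents and three possible labels (made-up example).
predicted = [{"grain"}, {"grain", "trade"}, set()]
actual = [{"grain"}, {"trade"}, {"crude"}]
loss = hamming_loss(predicted, actual, n_labels=3)
```

In this example two of the nine instance-label pairs are wrong (one spurious label and one missed label), so the loss is 2/9.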
Table 4. Experimental results of SVM ensembles with tf-idf, tf-rf, tf-rfl and tf-rrfl compared with other learning algorithms in terms of Hamming loss. Bp-MLL* and BoosTexter* as reported by [16]
Data set   SVM tf-idf   SVM tf-rf   SVM tf-rfl   SVM tf-rrfl   Bp-MLL*   BoosTexter*
First3     0.02797      0.02814     0.02716      0.02578       0.0368    0.0236
First4     0.02641      0.02687     0.02590      0.02478       0.0256    0.0250
First5     0.02590      0.02611     0.02526      0.02427       0.0257    0.0260
First6     0.02477      0.02522     0.02412      0.02321       0.0271    0.0262
First7     0.02246      0.02287     0.02186      0.02110       0.0252    0.0249
First8     0.02083      0.02118     0.02026      0.01958       0.0230    0.0229
First9     0.01981      0.02012     0.01930      0.01870       0.0231    0.0226
Average    0.02402      0.02436     0.02341      0.02249       0.02664   0.02446
Table 4 shows the different representations and their performance in terms of Hamming loss. On this metric, for data sets with fewer classes, Boostexter is better than tf-rfl and tf-rrfl by 0.00356 and 0.00218, respectively. For data sets with more classes (namely, First5, First6, First7, First8 and First9), tf-rfl is better than the other algorithms. Table 4 also shows that tf-rrfl is better than the other algorithms for data sets with more classes (namely, First4, First5, First6, First7, First8 and First9).

To evaluate the results, as in [16], a two-tailed paired t-test at the 5 percent significance level was applied. According to these results, SVM Ens tf-rfl performs better than SVM Ens tf-idf (4.2595 × 10^-6), SVM Ens tf-rf (2.0376 × 10^-7) and Bp-MLL (3.74 × 10^-2). In addition, SVM Ens tf-rrfl performs better than SVM Ens tf-idf (2.5368 × 10^-5), SVM Ens tf-rf (4.2013 × 10^-6) and Bp-MLL (1.63 × 10^-2). The p-value shown in parentheses quantifies the significance level. Table 5 shows the level of statistical significance relative to the alternative approaches with respect to Hamming loss. The differences with respect to Boostexter are not statistically significant for data sets with fewer labels (First3, First4, First5), but for data sets with more labels (First6, First7, First8 and First9), Boostexter has the worst performance among all algorithms.

Finally, in Figure 1, we show how the different weighting methods discriminate whether or not a term is important for a classifier. In this case, using rrfl and rfl, the term is weighted high for labels 1, 2, 3, 4 and 5, and lower for labels 6, 7, 8 and 9. Note that idf does not discriminate between labels and rf discriminates only slightly.

Table 5. Statistical analysis of results in terms of the p-value of Student's t-test. NSS means "is not statistically significant".

              SVM tf-rfl       SVM tf-rf        SVM tf-idf       Bp-MLL         BoosT.
SVM tf-rrfl   1.0754 × 10^-4   4.2013 × 10^-6   2.5368 × 10^-5   1.63 × 10^-2   NSS
SVM tf-rfl    -                2.0376 × 10^-7   4.2013 × 10^-6   3.74 × 10^-2   NSS
SVM tf-rf     -                -                4.2595 × 10^-6   NSS            NSS
SVM tf-idf    -                -                -                NSS            NSS
Bp-MLL        -                -                -                -              NSS
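The Hamming loss used in Tables 4 and 5 can be illustrated with a short sketch. It assumes the usual multi-label definition (the fraction of instance-label pairs on which the predicted and true label sets disagree); the function name and the toy label sets are hypothetical and are not taken from the experiments.

```python
def hamming_loss(true_labels, predicted_labels, num_labels):
    """Fraction of instance-label pairs on which prediction and truth disagree.

    true_labels, predicted_labels: lists of label sets, one set per document.
    num_labels: total number of labels in the problem.
    """
    mismatches = 0
    for truth, pred in zip(true_labels, predicted_labels):
        # The symmetric difference holds labels that are wrongly added or wrongly missed.
        mismatches += len(set(truth) ^ set(pred))
    return mismatches / (len(true_labels) * num_labels)


# Toy example with 3 documents and 4 possible labels (illustrative values only).
truth = [{1, 2}, {3}, {1, 4}]
pred = [{1, 2}, {2, 3}, {4}]
loss = hamming_loss(truth, pred, num_labels=4)
```

Lower values are better; a loss of 0 means every label set was predicted exactly.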
Fig. 1. Term-weights assigned by different representations for each label
6 Remarks and Conclusions
Multi-label classification is an important topic in information retrieval and machine learning. Text representation and classification have traditionally been addressed using tf-idf due to its simplicity and good performance. Changes in the input representation can exploit knowledge about the problem, a particular label, or the class to which the document belongs. Other representations can be developed for overcoming a particular problem directly, without transformation. New benchmarks should be used to validate the results; however, the preprocessing of multi-labelled texts must be standardised.

In this paper, we have presented tf-μrfl as a novel text representation for the multi-label classification approach. This proposal was assessed with two new input representations, tf-rfl and tf-rrfl. This representation considers the label to which the document belongs, thereby combining problem transformation with algorithm adaptation. The performance of this representation was tested in combination with an SVM ensemble using a known dataset. The results show a statistically significant improvement over alternative approaches with respect to Hamming loss. We believe that the contribution of the proposed multi-label representation is due to a better understanding of the problem under consideration. In future studies, we plan to compare our method to other tf-idf representations and to investigate other label-dependent representations and procedures in order to reduce the dimension of the feature space depending on the relevance of each label.

Acknowledgement. This work has been partially funded by the Research Grants: Fondecyt 1110854 and Research Grant Basal FB0821 "Centro Científico Tecnológico de Valparaíso".
R. Alfaro and H. Allende
References
[1] Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines (2001), Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[2] Cristianini, N., Shawe-Taylor, J., Lodhi, H.: Latent semantic kernels. Journal of Intelligent Information Systems 18(2-3), 127–152 (2002)
[3] Fink, E.: Automatic evaluation and selection of problem-solving methods: Theory and experiments. Journal of Experimental and Theoretical Artificial Intelligence 16(2), 73–105 (2004)
[4] Joachims, T.: Learning to classify text using support vector machines – methods, theory, and algorithms. Kluwer-Springer (2002)
[5] Keikha, M., Razavian, N.S., Oroumchian, F., Razi, H.S.: Document representation and quality of text: An analysis. In: Survey of Text Mining II: Clustering, Classification, and Retrieval, pp. 135–168. Springer, London (2008)
[6] Lan, M., Tan, C.-L., Low, H.-B.: Proposing a new term weighting scheme for text categorization. In: AAAI 2006: Proceedings of the 21st National Conference on Artificial Intelligence, pp. 763–768. AAAI Press, Menlo Park (2006)
[7] Lan, M., Tan, C.L., Su, J., Lu, Y.: Supervised and traditional term weighting methods for automatic text categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 721–735 (2009)
[8] Leopold, E., Kindermann, J.: Text categorization with support vector machines. How to represent texts in input space? Machine Learning 46(1-3), 423–444 (2002)
[9] Manning, C., Schutze, H.: Foundations of statistical natural language processing. The MIT Press, Cambridge (1999)
[10] Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing and Management: an International Journal 24(5), 513–523 (1988)
[11] Schapire, R.E., Singer, Y.: Boostexter: A boosting-based system for text categorization. Machine Learning, 135–168 (2000)
[12] Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002)
[13] Tsivtsivadze, E., Pahikkala, T., Boberg, J., Salakoski, T.: Kernels for text analysis. Advances of Computational Intelligence in Industrial Systems 116, 81–97 (2008)
[14] Tsoumakas, G., Katakis, I.: Multi label classification: An overview. International Journal of Data Warehouse and Mining 3(3), 1–13 (2007)
[15] Tsoumakas, G., Katakis, I., Vlahavas, I.: Mining multi-label data. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook, 2nd edn. Springer, Heidelberg (2010)
[16] Zhang, M.-L., Zhou, Z.-H.: Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering 18(10), 1338–1351 (2006)
Fraud Detection in Telecommunications Using Kullback-Leibler Divergence and Latent Dirichlet Allocation
Dominik Olszewski
Faculty of Electrical Engineering, Warsaw University of Technology, Poland
Abstract. In this paper, a method for telecommunications fraud detection is proposed. The method is based on user profiling with Latent Dirichlet Allocation (LDA). Fraudulent behavior is detected with a threshold-type classification algorithm, which assigns each telecommunication account to one of two classes: fraudulent account and non-fraudulent account. The accounts are classified using the Kullback-Leibler divergence (KL-divergence). Therefore, we also introduce four methods for approximating the KL-divergence between two LDAs. Finally, the results of an experimental study on KL-divergence approximation and fraud detection in telecommunications are reported.
Keywords: Fraud detection, User profiling, Kullback-Leibler divergence, Mixture models, Latent Dirichlet Allocation.
1 Introduction
There are a number of fraud detection problems, including credit card fraud, money laundering, computer intrusion, and telecommunications fraud, to name but a few. Among them, fraud detection in telecommunications appears to be one of the most difficult: a large amount of data needs to be analyzed and, at the same time, only a small number of fraudulent call samples is available as learning data for learning-based methods. This limits the applicability of learning-based techniques such as neural-network-based classifiers.

The problem of fraud detection in telecommunications has been studied in [1,2,3,4,5]. In [1], the Gaussian Mixture Model (GMM) is applied for user profiling, and a high fraud recognition rate is reported. The paper [2] employs Latent Dirichlet Allocation (LDA) to build user profile signatures. The authors assume that any significant unexplainable deviation from the normal activity of an individual user is strongly correlated with fraudulent activity. The authors of [3] investigate the usefulness of applying different learning approaches to the problem of telecommunications fraud detection, while in [4] an expert system is constructed that incorporates both the network administrator's expert knowledge and knowledge derived from the application of data mining techniques to real-world data. Finally, the recent study [5] aimed at identifying customers' subscription fraud by employing data mining techniques and adopting a knowledge discovery process; to this end, a hybrid approach consisting of pre-processing, clustering, and classification phases was applied.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 71–80, 2011. © Springer-Verlag Berlin Heidelberg 2011

The Kullback-Leibler divergence (KL-divergence) between two probability measures P and Q on a continuous measurable space Ω is defined as [6,7]:

\[ d(P, Q) \stackrel{\mathrm{def}}{=} \int_{\Omega} p \log_2 \frac{p}{q} \, d\lambda , \tag{1} \]

where p and q are the density functions of measures P and Q, respectively, while measures P and Q are absolutely continuous with respect to measure λ.

Our approach is based on user profiling with LDA and detects fraudulent behavior by binary classification, i.e., classification into one of two classes: fraudulent account and non-fraudulent account. We apply a threshold-type classification algorithm using the KL-divergence. Consequently, our method requires the computation of the KL-divergence between two LDAs, which is an unsolved problem. Therefore, this paper also addresses the approximation of the KL-divergence between two LDAs, introduces four approximation methods, and chooses the most effective one. Fraudulent activity is indicated by crossing the pre-defined threshold.

Our technique relies strongly on user profiling with the LDA probabilistic model. Employing LDA for fraud detection in telecommunications was first proposed in [2]. The difference between [2] and our paper is that we detect whole fraudulent accounts, whereas in [2] single fraudulent calls are detected. Consequently, we apply a different classification algorithm. This kind of approach is also useful in real-world fraud detection problems.

Recapitulating, this paper proposes:
– four methods for approximating the KL-divergence between two LDAs,
– a threshold-type classification algorithm for fraud detection in telecommunications.
An advantage of our probabilistic approach is that it does not involve a learning process and thus avoids the associated difficulty of insufficient learning data.
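To make the definition in (1) concrete, the following minimal sketch evaluates its discrete analogue, where the integral becomes a sum over outcomes. The function name and the example distributions are illustrative assumptions, not part of the method itself.

```python
import math


def kl_divergence_bits(p, q):
    """Discrete analogue of Eq. (1): sum_i p_i * log2(p_i / q_i)."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if q_i == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log2(p_i / q_i)
    return total


# Illustrative distributions over two outcomes.
p = [0.5, 0.5]
q = [0.25, 0.75]
d_pq = kl_divergence_bits(p, q)
d_qp = kl_divergence_bits(q, p)
```

The two directions give different values, which is a reminder that the KL-divergence is not symmetric; the order of the reference model and the classified model therefore matters.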
2 Using LDA for User Profiling
The choice of this specific probabilistic model of a telecommunication user was motivated by its properties, which provide an accurate description of a user profile. The model is dynamically developed for individuals within a group, and it explicitly captures the assumption that there exists a common set of behavioral patterns, which can be estimated on the basis of all observed users, along with their user-specific proportions of participation [2]. The model itself was introduced in [8].
LDA is a generative probabilistic model for collections of discrete data. It is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of latent classes. The basic idea, derived from [2], is that the accounts are represented as finite mixtures over latent classes, where a class is characterized by a distribution over features of calls made from the account. As the features, we use the destination, start-time, and duration of a call. The accounts are coded as bags of feature-vectors.

Procedure 1. An account can be generated from the LDA model using the following procedure:
Step 1. Draw the number of iterations $N \sim \mathrm{Poisson}(\xi)$.
Step 2. Draw the parameter for the account class distribution $\theta \sim \mathrm{Dir}(\alpha)$, where $\alpha$ is the parameter of the prior Dirichlet distribution over the latent classes.
Step 3. For $i = 1, \ldots, N$:
  Step a. Draw the class $z_i \sim \mathrm{Mul}(\theta)$.
  Step b. Draw the feature $a_i$ from $p(a \mid z_i, \beta)$, a multinomial probability distribution of the vector of features $a$, conditioned on the class $z_i$, which selects the row $\beta_{z_i}$ of the matrix-parameter $\beta$.

The LDA model has two parameters: a vector $\alpha = [\alpha_1, \ldots, \alpha_K]$ (the parameter of the Dirichlet distribution) and a $K \times V$ matrix $\beta$, whose rows are the parameters of the multinomial distributions. $K$ is the number of latent classes, and $V$ is the number of features in vector $a$. The variable $N$ is independent of the other data-generating variables of the model ($\theta$, $z$), and, therefore, its randomness may be ignored. For the convenience of further considerations, we will assume $N \equiv K$.

The posterior distributions of the hidden variables $\theta$ and $z$ are estimated using variational approximation. The model parameters $\alpha$ and $\beta$ are estimated using the variational EM algorithm ($\alpha$ and $\beta$ maximize the (marginal) log likelihood of the data).
Given the parameters $\alpha$ and $\beta$, the joint distribution of a latent class mixture $\theta$, a vector of $K$ latent classes $z$, and a vector of $V$ features $a$ is given by [8]:

\[ p(\theta, z, a \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{i=1}^{K} p(z \mid \theta)\, p(a \mid z_i, \beta) , \tag{2} \]
where $p(\theta \mid \alpha)$ is the Dirichlet probability distribution of the variable $\theta = [\theta_1, \ldots, \theta_K]$, $p(z \mid \theta)$ is the multinomial probability distribution of the vector $z$ of $K$ latent classes, with vector-parameter $\theta$, and $p(a \mid z_i, \beta)$ is the multinomial probability distribution of the vector $a$ of $V$ features, with the vector-parameter $\beta_{z_i}$ (the $z_i$-th row of the matrix $\beta$).

At this point, we note that the symbols p and q will be abused throughout this paper, i.e., they will refer to different types of probability distributions, but always to probability distributions. This kind of notation abuse is common in probability and statistics. It is used, e.g., in [2,8].
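The generative view of Procedure 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions: the number of calls is passed in as a fixed value rather than drawn from a Poisson distribution (the text notes its randomness may be ignored), each feature is simplified to a single index in 0..V-1, and all function names and parameter values are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dir(alpha) by normalising independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probabilities, rng):
    """Draw one index according to the given probability vector."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


def generate_account(alpha, beta, n_calls, rng):
    """Steps 2 and 3 of Procedure 1, with N fixed to n_calls (assumption)."""
    theta = sample_dirichlet(alpha, rng)         # Step 2: class proportions
    classes, features = [], []
    for _ in range(n_calls):                     # Step 3
        z_i = sample_index(theta, rng)           # Step a: latent class
        a_i = sample_index(beta[z_i], rng)       # Step b: feature from row beta[z_i]
        classes.append(z_i)
        features.append(a_i)
    return theta, classes, features


# Illustrative parameters: K = 3 latent classes, V = 4 feature values.
rng = random.Random(0)
alpha = [1.0, 2.0, 0.5]
beta = [
    [0.70, 0.20, 0.10, 0.00],
    [0.10, 0.10, 0.40, 0.40],
    [0.25, 0.25, 0.25, 0.25],
]
theta, classes, features = generate_account(alpha, beta, n_calls=10, rng=rng)
```

Fitting the model to real accounts goes the other way (variational EM, as stated above); this sketch only shows how data would arise from given parameters.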
3 KL-Divergence between Multinomial Mixture Models
Since the LDA model incorporates multinomial mixtures, it is necessary to evaluate the KL-divergence between them in order to approximate the KL-divergence between LDAs. We introduce the notion of the Multinomial Mixture Model (MMM), referring to the product of multinomial probability distributions. Consequently, a pair of MMMs can be described with the following formulae:

\[ p(x) = \prod_{a} \mathrm{Mul}(x; N_a, P_a) = \prod_{a} p_a(x) , \tag{3} \]

\[ q(x) = \prod_{b} \mathrm{Mul}(x; N_b, Q_b) = \prod_{b} q_b(x) , \tag{4} \]

where $N_a$, $P_a$ and $N_b$, $Q_b$ are the parameters of the distributions $p(x)$ and $q(x)$, respectively. The parameters $N_a$ and $N_b$ are the numbers of trials, while the parameters $P_a$ and $Q_b$ are the event probabilities.

The problem of determining the KL-divergence between two MMMs is analytically intractable. This is due to the strong statistical dependence between the random variables of each MMM's components, i.e., each component has the same variable (explained in Section 4). Therefore, an approximation needs to be employed. We propose three methods for approximating the KL-divergence between two MMMs:

1. The nearest pair method. This approach is inspired by the nearest pair method for approximating the KL-divergence between two GMMs discussed in [9]. Hence, we have:

\[ d^{\min}_{\mathrm{MMM}}(p, q) = \min_{a,b} d(p_a, q_b) . \tag{5} \]

2. The furthest pair method. This method is analogous to the previous one, with the difference that the furthest pair is considered:

\[ d^{\max}_{\mathrm{MMM}}(p, q) = \max_{a,b} d(p_a, q_b) . \tag{6} \]

3. The mixed sum method. In this case, the KL-divergence is computed as the sum of the divergences between the corresponding components of the mixtures. Hence, for k-component MMMs, we get:

\[ d^{\mathrm{sum}}_{\mathrm{MMM}}(p, q) = \sum_{j=1}^{k} d(p_j, q_j) . \tag{7} \]

The drawback of this method is that both MMMs must have the same number of components.
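The three approximations in (5)-(7) can be sketched as follows. The sketch assumes that every component has the same number of trials `n_trials`, in which case the KL-divergence between two multinomial components reduces to `n_trials` times the divergence between their event-probability vectors; it also assumes q is positive wherever p is. The function names and the example components are illustrative only.

```python
import math


def kl_multinomial_bits(n_trials, p, q):
    """KL-divergence (bits) between Mul(n_trials, p) and Mul(n_trials, q)."""
    return n_trials * sum(
        p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0.0
    )


def d_min_mmm(ps, qs, n_trials):
    """Eq. (5): divergence of the nearest pair of components."""
    return min(kl_multinomial_bits(n_trials, p_a, q_b) for p_a in ps for q_b in qs)


def d_max_mmm(ps, qs, n_trials):
    """Eq. (6): divergence of the furthest pair of components."""
    return max(kl_multinomial_bits(n_trials, p_a, q_b) for p_a in ps for q_b in qs)


def d_sum_mmm(ps, qs, n_trials):
    """Eq. (7): sum over matching components; both models need k components."""
    if len(ps) != len(qs):
        raise ValueError("mixed sum method needs the same number of components")
    return sum(kl_multinomial_bits(n_trials, p_j, q_j) for p_j, q_j in zip(ps, qs))


# Two illustrative 2-component models over two events, one trial each.
ps = [[0.5, 0.5], [0.9, 0.1]]
qs = [[0.5, 0.5], [0.1, 0.9]]
nearest = d_min_mmm(ps, qs, n_trials=1)
furthest = d_max_mmm(ps, qs, n_trials=1)
mixed_sum = d_sum_mmm(ps, qs, n_trials=1)
```

In this toy case the first components coincide, so the nearest pair value is zero even though the second components differ strongly, whereas the mixed sum still registers the difference.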
4 KL-Divergence between LDAs
We propose three methods for approximating the KL-divergence between two LDAs. Our methods are based on the computation of the KL-divergence between the components of the LDAs, i.e., between the Dirichlet distributions and between the MMMs. The proposed methods differ in the method used for approximating the KL-divergence between MMMs. We also discuss the Monte-Carlo simulation method, which was used in our experiments as the reference method.

We consider a pair of LDAs of the following form:

\[ p(\theta, z, a \mid \alpha_1, \beta_1) = p(\theta \mid \alpha_1) \prod_{i=1}^{K} p(z \mid \theta)\, p(a \mid z_i, \beta_1) , \tag{8} \]

\[ q(\theta, z, a \mid \alpha_2, \beta_2) = q(\theta \mid \alpha_2) \prod_{i=1}^{K} q(z \mid \theta)\, q(a \mid z_i, \beta_2) . \tag{9} \]
Each LDA can be presented as a three-variable function, which, in turn, can be written as the product of three one-variable functions (a Dirichlet distribution and two MMMs):

\[ p(\theta, z, a) = p_1(\theta)\, p_2(z)\, p_3(a) , \tag{10} \]

where $p_1(\theta) = p(\theta \mid \alpha)$, $\theta \sim \mathrm{Dir}(\alpha)$; $p_2(z) = p(z \mid \theta)$, $z \sim \mathrm{MMM}(\theta)$; $p_3(a) = p(a \mid z, \beta)$, $a \sim \mathrm{MMM}(z, \beta)$. We will use this form of LDA for the approximation of the KL-divergence.

According to (1), functions of product form are mathematically convenient for the computation of the KL-divergence (logarithm of a product, integral over a density function). However, this convenience, which essentially simplifies the computations, is achieved only if the product components refer to independent variables. Hence, in the case of random variables, statistical independence is required. In the case of the LDA model, the joint distribution (2) implies statistical dependence between the random variables θ, z, and a. Therefore, the KL-divergence between two LDA models is not analytically tractable, and it can only be determined approximately. Consequently, our methods can be regarded as example approaches to such an approximation, which assume the statistical independence of the random variables θ, z, and a.

Assuming the random variables θ, z, and a are statistically independent, the KL-divergence between two LDAs can be written as follows:
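\[
\begin{aligned}
d(p(\theta, z, a), q(\theta, z, a))
&= \int_{\theta}\int_{z}\int_{a} p(\theta, z, a) \log_2 \frac{p(\theta, z, a)}{q(\theta, z, a)} \, da\, dz\, d\theta \\
&= \int_{\theta}\int_{z}\int_{a} p_1(\theta) p_2(z) p_3(a) \log_2 \frac{p_1(\theta) p_2(z) p_3(a)}{q_1(\theta) q_2(z) q_3(a)} \, da\, dz\, d\theta \\
&= \int_{\theta}\int_{z}\int_{a} p_1(\theta) p_2(z) p_3(a) \log_2 \frac{p_1(\theta)}{q_1(\theta)} \, da\, dz\, d\theta \\
&\quad + \int_{\theta}\int_{z}\int_{a} p_1(\theta) p_2(z) p_3(a) \log_2 \frac{p_2(z)}{q_2(z)} \, da\, dz\, d\theta \\
&\quad + \int_{\theta}\int_{z}\int_{a} p_1(\theta) p_2(z) p_3(a) \log_2 \frac{p_3(a)}{q_3(a)} \, da\, dz\, d\theta \\
&\approx \int_{\theta} p_1(\theta) \log_2 \frac{p_1(\theta)}{q_1(\theta)} \, d\theta + \int_{z} p_2(z) \log_2 \frac{p_2(z)}{q_2(z)} \, dz \\
&\quad + \int_{a} p_3(a) \log_2 \frac{p_3(a)}{q_3(a)} \, da \\
&= d(p_1(\theta), q_1(\theta)) + d(p_2(z), q_2(z)) + d(p_3(a), q_3(a)) \\
&= d_{\mathrm{Dir}} + d_{\mathrm{MMM}_1} + d_{\mathrm{MMM}_2} ,
\end{aligned}
\tag{11}
\]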
where $d_{\mathrm{Dir}} = d(p_1(\theta), q_1(\theta))$, $d_{\mathrm{MMM}_1} = d(p_2(z), q_2(z))$, $d_{\mathrm{MMM}_2} = d(p_3(a), q_3(a))$.

On the basis of this transformation, three approximation methods are proposed. They differ in the method applied for approximating the KL-divergence between MMMs.

1. The nearest pair method. In this method, the KL-divergence between MMMs is approximated according to the nearest pair method:

\[ d^{\min}_{\mathrm{LDA}} = d_{\mathrm{Dir}} + d^{\min}_{\mathrm{MMM}_1} + d^{\min}_{\mathrm{MMM}_2} , \tag{12} \]

where $d_{\mathrm{Dir}}$ can be calculated analytically, according to the formula given, e.g., in [10].

2. The furthest pair method. In this case, the KL-divergence between MMMs is approximated according to the furthest pair method:

\[ d^{\max}_{\mathrm{LDA}} = d_{\mathrm{Dir}} + d^{\max}_{\mathrm{MMM}_1} + d^{\max}_{\mathrm{MMM}_2} . \tag{13} \]

3. The mixed sum method. In this case, the KL-divergence between MMMs is approximated according to the mixed sum method:

\[ d^{\mathrm{sum}}_{\mathrm{LDA}} = d_{\mathrm{Dir}} + d^{\mathrm{sum}}_{\mathrm{MMM}_1} + d^{\mathrm{sum}}_{\mathrm{MMM}_2} . \tag{14} \]

4. The Monte-Carlo simulation method. In this case, the KL-divergence between two LDAs is approximated in the following way:

\[ d^{\mathrm{MC}}_{\mathrm{LDA}}(p, q) = \frac{1}{n} \sum_{i=1}^{n} \log_2 \frac{p(x_i)}{q(x_i)} \xrightarrow{\; n \to \infty \;} d(p, q) . \tag{15} \]
We use n i.i.d. samples $x_i$, $i = 1, \ldots, n$, drawn from the LDA model. Each sample $x_i$ is a vector $x_i = [\theta, z, a]$. Consequently, in each of the n iterations, three random variables need to be drawn. For a large number of samples (100K or 1M), this method yields a very accurate approximation, at the cost of a huge computational burden. Nevertheless, the Monte-Carlo method can be used successfully as a reference method for evaluating the other methods discussed in this paper.
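The Monte-Carlo estimator in (15) can be sketched on a case where the exact value is available for comparison. The example below substitutes two small discrete distributions for the LDA models, which is an assumption made purely so that the result can be checked; for LDAs, each sample would be a vector [θ, z, a] and the two callables would evaluate the joint distributions (8) and (9). All names are hypothetical.

```python
import math
import random


def monte_carlo_kl_bits(sample_from_p, prob_p, prob_q, n_samples, rng):
    """Eq. (15): average of log2(p(x_i) / q(x_i)) over samples x_i drawn from p."""
    total = 0.0
    for _ in range(n_samples):
        x = sample_from_p(rng)
        total += math.log2(prob_p(x) / prob_q(x))
    return total / n_samples


# Stand-in distributions over three outcomes (illustrative values).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
rng = random.Random(1)

estimate = monte_carlo_kl_bits(
    sample_from_p=lambda r: r.choices(range(3), weights=p, k=1)[0],
    prob_p=lambda x: p[x],
    prob_q=lambda x: q[x],
    n_samples=100_000,
    rng=rng,
)
# Exact discrete KL-divergence, used only to check the estimate.
exact = sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q))
```

The estimate approaches the exact value only at a rate of roughly one over the square root of n, which is why large sample counts, and hence a large computational burden, are needed.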
In the LDA model, the hidden random variable θ, drawn from the Dirichlet distribution with the vector-parameter α (the first parameter of the LDA model), is used as the vector-parameter of the first multinomial distribution. Then, in each of N iterations, the hidden random variable z is drawn from the first multinomial distribution and is used to select the row of the matrix β (the second parameter of the LDA model), which, in turn, is used as the vector-parameter of the second multinomial distribution (Procedure 1). Therefore, in order to obtain the parameters of MMM1 and MMM2, we have computed the expected values of the hidden random variables θ and z, i.e., $\bar{\theta} = E[\theta]$, $\theta \sim \mathrm{Dir}(\alpha)$; $\bar{z} = E[z]$, $z \sim \mathrm{Mul}(\theta)$.
5 Fraud Detection in Telecommunications
Fraud detection is performed by classifying each account into one of two classes: fraudulent account and non-fraudulent account. We propose a threshold-type classification algorithm for detecting fraudulent activity in telecommunications. Each account is profiled with the LDA probabilistic model described in Section 2. The detection is achieved by evaluating the KL-divergence between the reference account's model and the model of the account currently being classified. A fraud alarm is raised when the pre-defined threshold is crossed. The reference account should represent the most typical telecommunication user's behavior possible. The threshold value is set arbitrarily.

Our classification algorithm can be illustrated in 2-dimensional space with Fig. 1. Figure 1 presents ten LDA models of telecommunication accounts, two of which are detected as fraudulent, i.e., the points representing these accounts lie outside the circle determined by the reference model (center) and the threshold (radius).

Fig. 1. Graphical illustration of the proposed classification algorithm (figure labels: accounts 1 to 10, Reference Model, Threshold (radius))
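The threshold-type classification described in this section amounts to a single comparison per account. The sketch below assumes that the KL-divergence of each account's model from the reference model has already been computed; the account names, divergence values, and threshold are made up for illustration.

```python
def classify_accounts(divergences, threshold):
    """Label an account fraudulent when its divergence exceeds the threshold."""
    return {
        account: "fraudulent" if divergence > threshold else "non-fraudulent"
        for account, divergence in divergences.items()
    }


# Hypothetical divergences from the reference model (arbitrary units).
divergences = {"account_1": 0.8, "account_2": 1.1, "account_5": 4.2}
labels = classify_accounts(divergences, threshold=3.0)
```

In the picture of Fig. 1, an account whose divergence exceeds the threshold is one whose point falls outside the circle around the reference model.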
6 Experiments
In the first part of our experiments, we have investigated the accuracy of the proposed methods for approximating the KL-divergence between two LDAs. In the second part, we have conducted a telecommunication fraud detection experiment.

6.1 KL-Divergence Approximation between Two LDAs
We have evaluated the accuracy of the three methods proposed in this paper by comparing them with the Monte-Carlo method run for 1K samples. The parameters α and β of the simulated LDA models were generated randomly, i.e., the entries of the vector α were drawn from the uniform distribution on the interval [0, 5], while the entries of the matrix β were drawn from the uniform distribution on the interval [0, 1]. The rows of β were normalized (they are the parameters of the multinomial distributions). The experiments have been conducted for three and five latent classes. For each of these cases, five LDA models were investigated (Fig. 2).

Fig. 2. Results of KL-divergence approximation: (a) three latent classes, (b) five latent classes. Each panel plots the KL-divergence against the test number (1 to 5) for the Monte-Carlo, nearest pair, furthest pair, and mixed sum methods.
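The random parameter generation used for the simulated models can be sketched as follows. The number of feature values per row and the seed are assumptions, since the text specifies only the sampling intervals and the row normalisation; the function name is hypothetical.

```python
import random


def random_lda_parameters(num_classes, num_features, rng):
    """Draw alpha entries from U[0, 5] and row-normalised beta entries from U[0, 1]."""
    alpha = [rng.uniform(0.0, 5.0) for _ in range(num_classes)]
    beta = []
    for _ in range(num_classes):
        row = [rng.uniform(0.0, 1.0) for _ in range(num_features)]
        row_sum = sum(row)
        # Normalise so that each row is a valid multinomial parameter vector.
        beta.append([value / row_sum for value in row])
    return alpha, beta


# Three latent classes as in Fig. 2(a); four feature values is an assumption.
rng = random.Random(42)
alpha, beta = random_lda_parameters(num_classes=3, num_features=4, rng=rng)
```

A Dirichlet parameter must be strictly positive, so a practical implementation might resample any entry of α that happens to be drawn as exactly zero.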
The highest approximation accuracy was obtained with the mixed sum method. The experiments have shown that the mixed sum method, for three and five latent classes, provides accuracy similar to that of the Monte-Carlo simulation method. Given its much lower computational complexity, we can assert that this method outperforms the Monte-Carlo method and provides an efficient and effective way of approximating the KL-divergence between two LDAs.

6.2 Fraud Detection in Telecommunications Results
In the experiments, the performance of the proposed telecommunications fraud detection method was assessed by a comparison with the GMM-based fraud detection method. The GMM-based method is discussed in [1] and employs GMM probabilistic models for user profiling.

As the basis of the comparison, Receiver Operating Characteristic (ROC) curves were employed. The ROC curves show the fraud detection probability (true positive rate) as a function of the false alarm probability (false positive rate). In order to evaluate a specific curve, the Area Under ROC (AUROC) metric was used. It simply measures the area under the curve, and a perfect AUROC value is 1. Two further important values in the assessment of a ROC curve are the highest fraud detection rate corresponding to a zero false alarm rate (HDZF) and the lowest false alarm rate corresponding to the maximal (i.e., 1) fraud detection rate (LFMD).

The experiments were carried out on a data set consisting of a hundred telecommunication accounts, twenty of which were fraudulent. Each account was represented by one hundred call data records (CDRs). Each CDR contains information about a specific call made by a specific user. Hence, a CDR is a vector of features of a specific call, such as destination, start-time, or duration. For each account, the LDA and GMM models profiling the telecommunication user were built on the basis of a hundred CDRs. The building of the LDA model is discussed in Section 2. Each GMM model consisted of three Gaussians, each corresponding to a different feature in the CDRs. Hence, building the GMM model consisted in estimating the parameters μ (mean) and σ² (variance) of each Gaussian component.

Each CDR consists of three features. The first feature is the destination of a call (local, trunk, international, premium, toll-free, mobile), the second is the start-time (8-17, 17-22, 22-8), and the third is the duration (< 1 min, 1 to 3 min, 3 to 7 min, > 7 min). The number of latent classes was set to three.

Fig. 3. ROC curves (fraud detection rate against false alarm rate) for our method and for the GMM-based method: (a) our method (AUROC=0.9833), (b) GMM-based method (AUROC=0.9111)
For each LDA and GMM model, we computed the KL-divergence from the reference model in order to detect the models of fraudulent accounts. We used the mixed sum method in the case of LDAs, and the nearest pair method in the case of GMMs. The ROC curves were obtained by gradually decreasing the threshold value and observing the number of successful fraud detections against the number of false alarms. Figure 3a shows the ROC curve for our method, while Fig. 3b shows the ROC curve for the GMM-based fraud detection method. The ROC curves are accompanied by the corresponding AUROC values.

The results of our empirical study show that our method outperforms the GMM-based fraud detection method. It yields a higher AUROC value, and also higher HDZF and LFMD values. The AUROC value obtained with our method was 0.9833, while for the GMM-based method it was 0.9111.
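The threshold sweep used to obtain the ROC curves, together with the AUROC, HDZF, and LFMD values, can be sketched as below. The divergence scores and fraud labels are invented for illustration, the area is computed with the trapezoidal rule, and a score at or above the threshold is taken to raise an alarm (a tie-handling assumption).

```python
def roc_points(scores, is_fraud):
    """Sweep the threshold downwards and record (false alarm rate, detection rate)."""
    n_fraud = sum(is_fraud)
    n_legit = len(is_fraud) - n_fraud
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        detected = sum(1 for s, f in zip(scores, is_fraud) if s >= threshold and f)
        false_alarms = sum(1 for s, f in zip(scores, is_fraud) if s >= threshold and not f)
        points.append((false_alarms / n_legit, detected / n_fraud))
    return points


def auroc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical divergences from the reference model and ground-truth labels.
scores = [5.0, 4.0, 3.5, 2.0, 1.0, 0.5]
is_fraud = [True, True, False, True, False, False]

points = roc_points(scores, is_fraud)
area = auroc(points)
hdzf = max(y for x, y in points if x == 0.0)  # best detection rate with no false alarms
lfmd = min(x for x, y in points if y == 1.0)  # lowest false alarm rate at full detection
```

With these toy values the area equals the fraction of (fraudulent, legitimate) pairs in which the fraudulent account has the larger divergence, which is the usual ranking interpretation of the AUROC.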
7 Summary
The paper proposed a method for fraud detection in telecommunications, based on the user profiling and classification. The telecommunication users were profiled with the LDA probabilistic model. Fraudulent activity was detected on the basis of the threshold-type classification algorithm, assigning the accounts to one of two classes: fraudulent account and non-fraudulent account. The classification was performed on the basis of KL-divergence evaluation between a classified account’s LDA model, and a reference account’s LDA model.
References
1. Taniguchi, M., Haft, M., Hollmen, J., Tresp, V.: Fraud Detection in Communications Networks Using Neural and Probabilistic Methods. In: IEEE International Conference on Acoustics Speech and Signal Processing ICASSP 1998, vol. 2, pp. 1241–1244. IEEE, Los Alamitos (1998)
2. Xing, D., Girolami, M.: Employing Latent Dirichlet Allocation for Fraud Detection in Telecommunications. Pattern Recognition Letters 28, 1727–1734 (2007)
3. Hilas, C.S., Mastorocostas, P.A.: An Application of Supervised and Unsupervised Learning Approaches to Telecommunications Fraud Detection. Knowledge-Based Systems 21, 721–726 (2008)
4. Hilas, C.S.: Designing an Expert System for Fraud Detection in Private Telecommunications Networks. Expert Systems with Applications 36, 11559–11569 (2009)
5. Farvaresh, H., Sepehri, M.M.: A Data Mining Framework for Detecting Subscription Fraud in Telecommunication. Engineering Applications of Artificial Intelligence (2010)
6. Gibbs, A.L., Su, F.E.: On Choosing and Bounding Probability Metrics. International Statistical Review 70(3), 419–435 (2002)
7. Olszewski, D., Kolodziej, M., Twardy, M.: A Probabilistic Component for K-Means Algorithm and its Application to Sound Recognition. Przeglad Elektrotechniczny 86(6), 185–190 (2010)
8. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993–1022 (2003)
9. Hershey, J.R., Olsen, P.A.: Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models. In: IEEE International Conference on Acoustics Speech and Signal Processing ICASSP 2007, vol. 4(6), pp. 317–320 (2007)
10. Blei, D.M., Franks, K., Jordan, M.I., Mian, I.S.: Statistical Modeling of Biomedical Corpora: Mining the Caenorhabditis Genetic Center Bibliography for Genes Related to Life Span. BMC Bioinformatics 7(1) (May 2006)
Classification of EEG in a Steady State Visual Evoked Potential Based Brain Computer Interface Experiment
Zafer İşcan, Özen Özkaya, and Zümray Dokur
Department of Electronics and Communication Engineering, Istanbul Technical University, 34469 Istanbul, Turkey
{iscanz,ozkayao,dokur}@itu.edu.tr
Abstract. In this paper, electroencephalogram (EEG) signals of 20 subjects are classified in a steady state visual evoked potential (SSVEP) based brain computer interface (BCI) system by using 4 different stimulation frequencies in a program created by Visual C#. After applying proper pre-processing methods, power spectral density (PSD) based features are extracted around first and second harmonics of the stimulation frequencies. Average classification performance obtained from 20 subjects in 4-class classification is 83.62% with Nearest Mean Classifier (NMC). Results for 5-class classification, EEG segment size and gender differences are also analyzed in a detailed manner. The classification method is simple and very suitable for real-time experiments. Keywords: SSVEP, BCI, EEG, Classification.
1 Introduction

SSVEP is a resonance phenomenon that occurs mainly in the visual cortex when a person focuses visual attention on a light source flickering at a frequency above 4 Hz; the strongest SSVEP response is obtained for stimulation frequencies between 5 and 20 Hz [1]. The SSVEP response is widely distributed over the occipital and parietal lobes and is generally investigated using EEG [1]. SSVEPs are used in clinical and cognitive studies thanks to their high signal to noise ratio (SNR) and immunity to artifacts [2]. SSVEP is relatively independent of higher level cognitive processes, which gives SSVEP based BCIs an advantage in robustness. Besides, very little training is required for SSVEP based BCIs [1].

When a subject looks at a light source flickering at a constant frequency, the EEG shows an increase at the flickering frequency and its harmonics, which can be measured by transforming the EEG time series to the frequency domain [1]. Since the stimulation bandwidth is limited, increasing the number of stimulation frequencies decreases the discrimination performance.

In the BCI literature, 2-class SSVEP based systems are widely proposed. However, there are not many examples of multi-class interfaces [3]. Mukesh et al. [4] showed a new method that increases the number of selections by using appropriate frequency combinations. For example, 6 selections can be made by using only 3 frequencies. Friman et al. [1] presented novel methods for detecting SSVEPs using multiple EEG signals. They found the combinations of electrode signals that cancel strong interference signals in the EEG data and obtained high detection accuracy using short time segments.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 81–88, 2011. © Springer-Verlag Berlin Heidelberg 2011

In this study, multi-class (4 and 5) EEG classification is performed on a very large (20 subjects) dataset. High classification performance is obtained with only a few electrodes in real-life conditions (no shielded rooms). In Section 2, information about the BCI experiment (stimulation interface, subjects, setup, recording conditions) is given. Section 3 describes the feature extraction method, including the pre-processing steps. The classifiers used in the study and the results obtained are presented in Section 4. Conclusions and future work are summarized in Section 5.
2 BCI Experiment Setup
2.1 Stimulation Interface
A stimulation program was designed in Microsoft Visual C#. Its interface is shown in Fig. 1. The interface has a black fixation point in the middle of the window, surrounded by 4 red circles, each flickering at a different constant frequency. These frequencies can be adjusted independently from a set of pre-defined choices. The diameter of each circle is 6 cm. The frequencies selected for the circles are given in Table 1. These are the exact stimulation frequencies (delays introduced by the software are included).
Fig. 1. SSVEP stimulation interface with four red circles flickering at different frequencies
Classification of EEG in a SSVEP Based Brain Computer Interface Experiment
Table 1. Flickering frequencies of the red circles
Circle | up | down | right | left
Frequency (Hz) | 4.60 | 6.43 | 8.03 | 10.70
2.2 Subjects
10 healthy males (mean age 25.2, range 21–30) and 10 healthy females (mean age 24.6, range 21–29), all students of Istanbul Technical University, participated in the study after giving written consent. None of the participants reported a history of neurological disease. All participants were informed about the recording procedure and ethical issues before the experiment.
2.3 Experiment Setup
The recordings were made in a laboratory exposed to electrical and acoustical noise. Since a BCI system must work in real-life conditions, a shielded cabin was not used. The experiment setup is shown in Fig. 2.
Fig. 2. Recording setup of BCI experiment
Fig. 2 shows the stimulus computer on which the red circles are presented to the subject. The EEG signals acquired from the 16 active electrodes attached to the ActiCap [5] are transferred to the V-Amp16 [5] amplifier through connectors. The V-Amp16 amplifies and digitizes the signal (sampling frequency fs = 500 Hz) and sends it to the recording computer over a USB connection. The distance between the subject and the monitor of the stimulus computer is 40 cm. The electrode placement is shown in Fig. 3.
Fig. 3. Placement of the 16 active electrodes
The 16 electrode positions (TP9, CP5, CP1, CP2, CP6, TP10, P7, P3, PZ, P4, P8, PO9, O1, OZ, O2, PO10) are given according to the international 10–20 system. The reference electrode (REF), attached at FCz, and the ground electrode (GND), attached at AFz, are not shown in Fig. 3. Since the SSVEP response is widely distributed over the occipital and parietal lobes [1], the electrodes are placed densely in this area. A similar placement was used in [6]. Electrode-skin impedances were below 20 kΩ for all electrodes in every subject.
2.4 Recording Conditions
Recordings were made under 10 different conditions. Subjects rested for a few seconds between recordings. For each condition, the length of the EEG record is 30 seconds. Hence, for each subject, the total length of the recorded EEG data is 300 seconds (5 min). The recording conditions are presented in Table 2.
Table 2. 10 different recording conditions for each subject
# | Attention is given on | Eyes focused on | Recording length
1 | fixation point (without circles) | fixation point | 30 s
2 | upper circle | fixation point | 30 s
3 | lower circle | fixation point | 30 s
4 | right circle | fixation point | 30 s
5 | left circle | fixation point | 30 s
6 | upper circle | upper circle | 30 s
7 | lower circle | lower circle | 30 s
8 | right circle | right circle | 30 s
9 | left circle | left circle | 30 s
10 | fixation point (with circles) | fixation point | 30 s
In this study, only 5 recording conditions (#6–#10, see Table 2) are evaluated in the feature extraction and classification steps. In the 4-class case, conditions #6, #7, #8 and #9 are classified; condition #10 is added in the 5-class case. Conditions #2–#5 are related to covert attention and will be analyzed in another study.
3 Feature Extraction
The EEG data (30 s) acquired from the OZ channel of each subject are segmented into equal pieces (30, 15 and 10 pieces when the segment length is 1, 2 and 3 s, respectively), and trends in the segmented data are removed. A 4th-order Butterworth band-pass (BP) filter (0.53 Hz – 40 Hz) is then applied. Afterwards, the discrete Fourier transforms (DFTs) of the filtered EEG segments are calculated (Eq. 1).
\[ X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, 2, \ldots, N-1 \qquad (1) \]
In Eq. 1, x(n) represents the discrete samples of the EEG data and N is the number of samples used in the transform. After the DFT X(k) of the EEG samples is calculated, the square of the absolute value of X(k) is computed to obtain the power spectrum of the EEG (Eq. 2).
\[ \text{Power spectral density}(k) = |X(k)|^2, \qquad k = 0, 1, 2, \ldots, N-1 \qquad (2) \]
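As an illustration of Eqs. 1 and 2, the following Python sketch evaluates the DFT sum directly and squares its magnitude. The test signal (an 8 Hz sinusoid sampled at 64 Hz) and the function names are assumptions made for this example, not part of the recorded data or of the original implementation; in practice an FFT routine would replace the quadratic-time sum.

```python
import cmath
import math


def dft(x):
    """Direct evaluation of Eq. (1); O(N^2), fine for short illustrative signals."""
    n_samples = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples)
            for n in range(n_samples))
        for k in range(n_samples)
    ]


def power_spectrum(x):
    """Eq. (2): squared magnitude of every DFT coefficient."""
    return [abs(coeff) ** 2 for coeff in dft(x)]


# Hypothetical test signal: 1 s of an 8 Hz sinusoid sampled at 64 Hz.
fs = 64
signal = [math.sin(2 * math.pi * 8 * n / fs) for n in range(fs)]
psd = power_spectrum(signal)

# Only bins below fs / 2 carry independent information for a real signal.
peak_bin = max(range(fs // 2), key=lambda k: psd[k])
peak_hz = peak_bin * fs / len(signal)
```

With a 1 s segment the bin spacing is 1 Hz, so the spectral peak of this toy signal falls exactly on the bin at 8 Hz.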
The maximum power spectral density (PSD) amplitudes in the neighborhood of the 1st and 2nd harmonics of the stimulation frequencies are taken as features. In total, 8 features (2 for each frequency) are obtained in the 4-class case and 10 features in the 5-class case. The neighborhood was taken equal to the frequency resolution (the difference between successive points on the frequency axis). Although 3rd harmonics have also been found useful for classification in the literature [7], using only two harmonics gave better results in this study. The feature extraction procedure is summarized in Fig. 4.
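The feature definition above can be sketched as follows. This is a minimal illustration under stated assumptions: the "neighborhood" is read as one DFT bin on either side of the bin nearest to the harmonic, the first and second harmonics are taken as f and 2f, and the 2 s test segment sampled at 100 Hz is synthetic. The function names are hypothetical.

```python
import cmath
import math


def power_spectrum(x):
    """|X(k)|^2 for k = 0..N-1, computed with a direct O(N^2) DFT."""
    n_samples = len(x)
    return [
        abs(sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples)
                for n in range(n_samples))) ** 2
        for k in range(n_samples)
    ]


def harmonic_features(psd, fs, stim_freqs, n_harmonics=2):
    """Maximum PSD value near each harmonic of each stimulation frequency."""
    n_samples = len(psd)
    resolution = fs / n_samples  # spacing between DFT bins in Hz
    features = []
    for freq in stim_freqs:
        for harmonic in range(1, n_harmonics + 1):
            centre = round(harmonic * freq / resolution)
            # Assumption: "neighborhood" means one bin on either side.
            low = max(centre - 1, 0)
            high = min(centre + 1, n_samples // 2)
            features.append(max(psd[low:high + 1]))
    return features


# Hypothetical segment: 2 s at 100 Hz, containing the 8.03 Hz stimulus
# and a weaker component at its second harmonic.
fs, seconds = 100, 2
segment = [
    math.sin(2 * math.pi * 8.03 * n / fs)
    + 0.5 * math.sin(2 * math.pi * 16.06 * n / fs)
    for n in range(fs * seconds)
]
stim_freqs = [4.60, 6.43, 8.03, 10.70]  # Table 1
features = harmonic_features(power_spectrum(segment), fs, stim_freqs)
```

The resulting vector holds two values per stimulation frequency (first harmonic, then second harmonic), in the order of `stim_freqs`; for this synthetic segment the pair belonging to 8.03 Hz dominates.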
Fig. 4. Feature extraction blocks
4 Classification
Different classifiers from PRTools 4.0 [8] were tested on the dataset, and the classifiers that gave the best results are presented here. Support Vector Machines (SVMs) map the data, by a nonlinear mapping, into a space of higher dimension than the original feature space. After the mapping, data samples from two different classes become separable by a hyperplane [9]. In Fisher's linear discriminant analysis (LDA), a linear function is obtained that maximizes the ratio of between-class scatter to within-class scatter [9]. In the nearest mean classifier (NMC), a feature vector is classified by measuring the Euclidean distances between it and each of the class mean vectors; it is assigned the class of the nearest mean vector [9]. Results are given for the 4-class (25% by chance) and 5-class (20% by chance) cases and for three segment lengths (1, 2 and 3 s). Performance is reported as the mean overall accuracy after leave-one-out classification. The classification results are summarized in Table 3 and Table 4 for the 4-class and 5-class cases, respectively.
Table 3. Mean accuracies (%) in 4 class case
classifier | Nearest Mean | Nearest Mean | Nearest Mean | Fisher's LDA | Fisher's LDA | Fisher's LDA | SVM | SVM | SVM
length | 1 s | 2 s | 3 s | 1 s | 2 s | 3 s | 1 s | 2 s | 3 s
male | 51.4 | 66.5 | 78.0 | 49.4 | 67.1 | 74.0 | 47.3 | 64.8 | 74.0
female | 64.6 | 83.1 | 89.2 | 64.0 | 83.5 | 88.0 | 60.7 | 84.5 | 86.7
total | 58.0 | 74.8 | 83.6 | 56.7 | 75.3 | 81.0 | 54.0 | 74.6 | 80.3
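Table 4. Mean accuracies (%) in 5 class case
classifier | Nearest Mean | Nearest Mean | Nearest Mean | Fisher's LDA | Fisher's LDA | Fisher's LDA | SVM | SVM | SVM
length | 1 s | 2 s | 3 s | 1 s | 2 s | 3 s | 1 s | 2 s | 3 s
male | 44.2 | 55.3 | 67.2 | 42.3 | 55.2 | 64.0 | 40.6 | 54.5 | 62.4
female | 56.6 | 73.0 | 79.8 | 56.8 | 73.7 | 81.8 | 51.2 | 72.5 | 76.4
total | 50.4 | 64.2 | 73.5 | 49.6 | 64.4 | 72.9 | 45.9 | 63.5 | 69.4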
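The nearest mean classifier and the leave-one-out evaluation described in this section can be sketched in a few lines of Python. This is not the PRTools implementation used in the study; the two-dimensional toy feature vectors, the class names and the function names are assumptions for illustration, and the unit left out here is a single feature vector.

```python
import math


def nearest_mean_predict(train_x, train_y, sample):
    """Assign `sample` to the class whose mean vector is closest (Euclidean)."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, value in enumerate(x):
            acc[d] += value
        counts[y] = counts.get(y, 0) + 1
    best_label, best_dist = None, math.inf
    for label, acc in sums.items():
        mean = [value / counts[label] for value in acc]
        dist = math.dist(mean, sample)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out_accuracy(xs, ys):
    """Train on all vectors but one, test on the held-out vector, repeat."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        hits += nearest_mean_predict(train_x, train_y, xs[i]) == ys[i]
    return hits / len(xs)


# Hypothetical feature vectors for two well separated classes.
xs = [[1.0, 0.1], [0.9, 0.2], [1.1, 0.0], [0.1, 1.0], [0.2, 0.9], [0.0, 1.1]]
ys = ["up", "up", "up", "down", "down", "down"]
accuracy = leave_one_out_accuracy(xs, ys)
```

Because the class means are recomputed without the held-out vector in every round, the estimate is not biased by testing on training data.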
In the 4-class case, the best performance is obtained with NMC for segment lengths of 1 s and 3 s. This suggests that the feature vectors tend to spread around the class means. Fisher's LDA gave the best performance for a segment length of 2 s. The same holds for the 5-class case. Although the mean classification accuracies obtained for female subjects are higher than those for male subjects, the difference between the means was not found to be statistically significant by a t-test (p = 0.0522). However, since the p-value is only marginally above the conventional 0.05 significance level, the null hypothesis (no difference between means) remains in doubt. Individual performances of the subjects are presented in Fig. 5 (NMC, length: 3 s, 4-class case).
Fig. 5. Individual (male: dark gray, female: light gray) performances in terms of mean overall accuracy (%) (NMC, 4-class, 3 s.)
In Fig. 5, 7 females reach a classification performance above 90% (2 of them reached 100%). On the other hand, worst performance also belongs to a female subject (52.5%). In male subjects, there is only 1 case above 90%. However, none of the males had a performance below 70%. Mean accuracies and standard deviations are 89.25% (± 15.09) for females and 78% (± 8.06) for males.
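The group statistics quoted above can be turned into a t statistic as a worked check. The sketch below assumes a two-sample Student t-test with pooled variance, 10 subjects per group, and that the reported spreads are sample standard deviations; the text does not state which t-test variant was used. The p-value is obtained by numerically integrating the t density, since the Python standard library has no t distribution.

```python
import math


def t_pdf(x, dof):
    """Density of Student's t distribution with `dof` degrees of freedom."""
    coef = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    return coef * (1 + x * x / dof) ** (-(dof + 1) / 2)


def two_sided_p(t_stat, dof, steps=2000):
    """Two-sided p-value via Simpson integration of the density over [0, |t|]."""
    upper = abs(t_stat)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, dof) + t_pdf(upper, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    central = 2 * total * h / 3  # probability mass inside [-|t|, |t|]
    return 1 - central


# Values reported for NMC, 4-class case, 3 s segments.
mean_f, sd_f, n_f = 89.25, 15.09, 10
mean_m, sd_m, n_m = 78.0, 8.06, 10

pooled_var = ((n_f - 1) * sd_f ** 2 + (n_m - 1) * sd_m ** 2) / (n_f + n_m - 2)
t_stat = (mean_f - mean_m) / math.sqrt(pooled_var * (1 / n_f + 1 / n_m))
dof = n_f + n_m - 2
p_value = two_sided_p(t_stat, dof)
```

Under these assumptions the statistic is about 2.08 with 18 degrees of freedom, which gives a two-sided p-value slightly above 0.05.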
5 Conclusions and Future Work
According to Tables 3 and 4, the classification accuracy improves as the segment length increases. This is expected and consistent with studies in the literature, since the frequency resolution becomes finer as the segment length grows (the spacing between DFT bins is the reciprocal of the segment length). When the segment is shorter, the 1st and 2nd harmonics of the stimulation frequencies are not well localized in the frequency domain. It should also be noted that the EEG was acquired in an ordinary laboratory exposed to electrical and acoustical noise, so that BCI performance could be measured in real-life conditions. There is no user feedback in this study; including user feedback may increase the classification performance. The method is simple and well suited to real-time BCI experiments. In future studies, real-time experiments will be performed and analyzed.
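The link between segment length and frequency localization can be made concrete with a short calculation. The sketch below uses the sampling frequency from Section 2.3 and the stimulation frequencies from Table 1; the variable names are illustrative, and the "worst offset" measure (largest distance from a stimulation frequency to its nearest DFT bin) is an assumption introduced here, not a quantity used in the study.

```python
fs = 500  # sampling frequency in Hz (Section 2.3)
stim_freqs = [4.60, 6.43, 8.03, 10.70]  # Table 1

resolution = {}    # segment length in s -> DFT bin spacing in Hz
worst_offset = {}  # segment length in s -> largest distance to the nearest bin
for seconds in (1, 2, 3):
    n_samples = fs * seconds
    step = fs / n_samples  # equals 1 / seconds
    resolution[seconds] = step
    worst_offset[seconds] = max(
        abs(freq - round(freq / step) * step) for freq in stim_freqs
    )
```

The bin spacing drops from 1 Hz for 1 s segments to 0.5 Hz and about 0.33 Hz for 2 s and 3 s segments, and the worst mismatch between a stimulation frequency and its nearest bin shrinks accordingly.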
Although SSVEP based BCIs work with healthy or moderately disabled people, they have not been validated with subjects who are unable to control their gaze. The prevailing view in the BCI literature is that SSVEP BCIs would not work for such severely disabled people. However, findings in the visual attention literature show that people can shift their visual attention among different stimuli without shifting their gaze [10]. This phenomenon is called covert attention, and it suggests that some SSVEP based BCIs may not depend on gaze control [10]. In the future, the dataset acquired for covert attention will be analyzed for classification.
Acknowledgments. This study was supported by Istanbul Technical University Scientific Research Project (ITU-BAP). Project No: 33246.
References
1. Friman, O., Volosyak, I., Gräser, A.: Multiple Channel Detection of Steady-State Visual Evoked Potentials for Brain-Computer Interfaces. IEEE Transactions on Biomedical Engineering 54(4), 742–750 (2007)
2. Srinivasan, R., Bibi, F.A., Nunez, P.L.: SSVEPs: Distributed Local Sources and Wave-like Dynamics are Sensitive to Flicker Frequency. Brain Topography 18(3), 167–187 (2006)
3. Maggi, L., Parini, S., Piccini, L., Panfili, G., Andreoni, G.: A four command BCI system based on the SSVEP protocol. In: Proceedings of the 28th IEEE EMBS Annual International Conference, New York, pp. 1264–1267 (2006)
4. Mukesh, T.M.S., Jaganathan, V., Reddy, M.R.: A novel multiple frequency stimulation method for steady state VEP based brain computer interfaces. Physiological Measurement 27, 61–71 (2006)
5. Brain Products GmbH - Solutions for neurophysiological research, http://www.brainproducts.com
6. Trejo, L.J., Rosipal, R., Matthews, B.: Brain–Computer Interfaces for 1-D and 2-D Cursor Control: Designs Using Volitional Control of the EEG Spectrum or Steady-State Visual Evoked Potentials. IEEE Transactions on Neural Systems and Rehabilitation Engineering 14(2), 225–229 (2006)
7. Müller-Putz, G.R., Pfurtscheller, G.: Control of an Electrical Prosthesis With an SSVEP-Based BCI. IEEE Transactions on Biomedical Engineering 55(1), 361–364 (2008)
8. PRTools: The Matlab Toolbox for Pattern Recognition, http://www.prtools.org/
9. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern classification, 2nd edn. John Wiley & Sons, New York (2001)
10. Allison, B.Z., McFarland, D.J., Schalk, G., Zheng, S.D., Jackson, M.M., Wolpaw, J.R.: Towards an independent brain–computer interface using steady state visual evoked potentials. Clinical Neurophysiology 119, 399–408 (2008)
Fast Projection Pursuit Based on Quality of Projected Clusters
Marek Grochowski¹ and Włodzisław Duch¹,²
¹ Department of Informatics, Nicolaus Copernicus University, Toruń, Poland
² School of Computer Engineering, Nanyang Technological University, Singapore
[email protected] Google: W. Duch
Abstract. The projection pursuit index measuring quality of projected clusters (QPC), introduced recently, optimizes projection directions by minimizing the leave-one-out error while searching for pure localized clusters. The QPC index has been used in constructive neural networks to discover non-local clusters in high-dimensional multiclass data, reduce dimensionality, aggregate features, and visualize and classify data. However, for $n$ training instances such optimization requires $O(n^2)$ calculations. The fast approximate version of QPC introduced here obtains results of similar quality with $O(n)$ effort, as illustrated on a number of classification and data visualization problems.
Keywords: Projection pursuit, Classification, Dimensionality reduction, Naive Bayes, Neural networks.
1 Introduction
Projection pursuit (PP) searches for the most "interesting" projections of multidimensional data by optimizing an objective function referred to as the projection index [1,2]. Many projection indices have been introduced, for both unsupervised and supervised learning. Algorithms such as principal component analysis (PCA), independent component analysis (ICA) and Fisher's discriminant analysis (FDA) are special cases of the projection pursuit approach. The "pursuit" aspect involves a search for a sequence of unique projections that give different, low-dimensional insights into data structures. Most PP algorithms, including the QPC presented here, use linear projections. The major advantage of PP methods is the potential to avoid the "curse of dimensionality" by reducing the data to a low-dimensional space. Noisy and non-informative features are ignored, and only the most valuable relations, depending on the definition of the projection index, are preserved in the reduced space.
Transformation of data by PP may be presented in the form of a feedforward neural network, where a sequence of hidden nodes represents successive projections and the learning procedure corresponds to optimization of a certain PP index. In contrast to backpropagation learning, which adjusts all network parameters at the same time, PP indices define intermediate goals for learning [3], making the final separation of the data much easier. The Quality of Projected Clusters (QPC) projection index [4] is aimed at finding linear transformations that create compact clusters of vectors, each with vectors from a single class, separated from other clusters. Each QPC node may thus map data into several useful pure clusters, while sigmoid functions in multi-layer perceptrons (MLPs) perform much simpler mappings.
In the next section the QPC index is described and some modifications that decrease its computational cost are discussed, followed by a comparison of the speed of learning and the quality of the generated projections in terms of classification generalization. Various algorithms may be used in the space generated by QPC transformations. Here we have used the Naive Bayes algorithm on the original data, and on the same data reduced by the fast and the original QPC versions.
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 89–97, 2011. © Springer-Verlag Berlin Heidelberg 2011
2 The QPC Projection Index Learning Speed Improvement
For a given dataset $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^d$, where each vector $x_i$ is associated with class $C(x_i)$, the QPC index is defined by [5]:
\[ QPC(w) = \sum_{i,j=1}^{n} \alpha_{ij}\, G\left(w^T (x_i - x_j)\right) \qquad (1) \]
where $\alpha_{ij}$ are real constants that satisfy the conditions: if $C(x_i) = C(x_j)$ then $\alpha_{ij} > 0$, and if $C(x_i) \neq C(x_j)$ then $\alpha_{ij} < 0$. The function $G(x)$ should be localized, with a maximum at $x = 0$; for example, it may be a Gaussian function. For a given direction $w \in \mathbb{R}^d$, vectors $x_i$ and $x_j$ increase the QPC value if, after projection on $w$, they fall close to each other and are from the same class; if they are from different classes, the QPC index is decreased by a value that depends on the distance between these vectors after projection on $w$. Thus maximization of Eq. (1) leads to a linear transformation that creates compact and pure clusters of vectors from the same class, well separated from other clusters, and provides a leave-one-out estimator measuring the quality of this projection. A proper choice of the constants $\alpha_{ij}$ and of the width of the function $G(x)$ may force QPC optimization to prefer solutions with higher between-cluster separation over solutions characterized by better within-class purity and compactness. In all experiments presented in this paper Gaussian functions were used for localization. To normalize the QPC index value, $\alpha_{ij} = 1/(n n_j)$ is used for all $i, j = 1, \ldots, n$ satisfying $C(x_i) = C(x_j)$, and $\alpha_{ij} = -1/(n(n - n_j))$ if $C(x_i) \neq C(x_j)$, where $n_j$ denotes the number of instances that belong to the class associated with $x_j$, and $n$ is the number of all instances.
Optimization of the QPC index provides solutions that may be useful in many supervised machine learning applications, for data visualization and dimensionality reduction. Recently [5,6] this index was successfully applied to train and construct several neural network architectures for classification of multi-class problems. A major disadvantage of QPC (like most projection pursuit indices) is its high computational cost. Each evaluation of Eq. (1) has computational complexity $O(dn^2)$, where $d$ is the number of dimensions and $n$ is the number of instances in the training dataset, which may make this approach useless for datasets with a large number of instances, especially when many iterations are needed for convergence of the optimization process.
This drawback can be overcome by using a set of prototypes $T = \{t_1, \ldots, t_k\}$ as reference points that provide an estimate of the class distribution of the dataset. For a given set of prototypes $T$, where each prototype $t_i$ is associated with class $C(t_i)$, the approximation of the QPC index may be expressed as follows:
\[ QPC(w) = \sum_{j=1}^{k} \sum_{i=1}^{n} \alpha_{ij}\, G\left(w^T (x_i - t_j)\right) \qquad (2) \]
where the constants $\alpha_{ij} > 0$ if $C(x_i) = C(t_j)$ and $\alpha_{ij} < 0$ if $C(x_i) \neq C(t_j)$. If the positions of the prototypes are not fixed, Eq. (2) has $(k+1) \times d$ parameters to optimize (where $k$ is the number of prototypes), while optimization of Eq. (1) must adjust only $d$ weight components. However, if $k \ll n$, the computational cost becomes linear in the number of instances and in the number of features, $O(kdn)$. Solutions generated by maximization of Eq. (2) strongly depend on the number of prototypes and their initialization (position and label association).
The algorithm described below computes an approximation to the QPC index value for a given direction without the need to find reference points, and may also be used to estimate initial positions of the prototypes. Consider the set of vectors $x_i \in \mathbb{R}^d$ ($i = 1, \ldots, n$) projected on the direction $w$, with the whole span of projected points divided into $k$ equal intervals of width $h$:
\[ y_{\min} = \min_i w^T x_i, \qquad y_{\max} = \max_i w^T x_i, \qquad h = \frac{1}{k}\,(y_{\max} - y_{\min}) \qquad (3) \]
Let $\beta_i$ be the center of the $i$-th interval:
\[ \beta_i = y_{\min} + h\,(i - 1/2), \qquad i = 1, \ldots, k \qquad (4) \]
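Eqs. (3) and (4) amount to a few lines of arithmetic, sketched below in Python. The two-dimensional data points, the direction and the function names are hypothetical and serve only to show the computation.

```python
def project(w, x):
    """Scalar projection w^T x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def interval_centers(projections, k):
    """Eq. (3): interval width h; Eq. (4): centers beta_1 .. beta_k."""
    y_min, y_max = min(projections), max(projections)
    h = (y_max - y_min) / k
    centers = [y_min + h * (i - 0.5) for i in range(1, k + 1)]
    return h, centers


# Hypothetical 2-D data and projection direction.
data = [(0.0, 3.0), (1.0, -1.0), (2.0, 0.5), (4.0, 2.0)]
w = (1.0, 0.0)
projections = [project(w, x) for x in data]
h, centers = interval_centers(projections, k=4)
```

For this example the projected span [0, 4] is cut into four intervals of width 1 with centers at 0.5, 1.5, 2.5 and 3.5.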
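For each class $C_m$ and the $j$-th interval the partial QPC index is defined by:
\[ \tilde{Q}_{C_m,j} = \sum_{i=1}^{n} \alpha_{ij}\, G\left(w^T x_i - \beta_j\right) \qquad (5) \]
where $\alpha_{ij} > 0$ if $C(x_i) = C_m$ and $\alpha_{ij} < 0$ if $C(x_i) \neq C_m$. Interval $j$ is associated with the class that gives the maximum:
\[ C(\beta_j) = \arg\max_{C_m} \tilde{Q}_{C_m,j} \qquad (6) \]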
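The labelling of intervals in Eqs. (5) and (6) can be sketched as follows. Several choices in this example are assumptions rather than statements from the text: the weights are normalized by analogy with the full index (positive weight 1/(n n_c) for vectors of the candidate class, negative weight 1/(n (n - n_c)) otherwise), $G$ is a Gaussian whose standard deviation equals the interval width, and the projected values and class names are made up.

```python
import math


def gaussian(x, sigma):
    return math.exp(-x * x / (2 * sigma * sigma))


def label_intervals(projections, labels, k):
    """Return interval centers and the class assigned to each by Eq. (6)."""
    n = len(projections)
    y_min, y_max = min(projections), max(projections)
    h = (y_max - y_min) / k
    centers = [y_min + h * (i - 0.5) for i in range(1, k + 1)]
    classes = sorted(set(labels))
    counts = {c: labels.count(c) for c in classes}
    assigned = []
    for beta in centers:
        scores = {}
        for c in classes:
            # Partial index of Eq. (5) for class c and the interval at beta.
            q = 0.0
            for y, lab in zip(projections, labels):
                if lab == c:
                    alpha = 1 / (n * counts[c])
                else:
                    alpha = -1 / (n * (n - counts[c]))
                q += alpha * gaussian(y - beta, h)
            scores[c] = q
        assigned.append(max(classes, key=scores.get))
    return centers, assigned


# Hypothetical projected values w^T x_i with their class labels.
projections = [0.0, 0.1, 0.2, 1.8, 1.9, 2.0]
labels = ["A", "A", "A", "B", "B", "B"]
centers, assigned = label_intervals(projections, labels, k=2)
```

Each interval receives the label of the class whose members dominate its neighborhood, so the two intervals in this example are labelled A and B.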
The approximate value of the QPC index for direction $w$ and $k$ intervals is computed from:
\[ QPC(w) \approx \sum_{j=1}^{k} \sum_{i=1}^{n} \alpha_{ij}\, G\left(w^T x_i - \beta_j\right) \qquad (7) \]
where $\alpha_{ij} > 0$ if $C(x_i) = C(\beta_j)$ and $\alpha_{ij} < 0$ if $C(x_i) \neq C(\beta_j)$. The computational cost of evaluating Eq. (7) is $O(kndc)$, where $c$ denotes the number of classes. Eq. (7) might be used directly to search for the optimal $w$; here, however, this approximation is used only to set the initial positions of the prototypes and their labels. The direction $w$ defines a line in $d$-dimensional space, $y = \gamma w + \mu$, where $\gamma \in \mathbb{R}$ and $\mu \in \mathbb{R}^d$ is an arbitrary point on this line, which may be taken as the center position of all data vectors $X$. Then, for a given direction $w$ and $k$ intervals with centers $\beta_i$, the initial positions of the prototypes $t_i \in \mathbb{R}^d$ placed on this line are given by:
\[ t_i = \beta_i w + \mu - (w^T \mu)\, w \qquad (8) \]
These prototypes are used here to initialize the optimization procedure of the QPC index given by Eq. (2). The maximum number of prototypes does not exceed the number of intervals $k$, but it may be reduced if prototypes for the same class become neighbors after projection. Additionally, the width of these intervals gives a direct estimate of the spread of the function $G(x)$. For Gaussian functions, setting the standard deviation to $\sigma = h$ guarantees that the partial QPC function $\tilde{Q}$ given by Eq. (5) will depend mostly on data projected inside the corresponding interval, and to a lesser extent on vectors that belong to the adjacent intervals.
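To make the difference between the full index of Eq. (1), the prototype-based index of Eq. (2) and the prototype placement of Eq. (8) concrete, the following Python sketch evaluates all three on a toy dataset. It is an illustration under stated assumptions, not the authors' implementation: the weights in Eq. (2) reuse the normalization given for Eq. (1), the direction is assumed to have unit length, the prototype labels are set by hand instead of through Eq. (6), and all names and data are hypothetical.

```python
import math


def project(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def gaussian(x, sigma):
    return math.exp(-x * x / (2 * sigma * sigma))


def alpha(label_i, label_j, n, class_sizes):
    """Normalized weights: positive for equal labels, negative otherwise."""
    n_j = class_sizes[label_j]
    return 1 / (n * n_j) if label_i == label_j else -1 / (n * (n - n_j))


def qpc_full(w, data, labels, sigma):
    """Eq. (1): sum over all pairs of vectors (quadratic in n)."""
    n = len(data)
    sizes = {c: labels.count(c) for c in set(labels)}
    y = [project(w, x) for x in data]
    return sum(
        alpha(labels[i], labels[j], n, sizes) * gaussian(y[i] - y[j], sigma)
        for i in range(n) for j in range(n)
    )


def place_prototypes(w, centers, mu):
    """Eq. (8): t_i = beta_i w + mu - (w^T mu) w, for a unit-length w."""
    shift = project(w, mu)
    return [
        tuple(beta * wi + mi - shift * wi for wi, mi in zip(w, mu))
        for beta in centers
    ]


def qpc_prototypes(w, data, labels, prototypes, proto_labels, sigma):
    """Eq. (2): vectors are compared with the k prototypes only (linear in n)."""
    n = len(data)
    sizes = {c: labels.count(c) for c in set(labels)}
    total = 0.0
    for t, t_label in zip(prototypes, proto_labels):
        y_t = project(w, t)
        for x, x_label in zip(data, labels):
            weight = alpha(x_label, t_label, n, sizes)
            total += weight * gaussian(project(w, x) - y_t, sigma)
    return total


# Hypothetical data: class A near x = 0, class B near x = 2.
data = [(0.0, 0.0), (0.1, 1.0), (-0.1, 2.0), (2.0, 0.0), (2.1, 1.0), (1.9, 2.0)]
labels = ["A", "A", "A", "B", "B", "B"]
w_good, w_bad = (1.0, 0.0), (0.0, 1.0)

y = [project(w_good, x) for x in data]
k = 2
h = (max(y) - min(y)) / k
centers = [min(y) + h * (i - 0.5) for i in range(1, k + 1)]
mu = tuple(sum(col) / len(data) for col in zip(*data))
prototypes = place_prototypes(w_good, centers, mu)
proto_labels = ["A", "B"]  # assumed here; the text obtains them from Eq. (6)

full_good = qpc_full(w_good, data, labels, sigma=h)
full_bad = qpc_full(w_bad, data, labels, sigma=h)
approx_good = qpc_prototypes(w_good, data, labels, prototypes, proto_labels, sigma=h)
```

On this toy set the separating direction scores clearly higher than the direction along which the classes overlap, and each prototype projects back onto its interval center.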
3 Results
3.1 Learning Speed Comparison
Tab. 1 compares the time needed for training with the standard QPC index defined by Eq. (1) (denoted here as QPC1) and with the approximated QPC index defined by Eq. (2) (denoted here as QPC2) on several classification problems of various size and complexity of inherent relations. Most of these datasets come from the UCI repository [7] (Abalone, Appendicitis, Australian Credit Rating, Breast Cancer Wisconsin, Glass, Heart, Ionosphere, Iris, Ljubljana Breast Cancer, Monk's 1 training part, Congressional Voting Records, Spam and Wine). In addition, two artificial datasets were used: the 10-dimensional parity problem and the Concentric Rings dataset, which contains 2 important features defining points inside 4 rings (one per class) and 2 noise variables drawn from a uniform distribution.
Both QPC1 and QPC2 use a Gaussian function for $G(x)$ and a gradient descent procedure with the same learning rate (0.1) and the same stop condition. Initial positions of the prototypes for QPC2 were set according to Eq. (8) with the number of intervals $k = 20$. To reduce the risk of poor local optima, each optimization process was initialized 10 times with different weight values $w$ drawn from $[-0.5, 0.5]$, and after a short optimization the most promising solution was run to convergence. Each learning procedure was repeated 10 times, and the average time required for convergence, the number of iterations and the final index value are reported in Tab. 1. The projection index values reported in Tab. 1, for both QPC1 and QPC2, were computed according to Eq. (1).
The results in Tab. 1 show a great improvement in the performance of QPC2 compared with QPC1. Wilcoxon's signed-rank test [8] indicates a significant difference in the average computation time at a confidence level of 99% (p-value of 0.0061) in favor of QPC2. The reduction in computation time is especially large for datasets with many instances, such as Abalone and Spam.
Results for those data were excluded from the statistical analysis to avoid dominance of these large values.
Projections obtained from QPC2 provide a good approximation of the solutions found by the full QPC1 index. In most cases the improvement in performance involves only a slight loss in the quality of the obtained solutions. Fig. 1 presents scatter plots generated by projecting the data vectors on the first two directions, $w_1^T x$ and $w_2^T x$, found by optimization of QPC1 and QPC2. The second direction $w_2$ was sought in the subspace orthogonal to the first one. For the Australian dataset a distinct separation between two groups of vectors is obtained; the first projection, on $w_1$, is sufficient to distinguish these two clusters. The Monk's 1 problem projected on the two-dimensional space generated by QPC2 revealed the inherent relations of this artificial dataset with symbolic features, leading to almost complete separation of instances with opposite labels. For the 10-bit parity problem both approaches found correct projections on diagonals of the hypercube representing the Boolean function. In the case of Concentric Rings the noise was suppressed and the two-dimensional ring structure hidden in the data was recovered.
Table 1. Comparison of performance of the full (QPC1) and approximate optimization (QPC2) of the QPC index
Data Set | Vec. | Feat. | Class | QPC1 Index [×10⁻²] | QPC1 Time [s] | QPC1 Iterations | QPC2 Index [×10⁻²] | QPC2 Time [s] | QPC2 Iterations
Appendicitis | 106 | 7 | 2 | 35.5 ± 0.2 | 3.6 ± 0.9 | 163.0 ± 95.5 | 32.3 ± 0.5 | 4.3 ± 0.6 | 111.0 ± 47.0
Monk's 1 | 124 | 6 | 2 | 15.2 ± 0.9 | 3.7 ± 1.6 | 148.0 ± 71.7 | 12.2 ± 1.4 | 3.9 ± 0.4 | 101.0 ± 34.5
Iris | 150 | 4 | 3 | 76.5 ± 0.1 | 2.0 ± 0.3 | 46.5 ± 12.3 | 75.6 ± 0.5 | 2.4 ± 0.1 | 58.0 ± 13.8
Wine | 178 | 13 | 3 | 64.9 ± 0.0 | 3.7 ± 0.4 | 77.0 ± 4.2 | 61.8 ± 0.6 | 4.0 ± 0.1 | 109.5 ± 15.2
Ionosphere | 200 | 34 | 2 | 47.1 ± 0.2 | 16.6 ± 11.7 | 213.0 ± 77.3 | 41.6 ± 0.9 | 5.0 ± 0.2 | 110.0 ± 24.7
Sonar | 208 | 60 | 2 | 37.3 ± 0.4 | 27.4 ± 19.7 | 178.0 ± 20.3 | 32.0 ± 0.5 | 7.8 ± 0.1 | 144.0 ± 10.7
Glass | 214 | 9 | 6 | 31.2 ± 0.0 | 5.0 ± 0.6 | 84.5 ± 19.9 | 28.3 ± 1.5 | 5.0 ± 0.5 | 117.0 ± 30.0
Heart Statlog | 270 | 13 | 2 | 29.9 ± 0.3 | 20.3 ± 1.9 | 238.0 ± 44.3 | 28.3 ± 0.5 | 6.8 ± 0.7 | 170.5 ± 47.4
L.Breast | 277 | 9 | 2 | 13.5 ± 0.1 | 14.5 ± 3.9 | 217.5 ± 111.5 | 10.6 ± 1.3 | 6.7 ± 0.8 | 107.5 ± 54.8
Heart Cleveland | 297 | 13 | 2 | 29.4 ± 0.2 | 28.7 ± 7.1 | 307.5 ± 156.6 | 27.9 ± 0.5 | 7.8 ± 0.9 | 246.0 ± 42.0
Voting | 435 | 16 | 2 | 70.5 ± 5.1 | 136.2 ± 9.0 | 855.0 ± 322.1 | 81.4 ± 0.3 | 10.8 ± 0.6 | 214.0 ± 14.5
Breast Cancer W. | 683 | 9 | 2 | 66.0 ± 0.0 | 65.8 ± 11.1 | 119.5 ± 26.5 | 59.9 ± 1.4 | 8.8 ± 0.8 | 172.0 ± 60.9
Australian Credit | 690 | 14 | 2 | 51.2 ± 0.1 | 54.3 ± 7.2 | 138.5 ± 28.9 | 49.9 ± 0.4 | 6.6 ± 0.3 | 89.5 ± 21.3
P.I.Diabetes | 768 | 8 | 2 | 17.8 ± 0.0 | 68.9 ± 13.3 | 120.0 ± 21.9 | 17.6 ± 0.1 | 6.9 ± 0.3 | 100.5 ± 19.1
Concentric Rings | 800 | 4 | 4 | 15.7 ± 0.2 | 49.2 ± 11.7 | 101.0 ± 62.8 | 15.2 ± 0.5 | 5.4 ± 1.1 | 75.0 ± 52.2
Parity 10-bits | 1024 | 10 | 2 | 26.6 ± 0.0 | 32.1 ± 5.6 | 22.5 ± 6.8 | 26.6 ± 0.0 | 17.7 ± 4.8 | 209.0 ± 243.7
Average | | | | 39.3 | 33.3 | 189.3 | 37.6 | 6.9 | 133.4
Wilcoxon p-value | | | | | | | 0.0106 | 0.0061 | 0.0879
Large data
Abalone | 4177 | 7 | 28 | 18.9 ± 0.1 | 3148.4 ± 609.8 | 184.0 ± 59.6 | 15.2 ± 0.2 | 29.8 ± 1.3 | 73.0 ± 13.4
Spam | 4601 | 57 | 2 | 26.2 ± 0.0 | 5260.5 ± 105.6 | 105.5 ± 2.8 | 25.3 ± 0.2 | 184.7 ± 4.1 | 102.0 ± 3.5
3.2 Comparison of Generalization
The QPC projection index may be used to generate new features that reveal interesting aspects of the analyzed data. Such features may be beneficial for training almost any learning machine. Tab. 2 presents results obtained by training the Naive Bayes (NB) classifier with kernel density estimation on the problems used for performance testing. The first column contains results of NB trained on the original data. The successive columns contain results for NB trained on data projected on 1, 2 and 3 directions generated by maximization of the full index (QPC1) and by its fast approximation (QPC2). Classification accuracy was estimated using 10-fold stratified cross-validation repeated 10 times for each dataset and each method. To compare the generalization of the NB classifier trained with and without the initial QPC transformation, the corrected resampled t-test [9] was used for each dataset, and significant differences (at significance level 0.05) are marked with dots (see Tab. 2).
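For readers unfamiliar with the corrected resampled t-test of Nadeau and Bengio [9], its statistic differs from the ordinary paired t statistic only in the variance term, which is inflated by the ratio of test to training set sizes to account for overlapping training sets. The Python sketch below shows the computation; the accuracy differences are invented for illustration, and the function names are hypothetical.

```python
import math
import statistics


def corrected_resampled_t(diffs, test_train_ratio):
    """t statistic with the variance correction of Nadeau and Bengio [9]."""
    j = len(diffs)
    mean = statistics.fmean(diffs)
    variance = statistics.variance(diffs)  # sample variance, divisor j - 1
    return mean / math.sqrt((1 / j + test_train_ratio) * variance)


def naive_t(diffs):
    """Ordinary paired t statistic that treats the splits as independent."""
    j = len(diffs)
    return statistics.fmean(diffs) / math.sqrt(statistics.variance(diffs) / j)


# Hypothetical accuracy differences (method A minus method B) for the
# 100 train/test splits of a 10 x 10-fold cross-validation.
diffs = [0.02, 0.01, 0.03, 0.00, 0.02, 0.01, 0.03, 0.02, 0.01, 0.02] * 10
t_corrected = corrected_resampled_t(diffs, test_train_ratio=1 / 9)
t_naive = naive_t(diffs)
```

For 10-fold cross-validation the test/train ratio is 1/9, so with 100 splits the corrected statistic is smaller than the naive one by a factor of about 3.5, which makes the test considerably more conservative.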
[Figure 1: scatter-plot panels titled Australian Credit Rating, Monks 1, Parity 10 and Concentric Rings; axes $w_1^T x$ and $w_2^T x$]
Fig. 1. Examples of the first two projections found by maximization of the full QPC1 index (left) and the approximated QPC2 index (right) for the Australian credit, the Monk’s 1 problem, the 10-bit Parity and the Concentric Rings
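Why a projection on a diagonal of the hypercube suits the parity problem can be checked by brute force. The sketch below enumerates all 1024 vectors of the 10-bit problem, projects them on the main diagonal and records which parity labels occur at each projected value. The unnormalized direction and the label convention (1 for an odd number of set bits) are assumptions of this example.

```python
from itertools import product

n_bits = 10
w = [1] * n_bits  # main diagonal of the hypercube (not normalized)

# Map every projected value w^T x to the set of parity labels found there.
clusters = {}
for bits in product((0, 1), repeat=n_bits):
    y = sum(wi * b for wi, b in zip(w, bits))
    clusters.setdefault(y, set()).add(sum(bits) % 2)

pure = all(len(found) == 1 for found in clusters.values())
```

The projection collapses the data into 11 clusters, each containing vectors of a single class, with the class alternating from one cluster to the next; this is the kind of structure a localized index such as QPC rewards.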
Table 2. Average accuracy of the Naive Bayes with kernel density estimation in the 10x10 stratified CV test for the whole dataset and after training on the dataset reduced to 1, 2 and 3 dimensions using two QPC versions
Data set | Naive Bayes | QPC1+NB 1 | QPC1+NB 2 | QPC1+NB 3 | QPC2+NB 1 | QPC2+NB 2 | QPC2+NB 3
Appendicitis | 84.4 ± 10.2 | 87.4 ± 8.2 | 86.1 ± 8.8 | 84.9 ± 9.6 | 87.1 ± 8.9 | 86.0 ± 9.2 | 86.1 ± 9.0
Monk's 1 | 71.5 ± 11.3 | 71.3 ± 11.0 | 82.7 ± 13.9 • | 89.2 ± 9.4 • | 67.2 ± 12.7 | 82.9 ± 13.0 • | 87.9 ± 11.0 •
Iris | 95.7 ± 4.9 | 98.0 ± 4.0 | 95.9 ± 5.2 | 95.8 ± 5.2 | 96.9 ± 4.6 | 95.9 ± 5.2 | 96.0 ± 5.1
Wine | 97.7 ± 3.5 | 92.5 ± 5.8 ◦ | 96.2 ± 5.2 | 97.7 ± 3.7 | 91.6 ± 6.1 ◦ | 97.4 ± 4.0 | 97.6 ± 3.8
Ionosphere | 84.4 ± 7.9 | 79.9 ± 9.1 | 84.0 ± 7.8 | 85.4 ± 7.3 | 81.7 ± 9.1 | 83.2 ± 8.0 | 85.5 ± 7.5
Sonar | 75.8 ± 10.1 | 74.1 ± 10.4 | 75.4 ± 10.1 | 75.8 ± 9.3 | 73.3 ± 10.5 | 75.9 ± 10.4 | 76.5 ± 9.0
Glass | 60.3 ± 9.9 | 55.3 ± 8.3 | 56.0 ± 8.7 | 59.9 ± 8.9 | 54.8 ± 9.8 | 56.5 ± 9.7 | 59.1 ± 9.8
Heart Statlog | 79.8 ± 7.3 | 80.2 ± 7.2 | 82.8 ± 6.8 | 82.6 ± 7.2 | 80.5 ± 7.5 | 82.7 ± 7.0 | 83.0 ± 7.1
L.Breast | 72.7 ± 6.1 | 72.3 ± 5.3 | 72.6 ± 6.6 | 73.7 ± 6.4 | 70.6 ± 6.3 | 70.8 ± 7.2 | 70.6 ± 8.0
Heart Cleveland | 79.3 ± 7.3 | 80.7 ± 7.7 | 82.8 ± 6.9 | 82.7 ± 7.4 | 80.5 ± 7.1 | 83.1 ± 7.6 | 83.5 ± 7.2 •
Voting | 89.8 ± 4.7 | 95.4 ± 2.9 • | 95.1 ± 3.1 • | 94.7 ± 3.4 • | 95.3 ± 3.0 • | 94.7 ± 3.1 • | 94.4 ± 3.2 •
Breast Cancer W. | 96.7 ± 2.0 | 96.1 ± 2.1 | 97.0 ± 1.9 | 97.0 ± 1.9 | 95.7 ± 2.3 | 96.9 ± 1.9 | 97.2 ± 1.8
Australian Credit | 68.4 ± 6.0 | 85.3 ± 4.7 • | 85.5 ± 4.4 • | 86.2 ± 4.7 • | 85.4 ± 4.5 • | 85.4 ± 4.4 • | 85.8 ± 4.4 •
P.I.Diabetes | 73.6 ± 5.1 | 76.4 ± 4.4 • | 74.9 ± 4.5 | 73.9 ± 5.2 | 76.3 ± 4.5 | 73.9 ± 4.6 | 72.7 ± 5.1
Concentric Rings | 85.9 ± 3.6 | 64.0 ± 4.3 ◦ | 86.4 ± 3.8 | 86.7 ± 3.6 | 63.3 ± 4.4 ◦ | 84.9 ± 4.9 | 85.6 ± 4.0
Parity 10 bits | 44.4 ± 6.9 | 85.5 ± 10.3 • | 90.2 ± 8.9 • | 90.9 ± 7.7 • | 89.3 ± 11.2 • | 93.3 ± 7.7 • | 94.9 ± 6.6 •
Average | 78.8 | 80.9 | 84.0 | 84.8 | 80.6 | 84.0 | 84.8
Win/Tie/Lose | | 4/10/2 | 4/12/0 | 4/12/0 | 3/11/2 | 4/12/0 | 5/11/0
Wilcoxon NB vs. QPC+NB p-value | | 0.756 | 0.049 | 0.002 | 0.918 | 0.109 | 0.039
Wilcoxon QPC1+NB vs. QPC2+NB p-value | | | | | 0.121 | 0.776 | 0.717
• statistically significant improvement, ◦ statistically significant degradation
Features produced by QPC2 lead to accuracy similar to that of the full QPC1. Wilcoxon's signed-rank test shows no significant difference in the accuracy of NB trained on data projected on 1, 2 and 3 directions obtained by the two QPC optimizations, giving a p-value greater than 0.1 in all three cases (Tab. 2, last row). For all datasets the t-test also shows no significant differences in NB accuracy between the QPC1 and QPC2 transformations. In most cases NB trained on data projected on the first QPC direction produces results that are not significantly different from NB trained on the original data (10 ties according to the corrected resampled t-test at a significance level of 5%). For 2 datasets the t-test shows a difference in accuracy in favor of the original NB, but for 4 datasets the QPC transformation improved NB generalization. For NB trained on data projected on the first two directions no significant degradation of accuracy is noted in comparison with NB trained on the original dataset. Wilcoxon's signed-rank test confirms that there is no significant difference between the accuracy of NB trained on the first QPC projection and NB trained on the original data, and that there is a significant difference in favor of NB trained on data projected to 2 or 3 dimensions obtained from the QPC index, both for QPC1 and QPC2. Thus a great reduction in dimensionality is obtained by using QPC features.
4 Discussion
The approximate version of the Quality of Projected Clusters projection pursuit method introduced in this paper greatly reduces training time without degrading the quality of the results. As has already been stressed [10], separability is not the best goal of learning when problems are difficult; some intermediate tasks should be defined to derive information that may help in finding optimal solutions. Many methods fail on difficult problems, such as the parity problem or the noisy concentric rings problem, but searching for a good linear projection direction, followed by simple one-dimensional nonlinear functions that distinguish pure clusters after the projection, handles such problems without much effort. Therefore we are confident that such methods provide important computational intelligence tools.
Projections found by QPC may be used to enhance data representation by expanding feature spaces (this was done in [11], where remarks on relations with kernel methods may be found). Each projection may also be implemented as a node in a hidden layer of a feedforward network. This may be either followed by a simple linear layer (as in multilayer perceptrons), or used only for initialization of weights. The prototypes obtained from QPC2 training may be used directly for classification as nearest prototype vectors, or for initialization in any radial-basis function method.
The full QPC index has already been successfully applied to several constructive neural network architectures, including QPC-NN [6] and QPC-LVQ [5]. The QPC-NN method builds a neural network by optimizing the QPC index within the general sequential constructive scheme proposed in [12]. QPC-LVQ combines learning vector quantization [13], which maps local relations, with linear projections given by QPC, which handle non-local relations. The modification introduced in Section 2 should considerably speed up the QPC-based networks without loss of their generalization power. Results of all these procedures will be presented in a longer paper in the near future.
Acknowledgment. This work was supported by the Polish Ministry of Higher Education under research grant no. N N516 500539.
References

1. Friedman, J.H., Tukey, J.W.: A projection pursuit algorithm for exploratory data analysis. IEEE Trans. Comput. 23(9), 881–890 (1974)
2. Friedman, J.: Exploratory projection pursuit. Journal of the American Statistical Association 82, 249–266 (1987)
3. Duch, W.: K-separability. In: Kollias, S.D., Stafylopatis, A., Duch, W., Oja, E. (eds.) ICANN 2006. LNCS, vol. 4131, pp. 188–197. Springer, Heidelberg (2006)
4. Grochowski, M., Duch, W.: Projection Pursuit Constructive Neural Networks Based on Quality of Projected Clusters. In: Kůrková, V., Neruda, R., Koutník, J. (eds.) ICANN 2008, Part II. LNCS, vol. 5164, pp. 754–762. Springer, Heidelberg (2008)
5. Grochowski, M., Duch, W.: Constrained learning vector quantization or relaxed k-separability. In: Alippi, C., Polycarpou, M., Panayiotou, C., Ellinas, G. (eds.) ICANN 2009. LNCS, vol. 5768, pp. 151–160. Springer, Heidelberg (2009)
6. Grochowski, M., Duch, W.: Constructive Neural Network Algorithms that Solve Highly Non-Separable Problems. Studies in Computational Intelligence, vol. 258, pp. 49–70. Springer, Heidelberg (2010)
7. Asuncion, A., Newman, D.: UCI machine learning repository (2007), http://www.ics.uci.edu/~mlearn/MLRepository.html
8. Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics 1, 80–83 (1945)
9. Nadeau, C., Bengio, Y.: Inference for the generalization error. Machine Learning 52(3), 239–281 (2003)
Fast Projection Pursuit Based on QPC
10. Duch, W.: Towards comprehensive foundations of computational intelligence. In: Duch, W., Mandziuk, J. (eds.) Challenges for Computational Intelligence, vol. 63, pp. 261–316. Springer, Heidelberg (2007)
11. Maszczyk, T., Duch, W.: Support feature machines: Support vectors are not enough. In: World Congress on Computational Intelligence, pp. 3852–3859. IEEE Press, Los Alamitos (2010)
12. Muselli, M.: Sequential constructive techniques. In: Leondes, C. (ed.) Optimization Techniques. Neural Network Systems, Techniques and Applications, vol. 2, pp. 81–144. Academic Press, San Diego (1998)
13. Kohonen, T.: Self-organizing maps. Springer, Heidelberg (1995)
A New N-gram Feature Extraction-Selection Method for Malicious Code

Hamid Parvin, Behrouz Minaei, Hossein Karshenas, and Akram Beigi

School of Computer Engineering, Iran University of Science and Technology (IUST), Tehran, Iran
{Parvin,B_Minaei,karshenas,Beigi}@iust.ac.ir
Abstract. N-grams are the basic features commonly used in sequence-based malicious code detection methods in computer virology research. Empirical results from previous work suggest that, while short n-grams are easier to extract, the characteristics of the underlying executables are better represented by longer n-grams. However, as the length of an n-gram increases, the feature space grows exponentially and demands considerable memory and computation. Feature selection has therefore turned out to be the most challenging step in establishing an accurate detection system based on byte n-grams. In this paper we propose an efficient feature extraction method in which both adjacent and non-adjacent bi-grams are used in order to gain more information. Additionally, we present a novel boosting feature selection method based on a genetic algorithm. Our experimental results indicate that the proposed detection system detects virus programs far more accurately than the best previously known methods.

Keywords: Malicious Code, N-gram Analysis, Feature Selection.
1 Introduction

The main aim of machine learning is to improve performance on a given task, and so it tries to find and exploit regularities in training data. Machine learning has the general goal of constructing computer programs that improve automatically with experience. Detecting fraudulent credit card transactions is one of its successful applications, and there are many others to which machine learning can be applied successfully [1]. The promising results obtained by applying machine learning techniques in many fields, especially in intrusion detection, have encouraged researchers to use them in the virus detection problem as well [2, 3]. The obvious advantage is that there is no need to go through the laborious process of building a database of virus signatures. Instead, a sample set of malicious and benign codes is used to train a classifier, and the trained classifier is then used to evaluate new executables and detect malicious ones. Although researchers began applying machine learning and data mining techniques to this field only recently, quite interesting results have been obtained, which raises the hope of further success in the near future.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 98–107, 2011. © Springer-Verlag Berlin Heidelberg 2011

Prior to the classifier’s training phase, the features of the data that best discriminate the target classes of the problem should be extracted from the
set of all available features. The principal aim of this task is to reduce the dimensionality of the feature space as much as possible while preserving the semantics of the data. In this context, the features that best discriminate malicious codes from benign ones should be selected and used in the training and classification process. Researchers have proposed a variety of features for malicious code detection, such as binary profiling of files, string sequences, hex dumps [4] or a table representation of the file [5]. N-gram analysis, initially used in natural language processing and document search, is one of the most important techniques for feature extraction. Byte n-grams are overlapping substrings collected in a sliding-window fashion, where a window of fixed size slides one byte at a time. The huge number of n-grams that often results from the feature extraction process makes them impractical to use directly in classification techniques. Therefore, a feature selection mechanism is necessary. Several feature selection techniques applicable to n-gram features are proposed in [6]; information gain [7] and class-wise document frequency [8] are among the most important techniques proposed in related work.

In this paper we propose an efficient bi-gram extraction technique which, in contrast to previous work that uses only adjacent bytes, also considers non-adjacent bytes in order to capture more byte dependency information. Once the feature extraction phase is over, a novel boosting feature selection technique based on a genetic algorithm gradually selects the most discriminating bi-grams as the final features to be used in classifier training and classification.

The paper proceeds with a brief theoretical background on the issues discussed here. Our proposed method is explained in Section 3. Section 4 presents the experimental results. Section 5 concludes the paper.
2 Background

Cohen [6] carried out the first major theoretical study of viruses. He defined a virus as a program that can infect other programs by modifying them to include a possibly evolved copy of itself [7, 8]. He used a Turing machine to introduce the notion of viral sets and to formalize a virus as a word on a Turing machine tape with the ability to duplicate or mutate when it is activated in a suitable environment. On this theoretical basis he showed that virus detection is, in general, an undecidable problem [9, 10]. Adleman [10] proposed the definition of a wider class of attacks, namely computer infections or malware. He defined several properties of a program and, using these properties, defined different kinds of programs and viruses. Filiol [9] defined malware as a simple or self-reproducing offensive program that is installed in an information system without the users’ knowledge in order to violate the confidentiality, integrity or availability of that system, or that is liable to falsely incriminate the owner or user of the program in a computer offense. Kolter and Maloof [11] referred to malicious code as any code added, changed or removed from a software system to intentionally cause harm or subvert the system’s intended function. Reddy and Pujari [7] defined a computer virus as code that recursively replicates a possibly evolved copy of itself.
As can be observed, all of the proposed definitions share a key feature: the ability of a virus to self-replicate (which relates its name to biological viruses).

In order to classify new samples, a classifier must first pass a learning phase in which it is trained with a set of training samples. Each instance in the sample set is represented by the values of a number of features. These feature values describe different instances of the domain we are dealing with. For example, features such as skin color, size and gender, with values yellow, 160 centimeters and female respectively, possibly describe a woman living in the Far East. After the training phase, the classification accuracy of a classifier is estimated by evaluating it on an unseen test data set. Usually a single data set is partitioned into two parts to provide data for both training and testing. K-fold cross-validation is an alternative that divides the data set into k separate parts, uses k-1 of these as the training set and the remaining part as the test set. This process is repeated k times, each time with a different part as the test set. At the end, classification accuracy is averaged over all runs.

In the context of malicious code detection, n-grams are byte sequences extracted from binary executables which represent certain characteristics of the implemented code. The length of an n-gram is an important parameter that affects all phases of a classifier-based detection system, including feature extraction, feature selection and classification. For example, when n-grams of two bytes are extracted, a total of $2^{16}$ different features are possible, irrespective of which executable is being processed. Obviously, as the length of the n-gram increases, the number of different possible features grows exponentially.
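The exponential growth mentioned above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming byte-level n-grams over a 256-symbol alphabet; the function name `ngram_feature_space` is illustrative and does not come from the paper.

```python
def ngram_feature_space(n: int, alphabet_size: int = 256) -> int:
    """Number of distinct n-grams over an alphabet (256 values per byte)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return alphabet_size ** n


# Print the size of the feature space for n-gram lengths 1 to 4.
for n in range(1, 5):
    print(n, ngram_feature_space(n))
```

For n = 2 this gives the $2^{16}$ features quoted in the text; each additional byte multiplies the count by 256.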
To cope with this difficulty, a second parameter is introduced into the n-gram extraction process, which sets an upper bound on the number of different n-grams that can be extracted from an input file. This upper bound is often referred to as the profile size [5, 7, 11]. The n-gram extraction process starts by sliding a window of size n over the input file and handling each value spotted in the window according to one of the following rules:

• If the spotted value has been seen before, its count is increased.
• Otherwise, the spotted value is treated as a new n-gram and its count is set to 1.

The sliding window starts at the beginning of the input file and moves a single byte toward the end of the file each time. The same process is repeated for all executables in the sample set. In this way a vast number of different n-grams with different frequencies are recorded, so a selection mechanism must be employed; otherwise the final detection system would not be practical. To allow the selection strategy to pick the most discriminating n-grams among all extracted n-grams, “group of files” statistics are needed in addition to the per-file statistics of n-grams. The selected n-grams are used as the sample features that feed the classifier. Note that this feature selection is performed only once, during the training of the classifier, according to the statistical information gained from the training samples. Afterwards, when the detection system based on the classifier is deployed, the n-gram extraction mechanism searches only for the selected features in the file under test.
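The sliding-window counting described above can be sketched as follows. This is only an illustration: the function names are hypothetical, and interpreting the profile size as "keep the most frequent n-grams" is an assumption, since the text only says that the profile size bounds the number of n-grams per file.

```python
from collections import Counter


def count_ngrams(data: bytes, n: int = 2) -> Counter:
    """Slide a window of n bytes one byte at a time and count each value seen."""
    counts = Counter()
    for i in range(len(data) - n + 1):
        # A previously seen value has its count increased;
        # a new value starts with a count of 1.
        counts[data[i:i + n]] += 1
    return counts


def profile(data: bytes, n: int = 2, profile_size: int = 3):
    """Assumed reading of 'profile size': keep only the most frequent n-grams."""
    return count_ngrams(data, n).most_common(profile_size)


sample = b"\x4d\x5a\x90\x00\x4d\x5a\x90\x00\x03"  # made-up bytes, not a real file
counts = count_ngrams(sample, 2)
print(len(counts), "distinct bi-grams;", profile(sample))
```

A file of length L yields L - n + 1 windows, so the total of all counts is a quick sanity check on the extraction.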
Many feature selection methods have been proposed for problems where the feature space has many dimensions. Some of these methods are recognized as particularly applicable to virus detection. Document frequency-based and information gain-based feature selection methods are among the most important ones proposed in recent work [7, 8]. Virus detection is usually treated as a two-class classification problem, i.e. virus versus benign. Both methods therefore aim to discriminate a class of interest from the other class. The document frequency-based method selects the features that are most frequently present in the class of interest and then applies the same approach to the other class to select further features. The final feature set is the union of the two selected sets. In the information gain-based method, features are sorted in descending order of information gain and those with the highest values are selected. The information gain metric reflects the degree of correlation between a feature and the class labels: the higher the information gain of a feature, the better it discriminates between the classes. Our proposed method is compared experimentally with these methods, and it is shown to detect viruses far more accurately.
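To make the information gain criterion concrete, the sketch below computes it for a single binary feature (n-gram present or absent) against class labels, using the standard definition IG = H(labels) - H(labels | feature). The function names and the toy labels are illustrative assumptions and are not taken from [7, 8].

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(present, labels):
    """Entropy of the labels minus their entropy given presence/absence."""
    total = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for x, y in zip(present, labels) if bool(x) == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain


labels = ["virus", "virus", "benign", "benign"]
print(information_gain([1, 1, 0, 0], labels))  # feature separates the classes
print(information_gain([1, 0, 1, 0], labels))  # feature carries no information
```

Ranking all candidate n-grams by this value and keeping the top ones corresponds to the information gain-based selection described above.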
3 Proposed Method

As explained earlier, good classification requires selecting representative features that can properly discriminate data samples. When n-gram analysis is used to extract such features, the first critical decision is the choice of n-gram length. Single-byte n-grams prove ineffective because they lack the information implicitly contained in combinations of bytes. Two-byte n-grams are able to capture dependencies between adjacent bytes to some extent. As the n-gram length increases, the number of different combinations of byte values grows exponentially, which allows even more intrinsic characteristics of the executables to be captured. This exponential growth, however, affects the computational and memory requirements of the feature extraction process. It is believed that shorter n-grams contain the statistics of longer n-grams to a large extent, and thus a length of two bytes should suffice [7]. In practice, however, larger n-gram lengths, e.g. a length of four [5, 7], have shown better results. We are therefore confronted with a tradeoff between obtaining better features and reducing the computational cost of the algorithm to keep it feasible.

To deal with this difficulty, we propose a new method for n-gram extraction. It considerably decreases the number of potential byte combinations while also making it possible to analyze combinations of non-adjacent bytes. In this method only certain bytes of the sliding window used in n-gram extraction are taken as n-grams, and the rest are not considered. In our experiments we have taken only the first and the last bytes of the sliding window to constitute the n-grams. This captures dependencies between non-adjacent bytes while keeping the maximum number of possible combinations constant.
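The idea of keeping only the first and last byte of each window can be sketched as below. The function names are hypothetical; the default window sizes of 2 to 6 bytes (gaps of zero to four bytes) follow the setting used in the experiments, and one pass over the file is made per window size.

```python
from collections import Counter


def gapped_bigrams(data: bytes, window: int) -> Counter:
    """Count (first byte, last byte) pairs over every window of the given size."""
    if window < 2:
        raise ValueError("window must cover at least two bytes")
    counts = Counter()
    for i in range(len(data) - window + 1):
        counts[(data[i], data[i + window - 1])] += 1
    return counts


def all_gap_streams(data: bytes, windows=range(2, 7)):
    """One bi-gram stream per window size, keyed by the gap (window - 2)."""
    return {w - 2: gapped_bigrams(data, w) for w in windows}


streams = all_gap_streams(bytes([1, 2, 3, 4, 5, 6]))
print(sorted(streams))   # gap sizes
print(streams[4])        # only one window of size 6 fits in six bytes
```

Whatever the window size, each stream has at most 256 × 256 distinct keys, which is why the number of possible combinations stays constant.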
To enrich the set of possible features used in classification, window sizes from 2 to 6 bytes are used. In this way the gap between
the bytes of the n-gram is increased from zero to four bytes. Clearly, extracting these n-grams with the different gap sizes requires five consecutive passes over the input file. However, because of the low memory requirement, this is much faster than extracting n-grams of larger lengths, e.g. four, as used in previous work. With five streams of n-grams of length two (bi-grams), each with a different gap size, extracted from every executable in the training set, a new selection mechanism is applied to select a subset of these n-grams as the final sample features. Because of its general-purpose nature, the genetic algorithm (GA) [12] has proved to be a good search and optimization mechanism for domains with unknown structure. The selection strategy used in this paper employs a genetic algorithm to search the space of all possible n-gram combinations for those that best discriminate between benign and malicious executables (the two classes of interest).

An important notion used in this selection mechanism is the reference vector. We define a reference vector as one consisting of m sub-vectors, each of which is a binary vector with one entry for every possible combination of n byte values (i.e. $2^{8n}$ entries). This reference vector is later used as the base representation for input samples as well as the reference “variable set” from which the genes of the GA chromosomes are taken. In this work we use a reference vector with m=5 and n=2, resulting in five binary sub-vectors of $2^{16}$ entries each. Each input file (sample) is initially represented by an instance of the reference vector that corresponds to the n-gram streams extracted from it. A value of 1 in an entry of this vector means that the n-gram represented by its index is present in the corresponding input file. The genetic algorithm uses a binary representation for its variables, in which each gene indicates whether a specific bi-gram is used in the chromosome or not.
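A reference vector with m=5 and n=2 can be sketched as five presence arrays of $2^{16}$ entries. The index encoding (first byte × 256 + second byte) and the function name are assumptions made for illustration; the paper does not state how indices are assigned.

```python
def reference_vector(data: bytes, gaps=(0, 1, 2, 3, 4)):
    """Return one binary sub-vector of 2**16 entries per gap size."""
    sub_vectors = []
    for gap in gaps:
        presence = bytearray(256 * 256)   # all entries start at 0
        step = gap + 1                    # distance between the two bytes
        for i in range(len(data) - step):
            # Assumed index encoding: first byte * 256 + second byte.
            presence[data[i] * 256 + data[i + step]] = 1
        sub_vectors.append(presence)
    return sub_vectors


vec = reference_vector(bytes([0x4D, 0x5A, 0x90]))  # made-up three-byte input
print(len(vec), len(vec[0]), [sum(v) for v in vec])
```

With this layout, a chromosome gene and a file's presence bit share the same index, so selecting a bi-gram amounts to picking a position in one of the five sub-vectors.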
In order to simplify the chromosomes and to increase the efficiency of the genetic algorithm, the bi-grams represented in the reference vector are processed in groups of 1000 at a time. First, the fittest of the first 1000 bi-grams of the first sub-vector are selected by the GA; then the first 1000 bi-grams of the second sub-vector are considered, then the first 1000 bi-grams of the third sub-vector, and so on. When the last sub-vector has been reached, the algorithm returns to the second 1000 bi-grams of the first sub-vector, and the whole process is repeated until the desired number of bi-grams has been obtained as selected features. The fitness function used to evaluate the population of chromosomes in the GA takes the following general form:
$$f_C(\mathrm{chromosome}) = A_C - A_{C'} \qquad (1)$$
where $A_C$ is the average number of times the features included in the chromosome (those genes having a value of 1) appear in the files of class C (i.e. malicious or benign), and $A_{C'}$ is the average number of times that these features appear in the files of the complement class, C′. Class C can be either benign or malicious, and C′ is then the other one. Once the groups of 1000 bi-grams of each sub-vector have been processed and the optimal set of features discriminating class C from class C′ has been selected, the whole process (the genetic algorithm) is run once again to obtain the optimal set of features discriminating class C′ from class C. Therefore the role of C and C′ changes in each
[Figure 1: flowchart. Recoverable block labels: Virus Files; Benign Files; Feature Extractor; Dataset; Selected Dataset; Probability Vector; Save Selected Features; Project on next 1000 features; Clustering Accuracy Improved (Yes/No); Project Data into All Saved Selected Features; Use AdaBoosting to learn the data; Final Virus Detector]
Fig. 1. The proposed Virus Detector Scheme
of these runs. [By introducing a weighting mechanism, the role of well-separated data points can be reduced in the scoring procedure. When two data points have the same weight, a positive score is added to the chromosome score if the data point is a member of the benign files, and a negative score otherwise.] For the feature selection algorithm to be able to decide when it has selected enough features, a data clustering is performed. If the clustering accuracy has reached a predefined threshold, which we call Tc, the algorithm terminates. The best place to perform this evaluation is after each group of 1000 n-grams has been processed. For more detailed information about the proposed method, see Fig. 1 and the pseudocode of the described algorithm in Fig. 2.
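The fitness of Eq. (1) can be sketched in a few lines. The sketch assumes one particular reading of "average number of times": the mean, over the files of a class, of how many selected features are present in each file. The weighting mechanism mentioned in brackets is not modeled, and all names and toy vectors are illustrative.

```python
def class_average(chromosome, files):
    """Mean number of selected features (genes equal to 1) present per file."""
    if not files:
        return 0.0
    hits = sum(sum(g & f for g, f in zip(chromosome, file)) for file in files)
    return hits / len(files)


def fitness(chromosome, class_files, other_files):
    """Eq. (1): A_C minus A_C' for the given chromosome."""
    return class_average(chromosome, class_files) - class_average(chromosome, other_files)


# Toy presence vectors over four candidate bi-grams (made-up data).
malicious = [[1, 0, 1, 0], [1, 1, 0, 0]]
benign = [[0, 1, 0, 1], [0, 0, 1, 1]]
chromosome = [1, 0, 1, 0]

print(fitness(chromosome, malicious, benign))   # C = malicious
print(fitness(chromosome, benign, malicious))   # roles of C and C' swapped
```

Swapping the two file lists gives the fitness used in the second run, where the roles of C and C′ are exchanged.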
1. All bi-gram features are extracted from both virus and benign files.
2. The selection probabilities of all initially extracted vectors are set to one.
3. The new-space (the set of selected features) is initialized to an empty set.
4. The following steps are repeated as long as the clustering accuracy in the new-space does not decrease:
5. According to the selection probabilities and the SUS algorithm, a subset of the data is selected as the input of the genetic algorithm.
6. The GA is run and some new features are selected.
7. The new-space is updated according to the features selected by the GA.
8. The data is mapped to the new-space.
9. The selection probabilities are updated according to the results of the clustering on the newly mapped data.

At the beginning, adjacent and various non-adjacent bi-grams (with gaps ranging from one to four bytes) are extracted according to the method discussed in section 3.2. As our proposed method tries to select the most discriminative features in an evolutionary manner, a boosting scheme is employed. In general, boosting methods put more emphasis on data samples that have not been adequately learned in previous iterations. The scheme used here is a variant of the boosting method called arc-x4, explained in [13]. Therefore, in the proposed feature selection method, the input vectors of the GA
Feature Selection Algorithm:
  Malicious_Features = Extract_Feature(Malicious_Files)
  Benign_Features = Extract_Feature(Benign_Files)
  Data_Number = Malicious_Files_Number + Benign_Files_Number
  P_Select(1..Data_Number) = 1
  Previous_Accuracy = -1    (below the initial Current_Accuracy so that the loop is entered)
  Current_Accuracy = 0
  Feature_Selected_Sofar = {}
  While (Previous_Accuracy < Current_Accuracy)
    Previous_Accuracy = Current_Accuracy
    [Malicious_Subset, Benign_Subset] = SUS_Selection(Malicious_Features, Benign_Features, P_Select)
    Feature_Selected_Now = Run_GA over the next 1000 features of the selected data
    Feature_Selected_Sofar = Feature_Selected_Sofar ∪ Feature_Selected_Now
    Newdata = Project_to_New_Feature(Malicious_Features, Benign_Features, Feature_Selected_Sofar)
    [Result, Current_Accuracy] = K-means(Newdata)
    Update P_Select according to Result of K-means
  End_While
Fig. 2. Feature Selection Algorithm
are selected according to a boosting probability vector. As can be observed in the pseudocode above, all entries of this vector are initially set to one. How each entry is applied in the next iteration depends on an adaptive value, which is computed from the distance of the corresponding data sample to the centroid of the cluster dominated by data samples with the same class label. Next, a clustering is performed over the data samples selected according to the probability vector, mapped into the feature space selected so far. This clustering provides cluster centroids, which are then used for the final clustering of all data samples. According to the results of the latter clustering, the clustering accuracy is estimated and the probability vector is updated for use in the next iteration. The whole algorithm continues as long as the clustering accuracy does not decrease in the following iteration.

At the end of the feature selection process, a set of promising features (Xs) is obtained that results in a clustering accuracy higher than Tc. Using this set, each data sample is assigned a binary vector of length |Xs| that specifies which of the selected features are present in the corresponding executable. The input data set is thus reduced to a table of binary vectors, which serve as the final inputs to the classifier.

Because of the inherently unstructured nature of the virus detection problem when byte sequences are used, we preferred to use artificial neural networks (ANNs) as our classifier. ANNs are well recognized for their capability to cope with unclearly specified problem domains. Artificial neurons are presented as models of biological neurons and as conceptual components of circuits that can perform computational tasks. The training phase of a typical ANN consists of adjusting the weights of the links between the neurons of the network.
A training algorithm based on the least mean square error is used, which tries to minimize the classification error of the network on the training samples. As mentioned earlier, a classifier ensemble is an approach that tries to obtain better classification results by combining several classifiers. One such method is boosting, in which several versions of the same classifier are
trained on different areas of the input domain. In this method a classifier is trained on the sample data, and if its test error is not satisfactory, a second classifier is trained on the erroneous data points. This process is repeated until the test error rate of the final classifier decreases to a satisfactory level. The final classifier ensemble is the combination of all trained classifiers used together.
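Returning to the feature selection loop of Fig. 2, the sampling step named SUS is not spelled out in the text. Assuming that SUS stands for stochastic universal sampling, a minimal sketch of that standard technique is given below; the function name and the example weights are illustrative.

```python
import random


def sus_select(weights, k, rng=random):
    """Stochastic universal sampling: k equally spaced pointers, one random offset."""
    total = sum(weights)
    if total <= 0 or k <= 0:
        raise ValueError("need positive total weight and k > 0")
    step = total / k
    start = rng.random() * step          # single random draw in [0, step)
    selected = []
    index = 0
    cumulative = weights[0]
    for j in range(k):
        pointer = start + j * step
        # Advance to the item whose cumulative weight interval holds the pointer.
        while cumulative <= pointer and index < len(weights) - 1:
            index += 1
            cumulative += weights[index]
        selected.append(index)
    return selected


rng = random.Random(0)
print(sus_select([1, 1, 1, 1], 4, rng))   # equal weights: every sample once
print(sus_select([0, 3, 0, 1], 4, rng))   # zero-weight samples are never drawn
```

Because the pointers are equally spaced, each sample is drawn a number of times that differs from its expected count by less than one, which gives a lower-variance selection than repeated roulette-wheel draws.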
4 Experimental Results

The data samples used in our work consist of 411 malicious and 416 benign executables. The malicious programs are Win32 malware taken from the VX Heavens site [14], and the benign executables are taken from the system32 folder of the Microsoft Windows operating system. To avoid any harmful effect, the n-gram extraction process is carried out on the SUSE Linux operating system. The resulting n-gram streams are then used in the Matlab software from MathWorks for selection and classification.

The genetic algorithm uses binary tournament selection and uniform crossover, with each parent having a 50% chance of transmitting its genes and a crossover probability of 0.8. Mutation is performed with a probability of 0.01. A complete replacement strategy is used to incorporate new offspring into the population. The termination criteria are 200 generations of a changing population or 20 generations of an almost stable population. A population size of 5000 individuals is used for evolution. We used the simple k-means algorithm as the clustering technique to evaluate the features selected for classification, with an accuracy threshold (Tc) of 80%. After the appropriate features have been selected, the complete data set of benign and virus executables is divided into training and test parts using 4-fold cross-validation.

To examine the effect of the extracted n-grams with different gap sizes on classification accuracy, they are compared with the case in which only adjacent bytes are used. Fig. 2 shows the result of this experiment in terms of ROC curves. As can be seen, the classifier that uses this type of feature completely dominates the one that uses only adjacent two-byte n-grams. The results presented in this figure were generated using the proposed selection mechanism.
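The GA settings reported above (binary tournament selection, uniform crossover with a 50% chance per gene, crossover probability 0.8, mutation probability 0.01, complete replacement) can be sketched as one generation step. This is an illustrative sketch and not the authors' Matlab code: treating the mutation probability as a per-gene rate is an assumption, and the function names and the toy fitness (`sum`) are hypothetical.

```python
import random


def tournament(population, scores, rng):
    """Binary tournament: pick two individuals at random, keep the fitter one."""
    a, b = rng.randrange(len(population)), rng.randrange(len(population))
    return population[a] if scores[a] >= scores[b] else population[b]


def uniform_crossover(p1, p2, rng, p_cross=0.8):
    """With probability p_cross, swap each gene between parents with chance 0.5."""
    if rng.random() >= p_cross:
        return p1[:], p2[:]
    c1, c2 = [], []
    for g1, g2 in zip(p1, p2):
        if rng.random() < 0.5:
            g1, g2 = g2, g1
        c1.append(g1)
        c2.append(g2)
    return c1, c2


def mutate(chromosome, rng, p_mut=0.01):
    """Flip each bit independently with probability p_mut (assumed per-gene rate)."""
    return [1 - g if rng.random() < p_mut else g for g in chromosome]


def next_generation(population, fitness, rng):
    """Complete replacement: the offspring fully replace the parents."""
    scores = [fitness(c) for c in population]
    offspring = []
    while len(offspring) < len(population):
        c1, c2 = uniform_crossover(tournament(population, scores, rng),
                                   tournament(population, scores, rng), rng)
        offspring.append(mutate(c1, rng))
        if len(offspring) < len(population):
            offspring.append(mutate(c2, rng))
    return offspring


rng = random.Random(1)
population = [[rng.randint(0, 1) for _ in range(16)] for _ in range(20)]
population = next_generation(population, sum, rng)   # toy fitness: number of ones
print(len(population), len(population[0]))
```

In the setting of the paper the fitness argument would be the class-discrimination score of Eq. (1) and the population would hold 5000 chromosomes of 1000 genes each; the toy sizes here only keep the example fast.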
In Fig. 3 (Left), the green curve represents the ROC of classification based on adjacent bi-gram features selected by the document frequency-based method, which is known to have the best reported results. The blue curve represents the ROC of classification based on adjacent and non-adjacent bi-gram features selected by the proposed method. In both cases the same number of features is selected. As can be observed, the proposed method outperforms the document frequency-based method in terms of classification accuracy.

Fig. 3 (Middle) shows the ROC curves comparing the proposed method with the document frequency-based feature selection method [7] when non-adjacent features are also made available to that method. The green curve represents the ROC of classification based on both adjacent and non-adjacent bi-gram features selected by the document frequency-based method. The blue curve represents the ROC of classification based on adjacent and non-adjacent bi-gram features selected by the proposed method. In both cases the same number of features is selected. As can be observed, the proposed method again outperforms the document frequency-based method in terms of classification accuracy. As the figures above show, including non-adjacent features significantly improves the classification accuracy of the document frequency-based method.
Fig. 3. ROCs of the proposed method versus the document frequency-based method (Left). ROCs of the proposed method versus the document frequency-based method using the same set of extracted features (Middle). ROCs of the proposed method versus the information gain-based method using the same set of extracted features (Right).
In Fig. 3 (Right), the green curve represents the ROC of classification based on both adjacent and non-adjacent bi-gram features selected by the information gain-based method, which is known as one of the most accurate methods in the virology literature [8]. The blue curve represents the ROC of classification based on adjacent and non-adjacent bi-gram features selected by the proposed method. In both cases the same number of features is selected. As can be observed, the proposed method outperforms the information gain-based method in terms of classification accuracy.
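For readers who want to reproduce comparisons of this kind, an ROC curve can be computed from classifier scores as sketched below. The scores and labels are made up for illustration and have no relation to the reported results; the label convention (1 for malicious, 0 for benign) and the function names are assumptions.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        if i == len(ranked) - 1 or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.6]          # made-up classifier outputs
print(auc(roc_points(scores, [1, 1, 0, 0])))
print(auc(roc_points(scores, [1, 0, 1, 0])))
```

One curve dominates another when it lies above it at every false positive rate, which is the sense in which the curves in Fig. 3 are compared.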
5 Conclusion and Future Work

This paper proposed a new feature extraction technique based on n-gram analysis that uses non-adjacent sequences of bytes with different gap sizes, in addition to adjacent bytes, to extract better features for virus code detection. The proposed technique can capture important dependencies between non-adjacent byte sequences without requiring the high space and computational costs of extracting larger n-grams. The experimental results presented have confirmed the usefulness of this type of feature extraction. Alongside this feature extraction technique, a new feature selection method based on genetic algorithms was also proposed. Using a reference vector, the extracted bi-grams are processed in a predefined order. After every couple of runs of the genetic algorithm, a data clustering is performed to decide whether enough features have been selected. The set of selected features is finally used to represent data samples as binary vectors. These vectors are fed to a classifier that performs the main classification of executables into benign or malicious. Experiments were conducted to evaluate the efficiency of the proposed method. The results suggest that the proposed method significantly outperforms the best previously known methods in the virology field, with improvements in terms of both feature extraction and feature selection.
References

1. Mitchell, T.: Machine Learning. Prentice Hall, Englewood Cliffs (1997)
2. Schultz, M., Eskin, E., Zadok, E., Stolfo, S.: Data mining methods for detection of new malicious executables. In: Proceedings of the IEEE Symposium on Security and Privacy, pp. 38–49 (2001)
3. Abou-Assaleh, T., Cercone, N., Keselj, V., Sweidan, R.: Detection of new malicious code using n-grams signatures. In: PST, pp. 193–196 (2004)
4. Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: Fisher, D.H. (ed.) Proceedings of ICML 1997, 14th International Conference on Machine Learning, Nashville, pp. 412–420 (1997)
5. Kolter, J.Z., Maloof, M.A.: Learning to detect malicious executables in the wild. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 470–478 (2004)
6. Cohen, F.: Computer Viruses - Theory and Experiments. IFIP-TC11 Computers and Security 6, 22–35 (1987)
7. Reddy, D.K.S., Pujari, A.K.: N-gram analysis for computer virus detection. Journal in Computer Virology 2(3), 231–239 (2006)
8. Morin, B., Mé, L.: Intrusion detection and virology: an analysis of differences, similarities and complementariness. Journal in Computer Virology 3, 39–49 (2007)
9. Filiol, E.: Computer viruses: from theory to applications. Springer, New York (2005)
10. Adleman, L.M.: An Abstract Theory of Computer Viruses. In: Goldwasser, S. (ed.) CRYPTO 1988. LNCS, vol. 403, pp. 354–374. Springer, Heidelberg (1990)
11. Kolter, J.Z., Maloof, M.A.: Learning to detect malicious executables in the wild. The Journal of Machine Learning Research 7, 2721–2744 (2006)
12. Minaei-Bidgoli, B., Kortemeyer, G., Punch, W.F.: Optimizing Classification Ensembles via a Genetic Algorithm for a Web-Based Educational System. In: Fred, A., Caelli, T.M., Duin, R.P.W., Campilho, A.C., de Ridder, D. (eds.) SSPR&SPR 2004. LNCS, vol. 3138, pp. 397–406. Springer, Heidelberg (2004)
13. Breiman, L.: Arcing classifiers. The Annals of Statistics 26(3), 801–823 (1998)
14. http://vx.netlux.org/
A Robust Learning Model for Dealing with Missing Values in Many-Core Architectures
Noel Lopes¹,² and Bernardete Ribeiro¹,³
¹ CISUC - Center for Informatics and Systems of University of Coimbra, Portugal
² UDI/IPG - Research Unit, Polytechnic Institute of Guarda, Portugal
³ Department of Informatics Engineering, University of Coimbra, Portugal
Abstract. Most classification algorithms (e.g. support vector machines, neural networks) cannot directly handle Missing Values (MV). A common practice is to rely on data pre-processing techniques, either by using imputation or simply by removing instances and/or features containing MV. This seems inadequate for several reasons: the resulting models do not preserve the uncertainty, these techniques might inject inaccurate values into the learning process, the resulting models are unable to deal with faulty sensors, and data in real-world problems is often incomplete. In this paper we address the Missing Values Problem (MVP) by extending our recently proposed Neural Selective Input Model (NSIM), first, with a novel multi-core architecture implementation and, second, by validating the method in a real-world financial application. The NSIM encompasses different transparent and bound (conceptual) models, according to the multiple combinations of missing attributes. The proposed NSIM is applied to bankruptcy prediction of (healthy and distressed) French companies, yielding much better performance than previous approaches using pre-processing techniques. Moreover, the Graphics Processing Unit (GPU) implementation drastically reduces the time spent in the learning phase, making the NSIM an excellent choice for dealing with the MVP.
Keywords: Missing values, Neural Networks, GPU.
1 Introduction
Pattern recognition is an important area of research in the Machine Learning (ML) field, with a long and respectable history [11]. In particular, classification has received a great deal of attention from researchers. As a result, a large number of algorithms and approaches have been developed [5], supporting the emergence of successful real-world applications in a wide range of domains [4,3,12]. Classification algorithms attempt to discover the underlying relationship between a set of input variables and the desired (target) classes, based on a pool of instances (samples) that typically cover only a small portion of the input space. Thus, the quality of the resulting models (classifiers) depends not only on the algorithms being used, but also on the quality and quantity of the available data. Moreover, algorithms are usually designed based on the assumption that the data does not contain missing and/or invalid values.
However, in practice, the data samples obtained for many real-world problems are often incomplete and may contain a large number of unknown (missing) values. This is backed up by the fact that almost half (45%) of the datasets in the UCI Machine Learning Repository (widely used for benchmarking) contain missing values [4]. Thus, the ability to cope with missing data has become a fundamental requirement for pattern classification. Failure to handle missing data properly will most likely result in large errors and poor generalization [4]. Examples of situations where data may contain missing values include: survey questionnaires where people typically leave questions unanswered; industrial experiments where mechanical/electronic failures may happen during the data acquisition process; and medical diagnosis where different patients undergo different tests [4].
The remainder of the paper is organized as follows. In the next section we describe several techniques that are being used to handle the missing values problem (MVP) and present the contributions of our approach. In Section 3 we describe the proposed method, NSIM. Section 4 presents its GPU implementation and Section 5 the results obtained in a real-world problem of bankruptcy prediction of French companies using a dataset with thousands of samples with MV. Finally, in Section 6 conclusions and future work are addressed.
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 108–117, 2011. © Springer-Verlag Berlin Heidelberg 2011
2 Related Work with the Missing Values Problem (MVP)
Since many classification algorithms (e.g. Support Vector Machines (SVMs), Neural Networks (NNs)) cannot directly handle MV, a common practice is to rely on data pre-processing techniques to deal with them. Usually, this is accomplished by using imputation or simply by removing instances (samples) and/or features (attributes, variables) containing missing values [4,1,9,2,13,5]. A review of the methods and techniques to deal with this problem, including a comparison of some well-known methods, can be found in Laencina et al. [4].
Removing features and/or instances containing a large fraction of missing values is a common and appealing approach for dealing with MV, because it is a simple process and reduces the dimensionality of the data (therefore potentially reducing the complexity of the problem). However, for some problems the number of instances available is small and removing samples with missing values is simply not affordable. Furthermore, if the eliminated instances (those with missing observations) are not similar to the remaining patterns (instances), the resulting models could present poor generalization performance [9]. Likewise, removing features assumes that their information is either irrelevant or can be compensated for by other variables. However, this is not always the case, and features containing missing values may hold vital (critical) information which cannot be compensated for by the information contained in the other features.
An alternative to deleting data containing missing values consists of estimating their values. Many algorithms have been developed for this purpose (e.g. weighted k-nearest neighbor approach, Bayesian principal component analysis, local least squares) [4,13,1]. However, wrong estimates of crucial variables can substantially weaken the generalization capacity of the resulting model and lead to unpredictable and potentially dramatic results. Moreover, models created using imputed (estimated) data treat the imputed values as if they were the real ones (even though the true values are not known); therefore, the resulting conclusions do not show the uncertainty produced by the absence of such values. Furthermore, statistically, the variability or correlation estimations can be strongly biased [9].
Multiple imputation techniques (e.g. metric matching, Bayesian bootstrap) take into account the variability produced by the missing values, by replacing every missing value with two or more acceptable (plausible) values, representing a distribution of possibilities [9]. However, multiple imputation may drastically increase the size of the datasets and therefore the complexity of the problems, in particular when the number of missing values is high. Moreover, although the variability is taken into account, missing values will still be treated as if they were real. Furthermore, imputation methods were conventionally developed and validated under the assumption that missing values occur in a random manner. However, this assumption does not always hold in practice. In particular, in microarray experiments the distribution of missing entries is highly non-random due to technical and/or experimental conditions [13].
Recently we presented a new method that empowers the well-known Back-Propagation (BP) and Multiple Back-Propagation (MBP) algorithms with the capacity of directly handling missing values [8]. Instead of relying on data pre-processing techniques, the proposed method creates a Neural Selective Input Model (NSIM) that accommodates different transparent and bound NN models, providing the support for handling missing values efficiently. Unlike other methods, the models generated take into account and reflect the uncertainty caused by unknown values.
In this paper we extend our work by providing a Graphics Processing Unit (GPU) implementation of the NSIM. Furthermore, we also show that applying the NSIM to financial data of French companies improves the performance of the bankruptcy prediction model. The motivation is twofold. First, GPUs have proven able to decrease considerably the long training times associated with NNs [7]. Thus, extending the GPU implementation described in Lopes and Ribeiro [7] is fundamental to overcome one of the main limitations of NNs (their long training times) when applying the proposed method. Second, although the method previously yielded excellent results on several benchmarks [8], applying the NSIM to a real-world problem is important not only to validate it but also to demonstrate its usefulness.
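As a toy illustration of the pre-processing approaches reviewed in this section (not of the NSIM itself), the following Python sketch applies listwise deletion and mean imputation to made-up data. The helper names, the data and the use of NaN as the missing-value marker are assumptions chosen for illustration only.

```python
import math
import statistics

NAN = float("nan")  # missing-value marker (an assumption of this sketch)

def listwise_delete(rows):
    """Drop every instance (row) that has at least one missing value."""
    return [row for row in rows if not any(math.isnan(v) for v in row)]

def mean_impute(values):
    """Replace each missing entry by the mean of the observed entries."""
    observed = [v for v in values if not math.isnan(v)]
    mean = statistics.mean(observed)
    return [mean if math.isnan(v) else v for v in values]

# Toy feature with two missing entries (made-up numbers).
feature = [1.0, 2.0, NAN, 4.0, NAN, 6.0]
observed = [v for v in feature if not math.isnan(v)]
imputed = mean_impute(feature)

# Population variance before and after single (mean) imputation.
var_observed = statistics.pvariance(observed)
var_imputed = statistics.pvariance(imputed)

# Toy dataset: deletion keeps only the fully observed instances.
rows = [[1.0, NAN], [2.0, 5.0], [NAN, NAN], [4.0, 7.0]]
kept = listwise_delete(rows)
```

In this toy case the variance of the imputed feature is smaller than that of the observed values, which is one way the variability estimates mentioned above can become biased, and deletion discards half of the instances.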
3 Neural Selective Input Model (NSIM)
The building blocks of the NSIM are the selective activation (actuation) neurons, whose importance (contribution) to the NN depends on the pattern (stimulus) being presented [6]. For each neuron $k$, an importance factor, $m_k^p$, is used to define its relevance and contribution when pattern $p$ is presented to the network. Its output, $y_k^p$, is given by (1):
$$y_k^p = m_k^p F_k(a_k^p) = m_k^p F_k\Big(\sum_{j=1}^{N} w_{jk}\, y_j^p + \theta_k\Big), \tag{1}$$
where $F_k$ is the neuron activation function, $a_k^p$ its activation, $\theta_k$ the bias and $w_{jk}$ the weight of the connection between neuron $j$ and neuron $k$. The farther from zero $m_k^p$ is, the more important the neuron's contribution becomes. Conversely, a value of zero means the neuron is completely irrelevant for the network output, and one can interpret such a value as if the neuron were not present in the network. Considering the BP algorithm, the input weights associated with these neurons are updated using the same rule that is used for standard neurons, i.e. after presenting a given pattern $p$ to the network, the weights are adjusted by (2):
$$\Delta_p w_{jk} = \gamma\, \delta_k^p\, y_j^p + \alpha\, \Delta_q w_{jk}, \tag{2}$$
where $\gamma$ is the learning rate, $\delta_k^p$ the local gradient of neuron $k$, $\Delta_q w_{jk}$ the change of the weight $w_{jk}$ for the last pattern $q$ and $\alpha$ the momentum term. However, the equations of the local gradient for the output, $o$, and hidden, $h$, neurons, given respectively by (3) and (4), differ from the standard neuron equations (unless the importance factor is considered to be constant and equal to 1):
$$\delta_o^p = (d_o^p - y_o^p)\, m_o^p\, F_o'(a_o^p), \tag{3}$$
$$\delta_h^p = m_h^p\, F_h'(a_h^p) \sum_{o=1}^{N_o} \delta_o^p\, w_{ho}. \tag{4}$$
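Equations (1) to (4) can be sketched for single neurons as follows. This is a minimal illustration, assuming $F = \tanh$ for every neuron; the function names and the numeric values in the usage lines are hypothetical and are not taken from the MBP software.

```python
import math

def tanh_prime(a):
    """Derivative of tanh, used as F'(a) in this sketch."""
    return 1.0 - math.tanh(a) ** 2

def neuron_output(m, weights, inputs, theta):
    """Eq. (1): y = m * F(a), with a = sum_j w_j * y_j + theta. Returns (y, a)."""
    a = sum(w * y for w, y in zip(weights, inputs)) + theta
    return m * math.tanh(a), a

def output_gradient(d, y, m, a):
    """Eq. (3): delta_o = (d - y) * m * F'(a)."""
    return (d - y) * m * tanh_prime(a)

def hidden_gradient(m, a, deltas_out, weights_out):
    """Eq. (4): delta_h = m * F'(a) * sum_o delta_o * w_ho."""
    return m * tanh_prime(a) * sum(dl * w for dl, w in zip(deltas_out, weights_out))

def weight_change(gamma, delta, y_in, alpha, previous_change):
    """Eq. (2): learning-rate term plus momentum term."""
    return gamma * delta * y_in + alpha * previous_change

# A neuron with importance factor m = 0 produces no output and no gradient,
# so only the momentum term contributes to its weight change.
y_h, a_h = neuron_output(0.0, [0.5, -0.3], [1.0, 2.0], 0.1)
delta_h = hidden_gradient(0.0, a_h, [0.2], [0.7])
dw = weight_change(0.7, delta_h, 1.0, 0.5, 0.02)
```

With $m = 1$ the same functions reduce to the standard BP expressions, which matches the remark that the equations differ only when the importance factor is not constant and equal to 1.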
Let $V_i$ be a random variable with Bernoulli distribution representing the act of obtaining the value of $x_i$ ($V_i \sim \mathrm{Be}(p_i)$). To deal with missing values we propose transforming the values of $x_i$ by taking into account $V_i$, as shown in (5):
$$\tilde{x}_i = f(x_i, V_i). \tag{5}$$
This transformation can be carried out by a neuron, $k$, with selective activation (named selective input), containing a single input, $x_i$, and an importance factor $m_k$ identical to $V_i$, in which case (5) can be rewritten as (6) using (1):
$$\tilde{x}_i = V_i\, F_k(w_{ik} x_i + \theta_k). \tag{6}$$
When a given value $x_i$ cannot be obtained, the selective input associated with it will behave as if it did not exist, since $V_i$ will be zero. On the other hand, if the value of $x_i$ is available ($V_i = 1$), the selective input will actively participate in the determination of the network outputs. This can be viewed as if there were two different models, bound to each other, sharing information: one model for the case where the value of $x_i$ is known and another for the case where it cannot be obtained (is missing). Figure 1 shows the physical model (NSIM) of a network containing a selective input and the two conceptual models inherent to it. A network with $N$ selective inputs will have $2^N$ different models bound to each other and constrained in order to share information (network weights). It is guaranteed that all the models share at least $S$ parameters, $S$ being the number of weights that the network would have if the inputs with missing values were not considered at all [8].
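The behaviour of a selective input and the count of conceptual models can be sketched as follows. This is an illustration under stated assumptions: $F_k = \tanh$, a missing value is encoded as NaN, and the function names are hypothetical.

```python
import math
from itertools import product

def selective_input(x, w, theta):
    """Eq. (6) with F = tanh; a NaN input is treated as missing (V_i = 0)."""
    if math.isnan(x):                    # V_i = 0: the input behaves as if absent
        return 0.0
    return math.tanh(w * x + theta)      # V_i = 1: the input participates

def conceptual_models(n_selective_inputs):
    """All availability patterns (V_1, ..., V_N), one per conceptual model."""
    return list(product((0, 1), repeat=n_selective_inputs))

missing = selective_input(float("nan"), 0.8, 0.1)
known = selective_input(0.5, 0.8, 0.1)
models = conceptual_models(3)            # 2**3 availability patterns
```

The explicit branch on NaN is needed because multiplying NaN by zero would still give NaN; returning zero directly reproduces the effect of $V_i = 0$.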
[Figure 1: the physical model (inputs x1, x2 and x3, outputs y1 and y2, where x3 feeds a selective input neuron with weight wik, bias θk and a multiplier by Vi) and the two conceptual models: Model 1 when x3 is missing (V3 = 0) and Model 2 when the value of x3 is known (V3 = 1).]
Fig. 1. Physical and conceptual models of a network with a selective input (k = 3)
Although conceptually there are multiple models, from the point of view of the training procedure there is a single model (NSIM). When a pattern is presented to the network, only the parameters (weights) directly or indirectly related to the inputs with known values are adjusted (observe equations (3) and (4)). Thus, only the relevant (conceptual) models are adjusted [8]. The NSIM presents a high degree of robustness, since it is prepared to deal with faulty sensors. If the system that integrates the NSIM detects that a given sensor has stopped working, it can easily deactivate (discard) all the models inherent to that specific sensor by setting $V_i = 0$. Consequently, the best model available for the remaining, properly working sensors will be used.
4 GPU Parallel Implementation
Our GPU implementation of the method extends the BP and MBP implementation presented in Lopes and Ribeiro [7]. A total of three new kernels (special C functions that are executed in parallel on the GPU) were added to the CUDA (Compute Unified Device Architecture) implementation. In order to calculate the outputs of the selective input neurons, $\tilde{x}_i$, a kernel named FireSelectiveInputs was created. This kernel, whose code is shown in Figure 2, assumes that standard inputs may coexist with selective inputs. Thus, it should be launched with one thread per input and pattern (regardless of the type of input, selective or standard). Moreover, since our implementation considers the batch training mode, the $\tilde{x}_i$ variables are calculated simultaneously for all the training patterns (samples), and the threads should be grouped in blocks containing all the inputs of a given pattern. Of course, for standard inputs the value of $\tilde{x}_i$ must match the original input ($\tilde{x}_i = x_i$). Therefore, to differentiate standard inputs from the selective ones, the weights and bias of the standard inputs are set to zero; the kernel checks this condition to determine which type of input is being handled. Divergence is avoided when all the inputs are selective inputs. Thus, the maximum performance of this kernel is obtained when we treat all the inputs as selective inputs.

```cuda
#define NEURON      threadIdx.x   // one thread per input (neuron)
#define NUM_NEURONS blockDim.x    // number of inputs per pattern
#define PATTERN     blockIdx.x    // one block per training pattern

__global__ void FireSelectiveInputs(float * inputs, float * weights, float * bias, float * outputs)
{
    int idx = PATTERN * NUM_NEURONS + NEURON;

    float o = inputs[idx];
    if (isnan(o) || isinf(o)) { // missing value
        o = 0.0;
    } else {
        float w = weights[NEURON];
        float b = bias[NEURON];
        // zero weight and bias mark a standard input, which passes through unchanged
        if (w != 0.0 || b != 0.0) o = tanh(o * w + b);
    }
    outputs[idx] = o;
}
```
Fig. 2. FireSelectiveInputs kernel.

For the back-propagation phase two more kernels were created: CalcLocalGradSelectiveInputs and CorrectWeightsSelectiveInputs. The first one calculates the local gradients of the selective input neurons for all patterns and the latter is responsible for adjusting the weights of the selective input neurons. As in the case of the FireSelectiveInputs kernel, maximum performance is achieved when all the inputs are considered to be selective inputs. A complete and functional implementation of this method was integrated in the Multiple Back-Propagation software. The latest version of this software as well as its source code can be freely obtained at http://dit.ipg.pt/MBP. Moreover, the NSIM will also be included in GPUMLib, an open source GPU machine learning library (available at http://gpumlib.sourceforge.net/).
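For checking the kernel's results on small inputs, a CPU-side reference of the same logic can be handy. The sketch below mirrors the branches of FireSelectiveInputs in plain Python; the function name and the data layout (a list of patterns, each a list of input values) are assumptions of this sketch, and Python floats are double precision whereas the kernel works on single-precision values.

```python
import math

def fire_selective_inputs_reference(inputs, weights, bias):
    """CPU reference: inputs[p][n] is input n of pattern p; returns same-shaped outputs."""
    outputs = []
    for pattern in inputs:                     # one GPU block per pattern
        row = []
        for neuron, o in enumerate(pattern):   # one GPU thread per input
            if math.isnan(o) or math.isinf(o):
                o = 0.0                        # missing value
            else:
                w, b = weights[neuron], bias[neuron]
                if w != 0.0 or b != 0.0:
                    o = math.tanh(o * w + b)   # selective input
                # w == 0 and b == 0 mark a standard input: passes through
            row.append(o)
        outputs.append(row)
    return outputs

nan, inf = float("nan"), float("inf")
out = fire_selective_inputs_reference(
    inputs=[[0.5, nan], [inf, 2.0]],
    weights=[0.0, 1.0],   # input 0 is standard, input 1 is selective
    bias=[0.0, 0.0],
)
```

Comparing such a reference against the GPU output would need a tolerance that accounts for the difference in floating-point precision.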
5 Financial Distress Prediction
In recent years, due to the global financial crisis (triggered by the sub-prime mortgage crisis), the rate of insolvency has increased globally. As a result, investors are now more careful about entrusting their money. Moreover, determining whether or not firms are healthy is of major importance, not only to investors and stakeholders but also to everyone else who has a relationship with the analyzed firms (e.g. suppliers, workers, banks, insurance firms). Although this is a widely studied topic, estimating the real health of firms is becoming a much harder task, as companies become more complex and develop sophisticated schemes to conceal their real situation. In this context, automated pattern recognition systems that can accurately predict the risk of insolvency and warn, in advance, all those who may be affected by a bankruptcy process are of major importance. Furthermore, it is common to have incomplete observations (missing data) in financial and business applications [4]. Thus, this is an interesting problem on which to test (and validate) the proposed method for handling MV.
In this study, we used a large database of French companies, containing information on an ample set of financial ratios spanning a period of several years. The database contains information about 107,932 companies, out of which 1,653 became insolvent in 2006. The objective consists of discriminating between healthy and distressed companies based on the record of the financial indicators from previous years. For this purpose, we considered 29 financial ratios over the immediately preceding three years (see Table 1) as well as two more features, the number of employees and the turnover, totaling 89 features.

Table 1. Financial ratios used to create the bankruptcy model

Financial ratios
Financial Debt / Capital Employed (%)   Working Capital / Turnover (days)
Capital Employed / Fixed Assets         Net Current Assets / Turnover (days)
Depreciation of Tangible Assets (%)     Working Capital Needs / Turnover (%)
Working Capital / Current Assets        Export (%)
Current Ratio                           Value Added per Employee
Liquidity Ratio                         Total Assets / Turnover
Stock Turnover days                     Operating Profit Margin (%)
Collection Period                       Net Profit Margin (%)
Credit Period                           Added Value Margin (%)
Turnover per Employee                   Part of Employees (%)
Interest / Turnover                     Return on Capital Employed (%)
Debt Period (days)                      Return on Total Assets (%)
Financial Debt / Equity (%)             EBIT Margin (%)
Financial Debt / Cashflow               EBITDA Margin (%)
Cashflow / Turnover (%)

On average each financial ratio, for a given year, contained over 4% of missing values. However, some had almost a third of the data missing. More interestingly, if we consider only the data from distressed companies, then the average proportion of MV for the financial ratios rises to 42.35%. In fact, there are many features that contain less than a quarter of the data. We are unsure why this happens, but one possible explanation is that the affected firms could be trying to hide information from the markets. Nevertheless, this highlights the fact that knowing that some information is missing could be as important as knowing the information itself. Thus, in this respect our model is advantageous, since it preserves the missing information (unlike imputation methods). As expected, when looking at the data of each company (sample) we found similar results: overall, on average only 3 or 4 ratios are missing; however, when considering only the distressed firms, roughly 37 ratios per sample are missing. Moreover, there are companies for which all the ratios are unknown.
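The feature count and the two kinds of missing-value statistics quoted above (per feature and per sample) can be computed as in the following sketch. The constants come from the text; the helper names, the toy data and the NaN encoding of missing values are assumptions made for illustration.

```python
import math

N_RATIOS, N_YEARS, N_EXTRA = 29, 3, 2            # values stated in the text
N_FEATURES = N_RATIOS * N_YEARS + N_EXTRA        # 29 * 3 + 2 = 89

def missing_rate_per_feature(samples):
    """Fraction of samples in which each feature is missing (NaN)."""
    n = len(samples)
    return [sum(math.isnan(row[j]) for row in samples) / n
            for j in range(len(samples[0]))]

def missing_count_per_sample(samples):
    """Number of missing features in each sample."""
    return [sum(math.isnan(v) for v in row) for row in samples]

nan = float("nan")
toy = [                    # made-up data: 4 samples, 3 features
    [1.0, nan, 3.0],
    [nan, nan, 0.5],
    [2.0, 4.0, 1.5],
    [0.1, nan, 2.5],
]
rates = missing_rate_per_feature(toy)
counts = missing_count_per_sample(toy)
```

Applied to the real database, the first function would give per-ratio figures such as the 4% average, and the second the per-company counts such as the 3 or 4 missing ratios per sample.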
To create a workable and balanced dataset, we started by selecting all the instances of the database associated with the distressed companies whose number of unknown ratios did not exceed 70 (we considered that at least about 20% of the ratios should contain information). Thus, a total of 1524 samples associated with distressed companies were chosen. Then we selected the same number of samples associated with healthy companies, in order to obtain a balanced dataset in which the missing values were uniformly distributed across all the ratios. The resulting dataset contains 3048 instances, more than 5× the number of samples we were able to obtain in previous work [14,10] using imputation methods. The resulting dataset contains on average 27.66% of missing values per ratio. Moreover, on average 24 ratios per sample are missing.
Table 2 presents the results of the NSIM, with the MBP algorithm, using a 10-fold cross-validation. These far exceed the results previously obtained [14,10] when imputation techniques were used and demonstrate the validity and usefulness of the NSIM in a real-world setting. One of the strengths of the NSIM lies in the possibility of using data with a large number of missing values. This is important, because better (and more accurate) models can be built by incorporating and taking advantage of extra information. Moreover, instead of injecting inaccurate values into the system, as imputation methods do, the NSIM preserves the uncertainty caused by unknown values, increasing the utility of the model when relevant information is missing.

Table 2. Results of the NSIM for the bankruptcy problem

Metric        Results (%)      Metric        Results (%)
Accuracy      95.70 ± 1.42     Precision     95.82 ± 1.77
Sensitivity   95.60 ± 1.61     Recall        95.60 ± 1.61
Specificity   95.80 ± 1.83     F1 measure    95.70 ± 1.35

Figure 3 shows the speedups obtained, for the bankruptcy problem, using a GTX 280 GPU and a Core 2 Quad CPU Q9300 (2.5 GHz). These are consistent with the results previously obtained in Lopes and Ribeiro [7] and demonstrate the potential of the GPU to reduce significantly the long training times of NNs, a tedious (and expensive) task. Moreover, the GPU implementation scales better than the standalone counterpart, by taking advantage of additional parallel operations.

[Figure 3: speedup (×), on an axis from 80 to 180, plotted against the number of hidden layer neurons, from 20 to 100.]
Fig. 3. GPU speedups obtained for the bankruptcy problem
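For reference, the metrics reported in Table 2 follow the usual confusion-matrix definitions, sketched below. The function name and the counts in the usage line are hypothetical; they are not the counts obtained in the experiments, which the paper does not list.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "recall": sensitivity,
        "f1": f1,
    }

# Hypothetical counts, for illustration only.
m = classification_metrics(tp=90, fn=10, tn=80, fp=20)
```

Sensitivity and recall are the same quantity, which is why they share a value in Table 2. Plugging the mean precision (95.82) and recall (95.60) into the F1 formula gives about 95.71, close to the reported 95.70; an exact match is not expected if the reported values are averages over the ten folds.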
6 Conclusions and Future Work
The ability to deal properly with missing values has become a fundamental issue for pattern recognition, as data samples in many real-world problems are often incomplete. Failure to correctly handle missing data will most likely result in larger errors and inaccurate models with poor performance. Moreover, there are situations where sensors may fail, yet systems are expected to make decisions based on the available data. In this paper we addressed this problem by presenting a GPU implementation of an innovative method that integrates the capacity for directly handling MV in neural networks. The NSIM has several advantages compared to other methods: (i) it presents a higher degree of robustness, since the resulting models are able to deal with faulty sensors by selecting the best model available for the sensors working properly; (ii) it preserves the uncertainty caused by unknown values, instead of injecting inaccurate values into the system; (iii) data containing valuable information, which could otherwise be discarded due to a large number of MV, can now be incorporated into the models; and (iv) it prevents undesirable bias. This is validated in a real-world problem of bankruptcy prediction that attests to the quality and usefulness of the proposed method. Moreover, its GPU implementation is crucial to reduce the long training times associated with NNs, thus making this method more attractive. Future work will exploit selective input neurons in other types of neural networks.
Acknowledgment
FCT (Fundação para a Ciência e Tecnologia) is gratefully acknowledged for funding the first author with the grant SFRH/BD/62479/2009.
References
1. Aikl, L., Zainuddin, Z.: A comparative study of missing value estimation methods: Which method performs better? In: Proc. International Conference on Electronic Design (ICED 2008), pp. 1–5 (2008)
2. Ayuyev, V.V., Jupin, J., Harris, P.W., Obradovic, Z.: Dynamic clustering-based estimation of missing values in mixed type data. In: DaWaK 2009: Proceedings of the 11th International Conference on Data Warehousing and Knowledge Discovery, pp. 366–377. Springer, Heidelberg (2009)
3. Friedman, M., Kandel, A.: Introduction to Pattern Recognition: Statistical, Structural, Neural, and Fuzzy Logic Approaches. World Scientific, Singapore (1999)
4. García-Laencina, P., Sancho-Gómez, J.L., Figueiras-Vidal, A.: Pattern classification with missing data: a review. Neural Computing & Applications 19, 263–282 (2010)
5. Kotsiantis, S.B., Zaharakis, I.D., Pintelas, P.E.: Machine learning: a review of classification and combining techniques. Artif. Intell. Rev. 26(3), 159–190 (2006)
6. Lopes, N., Ribeiro, B.: Hybrid learning in a multi-neural network architecture. In: Proceedings of the International Joint Conference on Neural Networks (IJCNN 2001), vol. 4, pp. 2788–2793 (2001)
7. Lopes, N., Ribeiro, B.: GPU implementation of the multiple back-propagation algorithm. In: Corchado, E., Yin, H. (eds.) IDEAL 2009. LNCS, vol. 5788, pp. 449–456. Springer, Heidelberg (2009)
8. Lopes, N., Ribeiro, B.: A strategy for dealing with missing values by using selective activation neurons in a multi-topology framework. In: IEEE World Congress on Computational Intelligence, WCCI (2010)
9. López-Molina, T., Pérez-Méndez, A., Rivas-Echeverría, F.: Missing values imputation techniques for neural networks patterns. In: ICS 2008: Proceedings of the 12th WSEAS International Conference on Systems, pp. 290–295. World Scientific and Engineering Academy and Society, WSEAS (2008)
10. Ribeiro, B., Lopes, N., Silva, C.: High-performance bankruptcy prediction model using graphics processing units. In: IEEE World Congress on Computational Intelligence, WCCI (2010)
11. Ripley, B.D.: Pattern Recognition and Neural Networks. Cambridge University Press, New York (2008)
12. Tang, H., Tan, K.C., Yi, Z.: Neural Networks: Computational Models and Applications (Studies in Computational Intelligence). Springer-Verlag New York, Inc., Secaucus (2007)
13. Tuikkala, J., Elo, L., Nevalainen, O., Aittokallio, T.: Missing value imputation improves clustering and interpretation of gene expression microarray data. BMC Bioinformatics 9(1), 202 (2008)
14. Vieira, A.S., Duarte, J., Ribeiro, B., Neves, J.C.: Accurate prediction of financial distress of companies with machine learning algorithms. In: Kolehmainen, M., Toivanen, P., Beliczynski, B. (eds.) ICANNGA 2009. LNCS, vol. 5495, pp. 569–576. Springer, Heidelberg (2009)
A Model of Saliency-Based Selective Attention for Machine Vision Inspection Application
Xiao-Feng Ding¹, Li-Zhong Xu¹,²,*, Xue-Wu Zhang¹, Fang Gong¹, Ai-Ye Shi¹,², and Hui-Bin Wang¹,²
¹ College of Computer and Information Engineering, Hohai University, Nanjing 210098, China
² Institute of Communication and Information System Engineering, Hohai University, Nanjing 210098, China
Abstract. A machine vision inspection model of surface defects, inspired by the methodologies of neuroanatomy and psychology, is investigated. Firstly, the features extracted from defect images are combined into a saliency map. The bottom-up attention mechanism then obtains “what” and “where” information. Finally, the Markov model is used to classify the types of the defects. Experimental results demonstrate the feasibility and effectiveness of the proposed model, with a 94.40% probability of accurately detecting the existence of copper strip defects.
Keywords: Vision inspection, Surface defect, Saliency map, Selective attention, Markov model.
1 Introduction
Since the 1990s, with the rapid development of electronic technology and machine vision technology, machine vision inspection of surface defects has gradually become the most important non-destructive detection technology. The difficulties of machine vision inspection of surface defects lie mainly in defect feature extraction and defect classification. In traditional machine vision inspection, individual features such as gray features [1], geometry features [2] and texture features, as well as their combinations, are used to describe the defect images. Then, neural networks (NN) [4] or support vector machines (SVM) [5] classify the surface defects. These methods achieve surface defect detection and classification to a certain extent. However, in copper strip surface defect inspection, the copper strip surface is highly reflective, and different production technologies of copper strips lead to different types of surface defects. Furthermore, some kinds of defects are small and their intensities are similar to those of non-defective copper strip surfaces. The performance of traditional methods is poor in vision inspection of copper strips and cannot meet the high demands of quality control.
Human visual inspection can identify what kinds of defects are present and where they are located, quickly and effectively. It is robust to changes in the reflective intensity and the shapes of defects. A machine cannot identify variations of the same defect caused by different production technologies, or slight defects, neither of which is particularly difficult for human visual inspection. This paper is motivated by the need for an automated inspection technique that imitates human vision to detect and locate defects in copper strips.
Human vision has a strong ability in pattern recognition and image understanding. Humans have a remarkable ability to interpret complex scenes in real time. Intermediate and higher visual processes appear to select a subset of the available sensory information before further processing, most likely to reduce the complexity of scene analysis [8]. This selective attention is a vital process in vision; it facilitates the identification of important areas in a visual scene. It has been described as a spotlight, illuminating a particular region while neglecting the rest. Corbetta [12] has characterized this selection of a region as necessary because of “computational limitations in the brain’s capacity to process information and to ensure that behavior is controlled by relevant information”. Moreover, research on attention mechanisms in the brain has been useful in identifying areas of the visual system as well as their behavior and function. With the development of neuroscience, computational neuroscience and anatomy, research on the human visual perception system is constantly increasing. The computational models [6,7,8,9,11,13] for selective attention, in both biological and computer vision, are particularly useful in image understanding.
In this paper, we present a selective attention model combined with a saliency map for machine vision inspection of surface defects. The features of the defect images are extracted to obtain a saliency map [9]. Then, the observable Markov model is used in the task-driven attention mechanism. The model combines top-down attention with bottom-up attention, takes “what” and “where” information into account and then completes the surface defect inspection. The paper is structured as follows: in Section 2, the saliency-based selective attention model is given. Then, the experimental results on copper strips are reported. In the last section, we conclude and discuss future work.
* Corresponding author.
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 118–126, 2011. © Springer-Verlag Berlin Heidelberg 2011
2 Saliency-Based Selective Attention Model
In this section, we first obtain the saliency map of the defect images, then describe the acquisition of “what” and “where” information, and finally the integration of “what” and “where” information.
2.1 Saliency Map
Visual Feature Extraction. The choice of a specific set of features is not crucial. In this work we choose the feature decomposition proposed by Itti and Koch [9]. The image to process is first decomposed into an intensity map ($I$) and four broadly-tuned color channels ($R$, $G$, $B$, and $Y$):

$$
\begin{aligned}
I &= (r + g + b)/3 && \text{(intensity)}, \\
R &= [\tilde{r} - (\tilde{g} + \tilde{b})/2]^{+} && \text{(red)}, \\
G &= [\tilde{g} - (\tilde{r} + \tilde{b})/2]^{+} && \text{(green)}, \\
B &= [\tilde{b} - (\tilde{r} + \tilde{g})/2]^{+} && \text{(blue)}, \\
Y &= [(\tilde{r} + \tilde{g})/2 - |\tilde{r} - \tilde{g}|/2 - \tilde{b}]^{+} && \text{(yellow)},
\end{aligned}
$$

where $\tilde{r} = r/I$, $\tilde{g} = g/I$, $\tilde{b} = b/I$ and $[x]^{+} = \max(x, 0)$. $I$, $R$, $G$, $B$, and $Y$ are used to create Gaussian pyramids $I(\sigma)$, $R(\sigma)$, $G(\sigma)$, $B(\sigma)$ and $Y(\sigma)$, where $\sigma \in [0, \ldots, 8]$ is the scale. $I$ is also used to create Gabor pyramids $O(\sigma, \theta)$, where $\sigma \in [0, \ldots, 8]$ is the scale and $\theta \in \{0^\circ, 45^\circ, 90^\circ, 135^\circ\}$ is the preferred orientation.

Feature Conspicuity Map. The feature maps are obtained by calculating the center-surround differences between a “center” fine scale $c$ and a “surround” coarser scale $s$ of the extracted intensity, color and orientation features. This across-scale difference is denoted by $\ominus$. The intensity feature maps relate to intensity contrast, which in mammals is detected by neurons sensitive either to dark centers on bright surrounds or to bright centers on dark surrounds. Six intensity feature maps $I(c, s)$ are calculated, where $c \in \{2, 3, 4\}$, $s = c + \delta$ and $\delta \in \{3, 4\}$:

$$I(c, s) = |I(c) \ominus I(s)|. \tag{1}$$

The color feature maps are calculated by

$$Rg(c, s) = |(R(c) - G(c)) \ominus (G(s) - R(s))|, \tag{2}$$

$$By(c, s) = |(B(c) - Y(c)) \ominus (Y(s) - B(s))|. \tag{3}$$

The orientation feature maps are calculated by

$$O(c, s, \theta) = |O(c, \theta) \ominus O(s, \theta)|. \tag{4}$$
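The channel formulas and the center-surround scale pairs can be checked with a minimal per-pixel sketch. The function and constant names below are illustrative and not taken from the paper, and returning zeros for a black pixel is an assumption made here to avoid dividing by a zero intensity.

```python
def color_channels(r, g, b):
    """Return (I, R, G, B, Y) for one pixel using the broadly-tuned channel formulas."""
    intensity = (r + g + b) / 3.0
    if intensity == 0:
        # Assumption: a black pixel carries no hue, so every channel is zero.
        return 0.0, 0.0, 0.0, 0.0, 0.0
    rn, gn, bn = r / intensity, g / intensity, b / intensity

    def pos(x):
        # Positive part: [x]+ = max(x, 0)
        return max(x, 0.0)

    red = pos(rn - (gn + bn) / 2)
    green = pos(gn - (rn + bn) / 2)
    blue = pos(bn - (rn + gn) / 2)
    yellow = pos((rn + gn) / 2 - abs(rn - gn) / 2 - bn)
    return intensity, red, green, blue, yellow


CENTERS = (2, 3, 4)              # center scales c
DELTAS = (3, 4)                  # surround offsets, s = c + delta
ORIENTATIONS = (0, 45, 90, 135)  # Gabor orientations in degrees


def center_surround_pairs():
    """Enumerate the (c, s) scale pairs used for the feature maps."""
    return [(c, c + d) for c in CENTERS for d in DELTAS]


def feature_map_counts():
    """Count feature maps: one intensity, two color (Rg, By), four orientation per pair."""
    n = len(center_surround_pairs())
    return {"intensity": n, "color": 2 * n, "orientation": len(ORIENTATIONS) * n}
```

With these scale pairs the counts come out as 6 intensity, 12 color and 24 orientation maps, 42 in total, which agrees with the figures reported in the experimental section.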
Once the intensity, color and orientation feature maps have been computed, the three kinds of features are combined. Before this, a local iterative method is applied to each feature map: each feature map is first normalized to the same range and then convolved with a large two-dimensional difference-of-Gaussians filter,

$$S \leftarrow |S + S * \mathrm{DoG} - C_{inh}|_{>0}, \tag{5}$$

where

$$\mathrm{DoG}(x, y) = \frac{C_{ex}^{2}}{2\pi\sigma_{ex}^{2}} \exp\left(-\frac{x^{2} + y^{2}}{2\sigma_{ex}^{2}}\right) - \frac{C_{inh}^{2}}{2\pi\sigma_{inh}^{2}} \exp\left(-\frac{x^{2} + y^{2}}{2\sigma_{inh}^{2}}\right). \tag{6}$$

Here $*$ stands for the convolution operation, $|\cdot|_{>0}$ keeps only the positive part of the result, DoG is the difference-of-Gaussians function, $\sigma_{ex}$ and $\sigma_{inh}$ are the excitation and inhibition bandwidths, and $C_{ex}^{2}$ and $C_{inh}^{2}$ are the excitation and inhibition constants. The term $C_{inh}$ in (5) acts as an offset, so that the combination strategy broadly suppresses uniform regions, such as evenly distributed textures. Using the difference-of-Gaussians function for local iteration has two benefits. On the one hand, it detects the more significant targets; on the other hand, it resembles the organization of the primary visual cortex, with self-excitation at the center and long-range inhibitory connections in the surround. Therefore, it is physiologically plausible, and it can effectively suppress noise using multiple resolutions. After obtaining the intensity, color and orientation conspicuity maps $\bar{I}$, $\bar{C}$ and $\bar{O}$, the final saliency map is obtained as their average, that is

$$S = \frac{1}{3}(\bar{I} + \bar{C} + \bar{O}). \tag{7}$$

2.2 Acquisition of “what” and “where” Information
Once the saliency map is calculated, the intensity, color and orientation features are available and could be used directly as “what” information. However, in order to analyze the content of the fovea more effectively, an expert network composed of a single-layer perceptron for each region is used to obtain the “what” information. Its inputs are the feature vectors extracted from the information captured in the fovea. Its outputs are the posterior probability vectors over the information categories, which are treated as the “what” information in this paper. The single-layer perceptrons are trained by supervised learning. Selecting and shifting the attention focus determines the location and importance of the regions of interest. The competition among the various targets in the image is implemented by a winner-take-all mechanism. First, the winner-take-all neural network finds the attention focus in the saliency map and selects candidate regions to obtain the salient area; then an inhibition-of-return mechanism suppresses that area so that the next most salient point can be found and the attention focus shifted. Ordered by the time at which they are visited in the simulation, these points form a scanning path, which is treated as the “where” information flow.

2.3 Integration of “what” and “where” Information
A discrete observable Markov model is used to connect the saliency map with the “what” and “where” information in the combining layer. The regions visited by the attention focus, treated as “where” information, are used as the states of the Markov model, and the outputs of the expert network, treated as “what” information, are used as the observations. The sequence of focus shifts for each sample in the training set forms a scanning path in time order, which corresponds to a Markov chain for the category of that training sample. The model adjusts the probabilities of each Markov chain based on the “what” and “where” information so as to maximize the likelihood of the scanning paths of the training samples of its category, and recognition is implemented by selecting the class with the largest posterior probability.

Observable Markov Model. In the training process, the Markov model simulates a certain number of scanning paths. Every state can therefore be observed, and the state transition probabilities $a_{ij}$ and the initial distribution $\pi_i$ can be obtained by counting. Similarly, the observation probabilities $b_j(k)$ are obtained from the outputs of the expert network in each state of each sample. The three parameters are calculated by:

$$a_{ij} = \frac{\text{number of transitions from } s_i \text{ to } s_j}{\text{number of transitions starting from } s_i}, \tag{8}$$

$$\pi_i = \frac{\text{number of state sequences starting in } s_i \text{ at } t = 1}{\text{total number of observation sequences}}, \tag{9}$$

$$b_j(k) = \frac{\text{number of times } o_k \text{ is observed in } s_j}{\text{total number of visits to } s_j}. \tag{10}$$
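The counting estimates in (8)-(10) can be sketched as follows. This is only an illustration under stated assumptions: each scanning path is represented as a list of `(state, observation)` pairs, and the observation is a discrete symbol (for instance the most probable class reported by the expert network), which is a simplification introduced here. The name `estimate_parameters` is hypothetical.

```python
from collections import Counter, defaultdict


def estimate_parameters(paths):
    """Estimate (pi, a, b) by counting over fully observed scanning paths.

    paths: list of scanning paths, each a list of (state, observation) pairs.
    """
    start = Counter()             # how often a path starts in each state
    trans = defaultdict(Counter)  # trans[s_i][s_j]: transitions s_i -> s_j
    emit = defaultdict(Counter)   # emit[s_j][o_k]: times o_k is observed in s_j
    for path in paths:
        if not path:
            continue
        start[path[0][0]] += 1
        for state, obs in path:
            emit[state][obs] += 1
        for (state, _), (next_state, _) in zip(path, path[1:]):
            trans[state][next_state] += 1

    n_paths = sum(start.values())
    pi = {s: c / n_paths for s, c in start.items()}                 # Eq. (9)
    a = {s: {t: c / sum(row.values()) for t, c in row.items()}      # Eq. (8)
         for s, row in trans.items()}
    b = {s: {o: c / sum(row.values()) for o, c in row.items()}      # Eq. (10)
         for s, row in emit.items()}
    return pi, a, b
```

One such parameter set would be estimated per category from the training paths of that category.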
The joint probability of an observation sequence and a state sequence is

$$P(O, S \mid \lambda) = \pi_{S_1} b_{S_1}(O_1) \prod_{i=2}^{n} a_{S_{i-1} S_i}\, b_{S_i}(O_i), \tag{11}$$

where $S$ is the state sequence, $O$ is the observation sequence, $\lambda = \{\pi_i, a_{ij}, b_j(k)\}$ is the parameter set of the Markov chain, $i, j = 1, \ldots, N$ are indices for states, and $k = 1, \ldots, M$ is the index for the observation symbols. Let $C$ be the category with the highest probability; then

$$P(O, S \mid \lambda_C) = \max_{j} P(O, S \mid \lambda_j), \tag{12}$$

where $j$ in (12) ranges over the categories.
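Equations (11) and (12) amount to multiplying one transition term and one observation term per visited region, then choosing the category whose model gives the largest product. A minimal sketch, assuming each model is a tuple `(pi, a, b)` of nested dictionaries (a representation chosen here for illustration, not the authors' data structure):

```python
def path_probability(path, model):
    """Joint probability P(O, S | lambda) of a scanning path, as in Eq. (11).

    path:  list of (state, observation) pairs in visiting order.
    model: (pi, a, b) with pi[s], a[s_prev][s] and b[s][o] as probabilities.
    Events missing from the model are treated as having probability zero.
    """
    pi, a, b = model
    prob = 1.0
    prev = None
    for state, obs in path:
        if prev is None:
            move = pi.get(state, 0.0)               # initial distribution
        else:
            move = a.get(prev, {}).get(state, 0.0)  # state transition
        prob *= move * b.get(state, {}).get(obs, 0.0)
        prev = state
    return prob


def classify(path, models):
    """Pick the category whose model maximizes the path probability, Eq. (12)."""
    return max(models, key=lambda c: path_probability(path, models[c]))
```

For long paths a practical implementation would normally sum log-probabilities to avoid underflow; the direct product is kept here to mirror (11).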
Dynamic Fovea. An advantage of using the Markov model is that the number of scans is controllable. During identification, an image can be classified correctly after a limited number of focus shifts, without every region having to be attended. After each scan, the posterior probability of each class can be obtained from the Markov model. At time $t$, after $t$ regions have been attended, the probability of the partial sequence under the Markov model of category $C$ is recorded as $a_t(c)$:

$$a_t(c) = P(O_1, \ldots, O_t, S_1, \ldots, S_t \mid \lambda_c), \tag{13}$$

where $O_1, \ldots, O_t$ is the observation sequence up to time $t$, $S_1, \ldots, S_t$ is the state sequence, and $\lambda_c$ is the parameter set of the Markov model of category $C$. When the probability reaches the decision confidence, the focus stops shifting. At time $t$, the posterior probability that the image belongs to category $C$ can be defined as

$$a_t^{*}(c) = P(C \mid O_1, \ldots, O_t, S_1, \ldots, S_t) = \frac{a_t(c)}{\sum_{j=1}^{k} a_t(j)}, \tag{14}$$

where $k$ here denotes the number of categories. Let the confidence threshold be $\tau$. Then the criterion for the focus to stop shifting is $a_t^{*}(c) \geq \tau$, $\tau \in [0, 1]$.
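The stopping rule can be sketched as an incremental loop that updates $a_t(c)$ for every category after each scan and stops as soon as the normalized value in (14) reaches $\tau$. The model representation `(pi, a, b)` and the function names are assumptions made for this illustration. The normalization in (14) treats all categories as equally likely a priori, and the sketch follows that.

```python
def step_probability(prev, state, obs, model):
    """Probability contributed by one scan: transition term times observation term."""
    pi, a, b = model
    move = pi.get(state, 0.0) if prev is None else a.get(prev, {}).get(state, 0.0)
    return move * b.get(state, {}).get(obs, 0.0)


def dynamic_fovea(path, models, tau):
    """Scan until some category reaches posterior >= tau.

    Returns (category, scans_used). If the threshold is never reached, the best
    category after the last scan is returned; if no model can explain the path,
    the category is None.
    """
    partial = {c: 1.0 for c in models}  # a_t(c), Eq. (13)
    prev = None
    best, scans = None, 0
    for scans, (state, obs) in enumerate(path, start=1):
        for c, model in models.items():
            partial[c] *= step_probability(prev, state, obs, model)
        prev = state
        total = sum(partial.values())
        if total == 0.0:
            return None, scans
        best = max(partial, key=partial.get)
        if partial[best] / total >= tau:  # a*_t(c) >= tau, Eq. (14)
            break
    return best, scans
```

A lower threshold stops earlier at the cost of less certain decisions, which is the trade-off between scanning number and accuracy that the experiments report.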
3 Experimental Results and Analysis
To verify the feasibility of the proposed approach, experimental simulations are carried out on the image library from XINGRONG Copper Corporation in Changzhou, Jiangsu Province. This image library contains 1600 copper strip surface images of 640 × 480 pixels. There are 6 types of defects: cracks, burrs, scratches, holes, pits and buckles. The library comprises 200 images of each defect type, 200 non-defect images and 200 smearing “false defect” images. In the experiments, a narrow LED lighting device, model LT-191 X 18 from Dongguan technology co., and a CCD industrial camera, model JAI CV-A1, are used to collect the copper strip images. We start by obtaining the saliency maps of defect images. First, by decomposing each defect image at different scales with Gaussian and Gabor pyramids, 9 brightness features, 36 color features and 36 orientation features are obtained. From these 81 features, 42 feature maps (6 brightness feature maps, 12 color feature maps and 24 orientation feature maps) are obtained by calculating the center-surround difference between a center fine scale c and a surround coarse scale s. Then, the local iteration strategy is used to obtain the intensity, color and orientation conspicuity maps, as shown in Fig. 1. Because the images used in this paper are static, the flicker feature maps do not contain any salient areas. In the experiments, the 42 feature maps are taken as the input of the local neural network (here a single-layer perceptron), and the output of the perceptron is a 10-D class posterior probability vector, treated as “what” information in this paper. The local neural network is used to reduce the complexity of the system and to improve the classification accuracy.
Fig. 1. Conspicuity maps of smearing. From left to right and top to bottom: original image, attention map, and conspicuity maps for color contrasts, flicker contrasts, intensity contrasts and orientation contrasts.
Table 1. The performances of the observable Markov model and the dynamic central fovea

Method                     Average scanning number   Accuracy rate (%)
                                                     Training   Testing
Observable Markov model    5                         97.45      94.40
Dynamic central fovea      3.6                       93.56      89.52

Table 2. Classification accuracy of surface defects detection using observable Markov model

Type of defects   Correct number   Error number   Accuracy rate (%)
Smearing          192              8              96.00
Cracks            186              14             93.00
Burrs             193              7              96.50
Scratch           185              15             92.50
Holes             195              5              97.50
Pits              192              8              96.00
Buckles           179              21             89.50
Total             1322             78             94.40
The scanning number of the dynamic central fovea is not fixed in advance; it depends on the structure of the observable Markov model. The performances of the observable Markov model and the dynamic central fovea are given in Table 1. Table 1 shows that the classification accuracy of the dynamic central fovea is lower than that of the observable Markov model (89.52% versus 94.40% on the test set). However, the observable Markov model needs 5 scans on average, whereas the dynamic fovea completes the classification with 3.6 scans on average. Using the dynamic central fovea can therefore considerably improve real-time performance. Table 2 presents the classification accuracy of surface defect detection using the observable Markov model. This method achieves a high recognition rate for all typical copper strip surface defects, ranging from 89.5% to 97.5%. Furthermore, even when the defect features differ only slightly from the features of non-defect images, as for scratches and buckles, the accuracy still reaches 92.5% and 89.5%, respectively.
4 Conclusions

This paper investigates a model of saliency-based selective attention for machine vision inspection of copper strip surface defects. The proposed method is capable of detecting copper strip surface defects even though the copper strip surface is highly reflective and different production technologies lead to different types of surface defects. The experimental results show that the proposed method improves the classification ability of the surface defect inspection system and meets the accuracy requirement. In this paper, we only consider static stimuli for the saliency map detector. An interesting future direction is to consider dynamic scenes for copper strip surface inspection. We are also considering applying the proposed method to the quality inspection of other surface defects.
Acknowledgments. This work is partly supported by the National Natural Science Foundation of China (No. 60872096), the National Natural Science Foundation of Jiangsu Province of China (No. BK2009352) and the Fundamental Research Funds for the Central Universities (No. 2009B20614).
References

1. Zheng, H., Kong, L., Nahavandi, S.: Automatic Inspection of Metallic Surface Defects using Genetic Algorithms. Journal of Materials Processing Tech. 125, 427–433 (2002)
2. Liang, R., Ding, Y., Zhang, X., Chen, J.: Copper Strip Surface Defects Inspection Based on SVM-RBF. In: 4th International Conference on Natural Computation, pp. 41–45. IEEE Press, New York (2008)
3. Zhong, K.-H., Ding, M.-Y., Zhou, C.-P.: Texture Defect Inspection Method using Difference Statistics Feature in Wavelet Domain. Systems Engineering and Electronics 26, 660–665 (2004)
4. Zhang, X., Liang, R., Ding, Y., Chen, J., Duan, D., Zong, G.: The System of Copper Strips Surface Defects Inspection Based on Intelligent Fusion. In: 2008 IEEE International Conference on Automation and Logistics, pp. 476–480. IEEE Press, New York (2008)
5. Li, T.-S.: Applying Wavelets Transform, Rough Set Theory and Support Vector Machine for Copper Clad Laminate Defects Classification. Expert Systems with Applications 36, 5822–5829 (2009)
6. Luo, S.-W.: Information Processing Theory of Visual Perception. Publishing House of Electronics Industry, Beijing (2006)
7. Noton, D., Stark, L.: Eye Movements and Visual Perception. Scientific American 224, 35–43 (1971)
8. Didday, R., Arbib, M.: Eye Movements and Visual Perception: A Two Visual System Model. International Journal of Man-Machine Studies 7, 547–570 (1975)
9. Itti, L., Koch, C., Niebur, E.: A Model of Saliency-based Visual Attention for Rapid Scene Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 1254–1259 (1998)
10. Rimey, R., Brown, C.: Selective Attention as Sequential Behavior: Modeling Eye Movements with An Augmented Hidden Markov Model. Department of Computer Science, University of Rochester (1990)
11. Salah, A., Alpaydin, E., Akarun, L.: A Selective Attention-based Method for Visual Pattern Recognition with Application to Handwritten Digit Recognition and Face Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 420–425 (2002)
12. Corbetta, M.: Frontoparietal Cortical Networks for Directing Attention and The Eye to Visual Locations: Identical, Independent, or Overlapping Neural Systems? Proc. Natl. Acad. Sci. USA 95, 831–838 (1998)
13. Vazquez, E., Gevers, T., Lucassen, M., Weijer, J., Baldrich, R.: Saliency of Color Image Derivatives: A Comparison between Computational Models and Human Perception. J. Opt. Soc. Am. A 27, 613–621 (2010)
Grapheme-Phoneme Translator for Brazilian Portuguese

Danilo Picagli Shibata and Ricardo Luis de Azevedo da Rocha

Escola Politécnica, Universidade de São Paulo, Brazil
[email protected],
[email protected]
Abstract. This work presents an application for grapheme-phoneme translation for Portuguese language texts based on adaptive automata. The application has a module for grapheme-phoneme translation of words as its core, and input texts are transformed into sequences of words (including numbers, acronyms, etc.) that are used as input for the word translation module. The word translation module separates words into sequences of tokens and analyzes their behavior considering stress and influences from adjacent tokens. The paper begins with an overview of the word translation method based on adaptive automata, presents the application for text translation and ends with results of translation tests using texts from Brazilian newspapers.

Keywords: Adaptive Automata, Brazilian Portuguese, Grapheme-Phoneme Translation, Natural Language Processing.
1 Introduction

Text-to-speech translation (TTS) has been an important topic of study within Natural Language Processing. TTS is often divided into two parts: text-to-phoneme translation (TTP), where input text is translated to a phonetic representation, and speech synthesis, where the phonetic representation is transformed into speech. Multiple approaches to TTP can be found, including methods based on grapheme-phoneme translation or letter-to-phoneme (L2P) translation, in which phonetic representations are derived from words and letters respectively. This paper presents an application for grapheme-phoneme translation for the Portuguese language based on adaptive automata [1]. The application translates texts written in Portuguese to phonetic sequences similar to those spoken in São Paulo, Brazil, but may be changed to adhere to different variations of Portuguese and, furthermore, to different languages, such as Spanish, that hold major similarities with Portuguese. Rule-based methods may not be the best fit for processing some natural languages, especially the ones that have highly irregular rules for letter-to-phoneme translation, but that is not the case for Portuguese, whose rules are quite regular. Rule-based methods yield good results for this language, as stated in [2] and shown by the results of [3], [4] for European Portuguese and [5] for Brazilian Portuguese.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 127–135, 2011. © Springer-Verlag Berlin Heidelberg 2011
1.1 Brazilian Portuguese of São Paulo (SPP)
Portuguese is a widespread language, spoken in different countries and continents with different accents. The Portuguese of São Paulo has been chosen as the target of this study for familiarity with that variety and also because it is spoken in the most populous dialectal region of Brazil [6]. São Paulo is still quite large and its population is largely composed of immigrants; therefore it was necessary to standardize the expected output. The standard output for the presented method is based on an illustration of the urban variety of the São Paulo State dialectal region [6]. The sound rules presented in this paper may give an idea of how SPP sounds, but [6] should be consulted for the complete set of rules.
2 Word Translation

The core of the application is a word translator based on adaptive automata. This word translator module is an implementation of the translation method presented in [5] and [7]. This section presents an overview of the translation method. Translation begins with lexical analysis, where words are divided into tokens similar to the syllables of the Portuguese language [8]. The tokens are then passed to the adaptive automaton, which handles three issues concerning context sensitivity: the stress of the token, which defines whether the sound of a token should be emphasized over other sounds in the same word, and the influences a token receives from its previous and next tokens, which may change the sound of the token.

2.1 Lexical Analysis
Lexical analysis is the first part of the translation process. It separates input words into sequences of tokens that are handled by the adaptive automaton and translated into phonetic sequences considering the appropriate context sensitivity issues. The lexical analyzer rules are based on the syllabic separation rules for the Portuguese language defined in [7]. The full set of rules can be found in [6]. Table 1 presents examples of words that are separated differently by the two sets of rules. The main difference between the lexical analyzer and the syllabic separation for Portuguese lies in the separation of adjacent vowels that differ from each other. While the separation rules state that separation is conditional on context and that vowels are not separated if they form diphthongs or triphthongs, the lexical analyzer never separates adjacent vowels that can become diphthongs or triphthongs; that separation is made by the automaton when context is analyzed.

Table 1. Word separation example

Word     Separation   Lexical
Sabia    Sa-bi-a      Sa-bia
Piano    Pi-a-no      Pia-no
Aerado   A-e-ra-do    Ae-ra-do

2.2 Adaptive Automaton
After the completion of lexical analysis, the generated sequence of tokens is used as the input for an automaton that translates it into sequences of phonetic symbols, taking into account the context sensitivity issues mentioned before. As a result of the execution there may be one or more acceptable phonetic representations for the input sequence. Sometimes only one of these representations is actually used by Portuguese speakers, but there are cases in which more than one representation is correct and disambiguation must be done through context.

Symbols. Symbols used by the automaton are divided into three sets. Tokens are the input symbols generated by the lexical analyzer and represent parts of the analyzed word. Context symbols are internal symbols written and read by the automaton to treat context sensitivity issues in a word. Markup symbols are used by adaptive actions to search for transitions that indicate places where other translations should be inserted. Context symbols are divided into three subgroups: forward influence symbols, which define the influence a token exerts on the following token; backward influence symbols, which define the influence a token exerts on the preceding token; and stress symbols, which define stress for a token.

Forward influence symbols define the influences a token exerts on its following token. These symbols are represented by the Greek letter α and are also referred to as α-symbols. Forward influence symbols indicate whether the last character of the influencing token is a vowel or a consonant. Forward influence is not frequent in Portuguese: only tokens that begin with ’r’ or ’s’ followed by vowels are subject to this type of influence.

Backward influence symbols define the influences a token exerts on its preceding token. These symbols are represented by the Greek letter π and are also referred to as π-symbols. Backward influence symbols indicate the characteristic of the first sound in the influencing token, such as fricative, nasal, voiced or unvoiced consonant, among others. Contrary to forward influence, backward influence is common and a significant number of tokens are subject to it.

Stress symbols define stress for a token. These symbols are represented by the Greek letter τ and are also referred to as τ-symbols. Stress symbols indicate whether tokens are stressed or unstressed, and a special symbol indicates that the token is the last of a word and triggers the process of defining stress for all tokens in the word.
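The two kinds of influence symbols can be illustrated with a small sketch that derives, for each token, the α-symbol it reads from its predecessor and the π-symbol it reads from its successor. Everything here is a hypothetical simplification: the sound classes are deliberately coarse and incomplete, the symbols are plain strings rather than automaton symbols, and tokens are assumed to be non-empty lowercase strings.

```python
VOWELS = set("aeiouáâãàéêíóôõú")
# Hypothetical, deliberately coarse classes for the first sound of a token.
NASALS = set("mn")
FRICATIVES = set("fvszxj")


def forward_symbol(token):
    """alpha-symbol a token emits toward the next token: how it ends."""
    return "vowel" if token[-1] in VOWELS else "consonant"


def backward_symbol(token):
    """pi-symbol a token emits toward the previous token: class of its first sound."""
    first = token[0]
    if first in VOWELS:
        return "vowel"
    if first in NASALS:
        return "nasal"
    if first in FRICATIVES:
        return "fricative"
    return "other"


def reads_forward_influence(token):
    """Only tokens starting with 'r' or 's' followed by a vowel read an alpha-symbol."""
    return len(token) > 1 and token[0] in "rs" and token[1] in VOWELS


def context_symbols(tokens):
    """Return (token, alpha_read, pi_read) triples; None means no influence is read."""
    result = []
    for i, token in enumerate(tokens):
        alpha = None
        if i > 0 and reads_forward_influence(token):
            alpha = forward_symbol(tokens[i - 1])
        pi = backward_symbol(tokens[i + 1]) if i + 1 < len(tokens) else None
        result.append((token, alpha, pi))
    return result
```

For the tokens `ca` and `sa`, the second token reads a "vowel" α-symbol from the first, and the first reads a "fricative" π-symbol from the second. Stress (τ-symbols) is left out of this sketch.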
Sub Machines. The adaptive automaton is divided into two sub-machines: Recognizer and Translator. The Recognizer reads input symbols and executes adaptive actions that change the structure of the Translator in order to comply with the rules used by Portuguese speakers to read the analyzed word. When the input sequence is over, there is a sub-machine call to the Translator, which defines the valid phonetic representations for the input word.

Adaptive Functions. Changes to the Translator are executed by adaptive actions that are executed when a token is read by the Recognizer. These adaptive actions are composed of sequences of adaptive function calls. The adaptive functions were designed with the purpose of executing small changes to the Translator, such as adding or removing a particular transition, and their calls are arranged in blocks that change the Translator in a structured manner. The following adaptive functions were designed:

1. dm: indicates where to create new transitions for the automaton.
2. rt: creates transitions to analyze stress.
3. ina: creates transitions to generate backward influence.
4. inp: creates transitions to generate forward influence.
5. ida: creates transitions to read forward influence.
6. idp: creates transitions to read backward influence.
7. som: creates transitions that define the phonetic representation of a token.
8. am: erases a markup transition.
9. af: prepares the Translator to be executed.
10. ra: recognizes the existence of acute or circumflex accents in the token.
Order. Sequences of adaptive function calls are divided into two blocks. The first block creates transitions that define stress and the influences on adjacent tokens, while the second block creates transitions that read influences from adjacent tokens and transitions that define the sounds that represent the token. The second block is composed of multiple forward influence blocks, which in turn are composed of multiple backward influence blocks. Forward and backward influence blocks are sequences of function calls that create the transitions that handle one specific influence value. Figure 1 presents, as an example, the sequence of adaptive function calls executed when the token sa is read by the Recognizer. The leftmost column contains the calls that compose the first block, where stress and generated influence rules are defined. Calls from the other columns compose the second block, which contains two α-blocks, each divided into three π-blocks.

Fig. 1. Adaptive action for token sa

Parameters. The values passed as parameters of adaptive function calls are mostly context symbols that will be read or written by the created transitions. During Translator execution they define how context sensitivity issues are handled for the analyzed tokens. Markup symbols are passed to search for transitions, and output symbols are passed to define the sounds of tokens. There are three combinations of function calls that define the stress rules for a token. The parameters in these three combinations separate tokens into three sets concerning stress rules: tokens that are unstressed when they are the last of a word, tokens that are stressed when they are the last of a word without acute or circumflex accents, and tokens with acute or circumflex accents. Parameters for calls that define the influences a token exerts on adjacent tokens are based on characteristics of the influencing token. Forward influence is defined by whether the last character of the influencing token is a vowel or a consonant; backward influence is defined by the characteristics of the first sound of the influencing token (nasal, fricative, voiced, unvoiced, etc.). Parameters for calls that define the influences a token receives from adjacent tokens are based on characteristics of the influenced token. For each relevant forward influence there should be an α-block that handles that influence, and for each relevant backward influence there should be a π-block that handles that influence nested inside each α-block. For the example presented in Figure 1, the parameter values for the calls that compose the first block define that sa is unstressed when final, starts with a fricative consonant and ends with a vowel. In the second block, the forward influence blocks define that ’s’ sounds like [z] when it follows a vowel and like [s] otherwise, while the backward influence blocks define that ’a’ is nasalized when the token is stressed and followed by a nasal consonant, is voiceless when it is a final token, and sounds like [a] otherwise.

2.3 Example
Figure 2 presents the structure of the Translator sub-machine during the translation of the word casa. States and transitions are represented with the usual automata notation. The labels on transitions mean that they consume the symbol before the comma, write the symbol after the comma (omitted if nothing is written) into the input, and write the sequence between brackets to the output. The structure consists of two cyclic blocks of transitions that represent the tokens ca and sa that compose the word. The execution consists of two passes over each block: the first pass (below, from right to left) defines stress and backward influences; the second pass (above, from left to right) defines forward influence and resolves stress and influences to translate the tokens. Transitions used during execution are highlighted in red.

Fig. 2. Translation of word casa

Table 2 presents the use of the token sa in different contexts. Blocks are structured in rows (α) and columns (π), defining different sounds for the prefix and the nucleus of the token. Stress (τ) is also considered in the columns. The original work [5] can be consulted for a step-by-step explanation of the adaptive process that changes the Translator from its initial configuration to the configurations that represent words, and for other translation examples presenting the behavior of different types of tokens in different contexts.

Table 2. Different contexts for token sa

2.4 Disambiguation
The rules for reading some graphemes of Portuguese may not clearly define the correct reading or may sometimes allow more than one correct reading. The sound of ’x’ at the start of a token is not clearly defined, and the correct sound depends on the origin of the word. Words with ’e’ and ’o’ in the stressed syllable may have two different readings depending on the context. In these cases the automaton generates a set of phonetic representations that may be used for the given word, and an auxiliary method is used to define which of the representations in the set will be used as the output. Two sets of disambiguation rules were used, choosing the most probable phonetic representation considering morphological characteristics of the word, with and without its part of speech.
3 Text Translator

The Portuguese Grapheme-Phoneme Translator (PGPT) is an application that translates texts written in Portuguese to phoneme sequences that represent the speech of a native of São Paulo reading the input text. The output of text translation is generated according to IPA standards using Unicode symbols based on the appropriate Unicode chart. The translator is based on a word translation module. The implementation of the word translation module is based on the word translation method presented in the previous section, but changes were made to increase execution performance, avoid excessive memory usage and decrease loading time. The word translation module is surrounded by other modules that treat input texts, replacing acronyms, numbers and other complex structures with sequences of words that are translated one by one into phonetic sequences. If multiple translations are generated by the automaton, a disambiguation module chooses one of these translations and sends it to the output stream. The application was implemented on the Java 5 platform with graphical and command line interfaces for translation of user input and files respectively. The lexical analyzer was implemented with Java’s regular expression package (java.util.regex). It receives words as input and splits them into substrings that represent the tokens that will be used as input for the automaton. An API for adaptive automata execution was implemented, and the adaptive actions of the Recognizer and the Transducer structure were built over this API. While the model supposes the preexistence of a Recognizer sub-machine that handles all possible tokens, in the application this structure is built during translation. Whenever a token is recognized, the adaptive action is built and stored in a hash map under that token for reuse. The Transducer is an exact reproduction of the one presented in the translation methodology, built over the automata execution API.
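The two implementation ideas above, regular-expression tokenization and on-demand construction of adaptive actions stored in a hash map, can be sketched as follows. The sketch is in Python rather than Java, and the pattern is a rough stand-in for the authors' lexical rules, not a reproduction of them: it only keeps runs of adjacent vowels in one token and attaches a trailing consonant when that consonant is not followed by a vowel or 'h'. `ActionCache` and `build_action` are hypothetical names.

```python
import re

VOWELS = "aeiouáâãàéêíóôõú"
CONSONANTS = "bcdfghjklmnpqrstvwxyzç"

# Simplified token: optional onset consonants, a run of vowels (never split),
# and an optional coda consonant that is not followed by a vowel or 'h'.
TOKEN = re.compile(
    f"[{CONSONANTS}]*[{VOWELS}]+(?:[{CONSONANTS}](?![{VOWELS}h]))?"
)


def tokenize(word):
    """Split a word into syllable-like tokens, keeping adjacent vowels together."""
    return TOKEN.findall(word.lower())


class ActionCache:
    """Build the adaptive action for a token on first use and reuse it afterwards."""

    def __init__(self, build_action):
        self._build = build_action  # hypothetical factory: token -> adaptive action
        self._actions = {}          # plays the role of the hash map keyed by token
        self.builds = 0             # number of actions actually constructed

    def get(self, token):
        if token not in self._actions:
            self._actions[token] = self._build(token)
            self.builds += 1
        return self._actions[token]
```

On the words of Table 1 the pattern yields `sa-bia`, `pia-no` and `ae-ra-do`. Building actions lazily means only tokens that actually occur in the input are ever constructed, which is one plausible way to reduce loading time and memory use.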
4
Results
This section presents a compilation of the results obtained in [5]. The classification was slightly changed: words that could fit in two categories were reclassified as incorrect results. Tests were run on texts published by Folha de São Paulo, obtained from the CHAVEFolha collection [10]; the result spreadsheets and the software used for the tests can be found in [11]. The test phase was divided into two parts. In the first part, the words were translated using the automata-based method; in the second part, one of the results of the set generated by the automaton was selected based on the choosing rules.
The automaton was tested with a set of 7797 words. These words were taken from journalistic articles on sports, culture, politics, technology and economy. Acronyms, names, typos and foreign words were removed from the main set, since they need not follow the rules of Portuguese. Table 3 (Word Translation) presents the results of the automaton execution, which were classified as:
1. Correct: yielded and expected translations are equal.
2. Incorrect: yielded and expected translations are different, but this does not affect understandability.
3. Doubt: the yielded translation set includes the expected translation.
4. Failure: yielded and expected translations are different, and this affects understandability.
The same texts were tagged using the VLMC Tagger [12], and the 9100 generated pairs of words and tags were used as input to test the translation method composed of the automaton and the choosing rules. Table 4 presents the results of grapheme-phoneme translation.
Table 4. Text Translation Classification
                    Correct          Incorrect        Total
(1) Choosing        8331 (91.55%)    769 (8.45%)      9100 (100%)
(2) Most Probable   8103 (89.04%)    997 (10.96%)     9100 (100%)
The results were classified as correct if the translation result equals the expected representation, or as incorrect if any kind of difference was found. The test was repeated choosing the most probable sound (2) to verify how much the choosing rules improved the accuracy of the output. The accuracy increase of about 2.5 percentage points over the whole set turns out to be very good, since 1600 pairs were classified as doubt (an accuracy increase of about 15% inside this group).
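The percentages quoted above can be recomputed from the counts in Table 4. The short Python check below is only a sketch of that arithmetic; it assumes, as the text suggests, that only pairs classified as doubt can change between the two selection strategies.

```python
# Counts taken from Table 4 (9100 tagged word/tag pairs in total).
total = 9100
correct_choosing = 8331        # (1) choosing rules
correct_most_probable = 8103   # (2) most probable sound
doubt_pairs = 1600             # pairs classified as "doubt"

acc_choosing = correct_choosing / total
acc_most_probable = correct_most_probable / total

# Gain over the whole set, in percentage points.
gain_points = 100 * (acc_choosing - acc_most_probable)

# Gain relative to the doubt group (assumption: only doubt pairs can differ
# between the two strategies).
gain_in_doubt = 100 * (correct_choosing - correct_most_probable) / doubt_pairs

print(f"{100 * acc_choosing:.2f}% vs {100 * acc_most_probable:.2f}%")
print(f"gain: {gain_points:.2f} points overall, {gain_in_doubt:.2f}% of doubt pairs")
```

The overall gain comes out at about 2.5 points, and the gain inside the doubt group at a little over 14%, which is roughly the "about 15%" quoted in the text.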
5 Conclusion
This paper presented an application for grapheme-phoneme translation for Portuguese based on adaptive automata, an implementation of the method described
in [5]. The first tests have shown that the application is quite successful: it translated 91.5% of the tested words into their expected phonetic representations and produced unexpected but still acceptable results for a large share of the remaining 8.5%. The method may be adapted to other varieties of Portuguese by changing the rules that define the sounds of a token. With changes to token characteristics such as stress rules, generated influences and received influences, the method might even be used for other languages. The accuracy rate is quite good, and it indicates that the solution can be used as part of the core of a text-to-speech translator, or at least as a method to guess the correct phonetic representation of words that are not previously known. There is still room to increase the accuracy by fine-tuning the rules and by studying characteristics that are not checked in the model, such as secondary stress. The research should continue with the improvement of the rules used for Portuguese, the study of phonetic rules for other varieties of Portuguese, and the study of rules for Spanish.
References
1. Neto, J.J.: Adaptive Automata for Context-Sensitive Languages. SIGPLAN Notices 29(9), 115–124 (1994)
2. Beck, J., Braga, D., Nogueira, J., Coelho, L., Dias, M.: Automatic Syllabification for Danish Text-to-Speech Systems. In: Proceedings of Interspeech 2009, Brighton, United Kingdom, September 6-10 (2009)
3. Braga, D.: Natural Language Processing Algorithms for TTS systems in Portuguese. PhD Thesis. La Coruña University, Spain (2008) (in Portuguese)
4. Oliveira, C., Moutinho, L., Teixeira, A.: On European Portuguese Automatic Syllabification. In: González, G., et al. (eds.) (coords), III Congreso Internacional de Fonética Experimental, Santiago de Compostela: Xunta de Galicia, pp. 461–473 (2007)
5. Shibata, D.P.: Tradução Grafema-Fonema para a Língua Portuguesa baseada em Autômatos Adaptativos, p. 91. Dissertação de Mestrado - Escola Politécnica, Universidade de São Paulo, São Paulo (2008)
6. Barbosa, P.A., Albano, E.C.: Brazilian Portuguese. Illustrations of the IPA. Journal of the International Phonetic Association 34(2), 227–232 (2004)
7. Shibata, D.P., Rocha, R.L.A.: An Adaptive Automata based method to improve the output of text-to-speech translators. In: Congress of Logic Applied to Technology, Santos, vol. 6 (2007)
8. Neto, P.C., Infante, U.: Gramática da Língua Portuguesa. 1a Edição, p. 583. Editora Scipione, São Paulo (1997)
9. International Phonetics Alphabet, http://www.langsci.ucl.ac.uk/ipa/index.html
10. Linguateca, http://www.linguateca.pt
11. Shibata, D.P.: http://sites.google.com/site/daniloshibata/
12. Kepler, F.N.: Um etiquetador morfo-sintático baseado em Cadeias de Markov de tamanho variável, p. 58, Dissertação de Mestrado, Instituto de Matemática e Estatística, Universidade de São Paulo, São Paulo (2005)
Improvement of Inventory Control under Parametric Uncertainty and Constraints
Nicholas Nechval1, Konstantin Nechval2, Maris Purgailis1, and Uldis Rozevskis1
1 University of Latvia, EVF Research Institute, Statistics Department, Raina Blvd 19, LV-1050 Riga, Latvia
2 Transport and Telecommunication Institute, Applied Mathematics Department, Lomonosov Street 1, LV-1019 Riga, Latvia
Abstract. The aim of the present paper is to show how the statistical inference equivalence principle (SIEP), the idea of which belongs to the authors, may be employed in the particular case of finding the effective statistical decisions for the multi-product inventory problems with constraints. To our knowledge, no analytical or efficient numerical method for finding the optimal policies under parametric uncertainty for the multi-product inventory problems with constraints has been reported in the literature. Using the (equivalent) predictive distributions, this paper represents an extension of analytical results obtained for unconstrained optimization under parametric uncertainty to the case of constrained optimization. A numerical example is given. Keywords: Inventory problem, parametric uncertainty, constraints, pivotal quantity, equivalent predictive inferences.
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 136–146, 2011. © Springer-Verlag Berlin Heidelberg 2011
1 Introduction
The last decade has seen a substantial research focus on the modeling, analysis and optimization of complex stochastic service systems, motivated in large measure by applications in areas such as transport, computer and telecommunication networks. Optimization issues, which broadly focus on making the best use of limited resources, are recognized as being of increasing importance. However, stochastic optimization in the context of systems and processes of any complexity is technically very difficult. Most stochastic models for the control and optimization of systems and processes are developed in the extensive literature under the assumption that the parameter values of the underlying distributions are known with certainty. In actual practice, such is simply not the case. When these models are applied to solve real-world problems, the parameters are estimated and then treated as if they were the true values. The risk associated with using estimates rather than the true parameters is called estimation risk and is often ignored. When data are limited and (or) unreliable, estimation risk may be significant, and failure to incorporate it into the model design may lead to serious errors. Its explicit consideration is important, since decision rules that are optimal in the absence of uncertainty need not even be approximately optimal in the presence of such uncertainty.
In this paper, we propose a new approach to solving constrained optimization problems under parametric uncertainty. This approach is based on the statistical inference equivalence principle, the idea of which belongs to the authors. It yields an operational, optimal information-processing rule and may be employed for finding effective statistical decisions for problems such as the multi-product newsboy problem with constraints, allocation of aircraft to routes under uncertainty, airline seat inventory control for multi-leg flights, etc. For instance, one of the above problems can be formulated as follows. An airline company operates more than one route. It has available more than one type of airplane. Each type has its relevant capacity and costs of operation. The demand on each route is known only in the form of sample data, and the question asked is: which aircraft should be allocated to which route in order to minimize the total cost (performance index) of operation? The latter involves two kinds of costs: the costs connected with running and servicing an airplane, and the costs incurred whenever a passenger is denied transportation because of a lack of seating capacity. (This latter cost is an "opportunity" cost.) We define and illustrate the use of a loss function whose cost structure is piecewise linear. Within the context of this performance index, we assume that the distribution function of the passenger demand on each route is known as a certain component of a given set of predictive models. Thus, we develop our discussion of the allocation problem in the presence of a completely specified set of predictive demand models. We formulate this problem in a probabilistic setting. Let A1, ..., Ag be the set of airplanes which the company utilizes to satisfy the passenger demand for transportation on routes 1, ..., h.
It is assumed that the company operates h routes, which are of different lengths and, consequently, of different profitabilities. Let $f_{ij}^{(k)}(y)$ represent the predictive probability density function of the passenger demand Y for transportation en route j, j∈{1, ..., h}, at the ith stage (i∈{1, …, n}) for the kth predictive model (k∈{1, …, m}). It is required to minimize the expected total cost of operation (the performance index)
\[ J_i(U_i) = \sum_{j=1}^{h} \left[ \sum_{r=1}^{g} w_{rij} u_{rij} + c_j \int_{Q_{ij}}^{\infty} (y - Q_{ij}) f_{ij}^{(k)}(y)\,dy \right] \tag{1} \]
subject to
\[ \sum_{j=1}^{h} u_{rij} \le a_{ri}, \quad r = 1, \ldots, g, \qquad \text{where } Q_{ij} = \sum_{r=1}^{g} u_{rij} q_{rj}, \quad j = 1, \ldots, h, \tag{2} \]
$U_i = \{u_{rij}\}$ is the g × h matrix, $u_{rij}$ is the number of units of airplane $A_r$ allocated to the jth route at the ith stage, $w_{rij}$ is the operation cost of airplane $A_r$ for the jth route at the ith stage, $c_j$ is the price of a one-way ticket for air travel en the jth route, $q_{rj}$ is the limited seating capacity of airplane $A_r$ for the jth route, and $a_{ri}$ is the available number of units of airplane $A_r$ at the ith stage. To use the observation data of the real airline system more effectively, the technique proposed in this paper might be employed to optimize the statistical decisions under parametric uncertainty and constraints derived from the analytical model (1)-(2).
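To make the performance index (1) and the constraint (2) concrete, the sketch below evaluates them for a single stage i. It is an illustration under stated assumptions, not the authors' procedure: the predictive demand density is assumed to be exponential with a given mean (so that the integral in (1) has the closed form mean·exp(−Q/mean)), and all function names are hypothetical.

```python
import math


def expected_unmet_demand(capacity: float, mean_demand: float) -> float:
    """E[(Y - Q)^+] for exponentially distributed demand Y with the given mean.

    The exponential form is an assumption made for this sketch only.
    """
    return mean_demand * math.exp(-capacity / mean_demand)


def total_cost(u, w, q, c, mean_demand) -> float:
    """Performance index (1) for one stage.

    u[r][j]: planes of type r on route j; w[r][j]: operating cost;
    q[r][j]: seats of type r on route j; c[j]: ticket price on route j.
    """
    g, h = len(u), len(u[0])
    cost = 0.0
    for j in range(h):
        seats = sum(u[r][j] * q[r][j] for r in range(g))       # Q_ij in (2)
        running = sum(w[r][j] * u[r][j] for r in range(g))     # operating costs
        cost += running + c[j] * expected_unmet_demand(seats, mean_demand[j])
    return cost


def is_feasible(u, a) -> bool:
    """Constraint (2): planes of type r used over all routes must not exceed a[r]."""
    return all(sum(row) <= a_r for row, a_r in zip(u, a))


# Toy instance with one plane type and one route (illustrative numbers).
example_cost = total_cost(u=[[1]], w=[[10.0]], q=[[100.0]], c=[2.0], mean_demand=[100.0])
```

An allocation U_i would then be chosen among the feasible matrices so as to minimize `total_cost`.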
2 Inference Equivalence Principle
In the general formulation of decision theory, we observe a random variable X (which may be multivariate) with distribution function F(x|θ), where the parameter θ (in general, a vector) is unknown, θ∈Θ; if we choose a decision d from the set D of all possible decisions, then we suffer a loss l(d,θ). A "decision rule" is a method of choosing d from D after observing x∈𝒳, that is, a function u(x)=d. Our average loss (called risk) E_x{l(u(X),θ)} is a function of both θ and the decision rule u(⋅), called the risk function r(u,θ), and is the criterion by which rules are compared. Thus, the expected loss (gains are negative losses) is a primary consideration in evaluating decisions. We will now define the major quantities just introduced. A general statistical decision problem is a triplet (Θ,D,l) and a random variable X. The random variable X (called the data) has a distribution function F(x|θ), where θ is unknown but it is known that θ∈Θ. 𝒳 will denote the set of possible values of the random variable X. θ is called the state of nature, while the nonempty set Θ is called the parameter space. The nonempty set D is called the decision space or action space. Finally, l is called the loss function, and to each θ∈Θ and d∈D it assigns a real number l(d,θ). For a statistical decision problem (Θ,D,l), X, a (nonrandomized) decision rule is a function u(⋅) which to each x∈𝒳 assigns a member d of D: u(x)=d. The risk function r(u,θ) of a decision rule u(X) for a statistical decision problem (Θ,D,l), X (the expected loss or average loss when θ is the state of nature and a decision is chosen by rule u(⋅)) is r(u,θ)=E_x{l(u(X),θ)}. This paper is concerned with the implications of group theoretic structure for invariant loss functions. Our underlying structure consists of a class of probability models (𝒳, 𝒜, 𝒫), a one-one mapping ψ taking 𝒫 onto an index set Θ, a measurable space of actions (D, ℬ), and a real-valued loss function
\[ l(d, \theta) = E_x\{ l^{D}(d, X) \} \tag{3} \]
defined on Θ × D, where $l^{D}(d, X)$ is a random loss function with a random variable X∈(0,∞) (or (−∞,∞)). We assume that a group G of one-one 𝒜-measurable transformations acts on 𝒳 and that it leaves the class of models (𝒳, 𝒜, 𝒫) invariant. We further assume that homomorphic images $\bar{G}$ and $\tilde{G}$ of G act on Θ and D, respectively ($\bar{G}$ may be induced on Θ through ψ; $\tilde{G}$ may be induced on D through l). We shall say that l is invariant if for every (θ, d) ∈ Θ × D
\[ l(\tilde{g} d, \bar{g} \theta) = l(d, \theta), \quad g \in G. \tag{4} \]
A loss function l(d, θ) can be transformed as follows:
\[ l(d, \theta) = l(\tilde{g}_{\hat{\theta}}^{-1} d, \bar{g}_{\hat{\theta}}^{-1} \theta) = l^{\#}(\eta, V), \tag{5} \]
where $V = V(\theta, \hat{\theta})$ is a pivotal quantity whose distribution does not depend on the unknown parameter θ; $\eta = \eta(d, \hat{\theta})$ is an ancillary factor; $\hat{\theta}$ is a maximum likelihood estimator of θ (or a sufficient statistic for θ). Then the best invariant decision rule (BIDR) is given by
\[ u^{\mathrm{BIDR}} \equiv d^{*} = \eta^{-1}(\eta^{*}, \hat{\theta}), \quad \text{where } \eta^{*} = \arg\inf_{\eta} E_v\{ l^{\#}(\eta, V) \}, \tag{6} \]
and a risk function
\[ r(u^{\mathrm{BIDR}}, \theta) = E_{\theta}\{ l(u^{\mathrm{BIDR}}, \theta) \} = E_v\{ l^{\#}(\eta^{*}, V) \} \tag{7} \]
does not depend on θ. Consider now a situation described by one of a family of density functions f(x|μ,σ) indexed by the vector parameter θ=(μ,σ), where μ and σ (>0) are, respectively, parameters of location and scale. For this family, invariant under the group of positive linear transformations x → ax+b with a > 0, we shall assume that there is obtainable from some informative experiment (a random sample of observations X=(X1, …, Xn)) a sufficient statistic (M,S) for (μ,σ) with density function h(m,s|μ,σ) of the form
\[ h(m, s \mid \mu, \sigma) = \sigma^{-2} h_{\bullet}[(m - \mu)/\sigma, \; s/\sigma] \tag{8} \]
such that
\[ h(m, s \mid \mu, \sigma)\,dm\,ds = h_{\bullet}(v_1, v_2)\,dv_1\,dv_2, \tag{9} \]
where V1=(M−μ)/σ, V2=S/σ. We are thus assuming that for the family of density functions an induced invariance holds under the group G of transformations: m→am+b, s→as (a>0). The family of density functions f(x|μ,σ) satisfying the above conditions is, of course, the limited one of normal, negative exponential, Weibull and gamma (with known index) density functions. The structure of the problem is, however, more clearly seen within the general framework. Suppose that we deal with a loss function $l^{+}(d, \theta) = E_x\{ l^{D}(d, X) \} = \omega(\sigma)\, l(d, \theta)$, where ω(σ) is some function of σ and $\omega(\sigma) = \omega_{\bullet}(V_2, S)$. In order to obtain an equivalent prediction loss function $l^{\bullet}(d, m, s)$, which is independent of θ and has the same optimal invariant statistical decision rule given by (6), i.e.,
\[ \arg\min_{d} l^{\bullet}(d, M, S) = d^{*} \equiv u^{\mathrm{BIDR}}, \tag{10} \]
with a risk given by
\[ E_{m,s}\{ l^{\bullet}(u^{\mathrm{BIDR}}, M, S) \} = \omega(\sigma)\, r(u^{\mathrm{BIDR}}, \theta), \tag{11} \]
we define an equivalent predictive probability density function of a random variable X (with a probability density function f(x|μ,σ)) as
\[ f^{\bullet}(x \mid m, s) = \iint_{v_1, v_2} f(x, v_1, v_2 \mid m, s)\, h_{\bullet\bullet}(v_1, v_2)\,dv_1\,dv_2, \tag{12} \]
where
\[ f(x, v_1, v_2 \mid m, s) = f(x \mid \mu, \sigma), \tag{13} \]
\[ h_{\bullet\bullet}(v_1, v_2) = \omega_{\bullet}^{-1}(v_2, s)\, h_{\bullet}(v_1, v_2) \left( \iint_{v_1, v_2} \omega_{\bullet}^{-1}(v_2, s)\, h_{\bullet}(v_1, v_2)\,dv_1\,dv_2 \right)^{-1}. \tag{14} \]
Then $l^{\bullet}(d, m, s)$ is given by
\[ l^{\bullet}(d, m, s) = E_x\{ l^{D}(d, X) \mid m, s \} = \int_{x} l^{D}(d, x)\, f^{\bullet}(x \mid m, s)\,dx. \tag{15} \]
Now the predictive loss function $l^{\bullet}(d, m, s)$ can be used to obtain efficient frequentist statistical decisions under parameter uncertainty for constrained optimization problems, where the known approaches are unable to do so.
3 Newsboy Problem with No Constraints
The classical newsboy problem is reflective of many real life situations and is often used to aid decision-making in the fashion and sporting industries, both at the manufacturing and retail levels (Gallego and Moon [1]). The newsboy problem can also be used in managing capacity and evaluating advanced booking of orders in service industries such as airlines and hotels (Weatherford and Pfeifer [2]). A partial review of the newsboy problem literature has been recently conducted in a textbook by Silver et al. [3]. Researchers have followed two approaches to solving the newsboy problem. In the first approach, the expected costs of overestimating and underestimating demand are minimized. In the second approach, the expected profit is maximized. Both approaches yield the same results. We use the first approach in stating the newsboy problem. For product j, define:
$X_j$: quantity demanded during the period, a random variable,
$f_j(x_j \mid \mu_j, \sigma_j)$: the probability density function of $X_j$,
$\theta_j = (\mu_j, \sigma_j)$: the parameter of $f_j(x_j \mid \mu_j, \sigma_j)$,
$F_j(x_j \mid \mu_j, \sigma_j)$: the cumulative distribution function of $X_j$,
$c_j^{(1)}$: overage (excess) cost per unit,
$c_j^{(2)}$: underage (shortage) cost per unit,
$d_j$: inventory/order quantity, a decision variable.
The cost per period is
\[ l_j^{D}(d_j, X_j) = \begin{cases} c_j^{(1)} (d_j - X_j), & \text{if } X_j < d_j, \\ c_j^{(2)} (X_j - d_j), & \text{if } X_j \ge d_j. \end{cases} \tag{16} \]
Complete information. A standard newsboy formulation (see, e.g., Nahmias [4]) is to consider each product j's cost function:
\[ l_j^{+}(d_j, \theta_j) = c_j^{(1)} \int_{-\infty}^{d_j} (d_j - x_j) f_j(x_j \mid \mu_j, \sigma_j)\,dx_j + c_j^{(2)} \int_{d_j}^{\infty} (x_j - d_j) f_j(x_j \mid \mu_j, \sigma_j)\,dx_j. \tag{17} \]
Expanding (17) gives
\[ l_j^{+}(d_j, \theta_j) = -c_j^{(1)} \int_{-\infty}^{d_j} x_j f_j(x_j \mid \mu_j, \sigma_j)\,dx_j + c_j^{(2)} \int_{d_j}^{\infty} x_j f_j(x_j \mid \mu_j, \sigma_j)\,dx_j + (c_j^{(1)} + c_j^{(2)})\, d_j \left[ F_j(d_j \mid \mu_j, \sigma_j) - \frac{c_j^{(2)}}{c_j^{(1)} + c_j^{(2)}} \right]. \tag{18} \]
Let the superscript * denote optimality. Using Leibniz's rule to obtain the first and second derivatives shows that $l_j^{+}(d_j, \theta_j)$ is convex in $d_j$, since its second derivative $(c_j^{(1)} + c_j^{(2)}) f_j(d_j \mid \mu_j, \sigma_j)$ is nonnegative. The sufficient optimality condition is the well-known fractile formula:
\[ F_j(d_j^{*} \mid \mu_j, \sigma_j) = \frac{c_j^{(2)}}{c_j^{(1)} + c_j^{(2)}}. \tag{19} \]
It follows from (19) that
\[ d_j^{*} = F_j^{-1}\!\left[ \frac{c_j^{(2)}}{c_j^{(1)} + c_j^{(2)}} \,\Big|\, \mu_j, \sigma_j \right]. \tag{20} \]
At optimality, substituting (19) into the last (bracketed) term in Eq. (18) gives
\[ (c_j^{(1)} + c_j^{(2)})\, d_j^{*} \left( F_j(d_j^{*} \mid \mu_j, \sigma_j) - \frac{c_j^{(2)}}{c_j^{(1)} + c_j^{(2)}} \right) = 0. \tag{21} \]
Hence (18) reduces to
\[ l_j^{+}(d_j^{*}, \theta_j) = c_j^{(2)} E_{x_j}\{X_j\} - (c_j^{(1)} + c_j^{(2)}) \int_{-\infty}^{d_j^{*}} x_j f_j(x_j \mid \mu_j, \sigma_j)\,dx_j. \tag{22} \]
Parametric Uncertainty. Let us assume that the functional form of the probability density function $f_j(x_j \mid \mu_j, \sigma_j)$ is specified but its parameter $\theta_j = (\mu_j, \sigma_j)$ is not specified. Let Xj=(Xj1, …, Xjn) be a random sample of observations on a continuous random variable Xj. We shall assume that there is obtainable from this random sample a sufficient statistic (Mj,Sj) for $\theta_j = (\mu_j, \sigma_j)$ with density function of the form (8),
\[ h_j(m_j, s_j \mid \mu_j, \sigma_j) = \sigma_j^{-2} h_{\bullet j}[(m_j - \mu_j)/\sigma_j, \; s_j/\sigma_j], \tag{23} \]
and with
\[ h_j(m_j, s_j \mid \mu_j, \sigma_j)\,dm_j\,ds_j = h_{\bullet j}(v_{1j}, v_{2j})\,dv_{1j}\,dv_{2j}, \tag{24} \]
where V1j=(Mj−μj)/σj, V2j=Sj/σj.
Using an invariant embedding technique (Nechval et al. [5-8]), we transform (17) as follows:
\[ l_j^{+}(d_j, \theta_j) = \omega_j(\sigma_j)\, l_j^{\#}(\eta_j, V_j), \tag{25} \]
where $\omega_j(\sigma_j) = \sigma_j$,
\[ l_j^{\#}(\eta_j, V_j) = c_j^{(1)} \int_{-\infty}^{\eta_j V_{2j} + V_{1j}} (\eta_j V_{2j} + V_{1j} - z_j) f_j(z_j)\,dz_j + c_j^{(2)} \int_{\eta_j V_{2j} + V_{1j}}^{\infty} (z_j - \eta_j V_{2j} - V_{1j}) f_j(z_j)\,dz_j, \tag{26} \]
$Z_j = (X_j - \mu_j)/\sigma_j$ is a pivotal quantity, $f_j(z_j)$ is defined by $f_j(x_j \mid \mu_j, \sigma_j)$, i.e.,
\[ f_j(z_j)\,dz_j = f_j(x_j \mid \mu_j, \sigma_j)\,dx_j, \tag{27} \]
$V_j = (V_{1j}, V_{2j})$ is a pivotal quantity, and $\eta_j = (d_j - M_j)/S_j$ is an ancillary factor. It follows from (25) that the risk associated with $u_j^{\mathrm{BIDR}}$ (or $\eta_j^{*}$) can be expressed as
\[ r_j^{+}(u_j^{\mathrm{BIDR}}, \theta_j) = E_{m_j, s_j}\{ l_j^{+}(u_j^{\mathrm{BIDR}}, \theta_j) \} = \omega_j(\sigma_j)\, E_{v_j}\{ l_j^{\#}(\eta_j^{*}, V_j) \}, \tag{28} \]
where
\[ u_j^{\mathrm{BIDR}} \equiv d_j^{*} = M_j + \eta_j^{*} S_j, \qquad \eta_j^{*} = \arg\min_{\eta_j} E_{v_j}\{ l_j^{\#}(\eta_j, V_j) \}, \tag{29} \]
\[ E_{v_j}\{ l_j^{\#}(\eta_j, V_j) \} = \iint_{v_{1j}, v_{2j}} l_j^{\#}(\eta_j; v_{1j}, v_{2j})\, h_{\bullet j}(v_{1j}, v_{2j})\,dv_{1j}\,dv_{2j}. \tag{30} \]
The fact that (30) is independent of θj means that an ancillary factor $\eta_j^{*}$, which minimizes (30), is uniformly best invariant. Thus, $d_j^{*}$ given by (29) is the best invariant decision rule.
4 Numerical Example
Complete Information. Assuming that the demand for product j, Xj, is exponentially distributed with the probability density function
\[ f_j(x_j \mid \sigma_j) = (1/\sigma_j) \exp(-x_j/\sigma_j) \quad (x_j > 0), \tag{31} \]
it follows from (17), (20) and (22) that
\[ l_j^{+}(d_j, \sigma_j) = c_j^{(1)} (d_j - \sigma_j) + (c_j^{(1)} + c_j^{(2)})\, \sigma_j \exp(-d_j/\sigma_j), \tag{32} \]
\[ d_j^{*} = \sigma_j \ln(1 + c_j^{(2)}/c_j^{(1)}), \quad \text{and} \quad l_j^{+}(d_j^{*}, \sigma_j) = c_j^{(1)} \sigma_j \ln(1 + c_j^{(2)}/c_j^{(1)}), \tag{33} \]
respectively.
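The closed forms (32) and (33) can be checked numerically. The following Python sketch does so for illustrative parameter values (σ = 10, c1 = 1, c2 = 100 are assumptions chosen for the example, not values from the paper); `expected_cost` and `optimal_order` are hypothetical helper names.

```python
import math


def expected_cost(d: float, sigma: float, c1: float, c2: float) -> float:
    """Eq. (32): expected overage plus underage cost for exponential demand."""
    return c1 * (d - sigma) + (c1 + c2) * sigma * math.exp(-d / sigma)


def optimal_order(sigma: float, c1: float, c2: float) -> float:
    """Eq. (33): fractile solution d* = sigma * ln(1 + c2 / c1)."""
    return sigma * math.log(1.0 + c2 / c1)


# Illustrative values (assumptions made for this check only).
sigma, c1, c2 = 10.0, 1.0, 100.0
d_star = optimal_order(sigma, c1, c2)
cost_star = expected_cost(d_star, sigma, c1, c2)

# The exponential CDF evaluated at d* should equal the critical fractile (19).
fractile = 1.0 - math.exp(-d_star / sigma)
```

At `d_star` the cost collapses to c1·σ·ln(1 + c2/c1), as stated in (33), and moving the order quantity in either direction increases `expected_cost`.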
Parametric Uncertainty. Consider the case when the parameter σj is unknown. Let Xj=(Xj1, …, Xjn) be a random sample of observations (each with density function (31)) on a continuous random variable Xj. Then
\[ S_j = \sum_{i=1}^{n} X_{ji} \tag{34} \]
is a sufficient statistic for σj; Sj is distributed with
\[ h_j(s_j \mid \sigma_j) = [\Gamma(n) \sigma_j^{n}]^{-1} s_j^{n-1} \exp(-s_j/\sigma_j) \quad (s_j > 0), \tag{35} \]
so that
\[ h_{\bullet j}(v_{2j}) = [\Gamma(n)]^{-1} v_{2j}^{n-1} e^{-v_{2j}} \quad (v_{2j} > 0). \tag{36} \]
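It follows from (28) and (32) that
\[ r_j^{+}(u_j^{\mathrm{BIDR}}, \sigma_j) = E_{s_j}\{ l_j^{+}(u_j^{\mathrm{BIDR}}, \sigma_j) \} = \sigma_j \int_{0}^{\infty} l_j^{\#}(\eta_j^{*}, v_{2j})\, h_{\bullet j}(v_{2j})\,dv_{2j} = \sigma_j \left[ c_j^{(1)} (n \eta_j^{*} - 1) + (c_j^{(1)} + c_j^{(2)}) (1 + \eta_j^{*})^{-n} \right], \tag{37} \]
where
\[ u_j^{\mathrm{BIDR}} = \eta_j^{*} S_j, \tag{38} \]
\[ \eta_j^{*} = \arg\min_{\eta_j} \sigma_j \left[ c_j^{(1)} (n \eta_j - 1) + \frac{c_j^{(1)} + c_j^{(2)}}{(1 + \eta_j)^{n}} \right] = \left[ 1 + \frac{c_j^{(2)}}{c_j^{(1)}} \right]^{1/(n+1)} - 1. \tag{39} \]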
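The closed-form minimizer in (39) can be cross-checked against a brute-force search over η. The Python sketch below does this for illustrative values n = 5, c1 = 1, c2 = 100 (assumptions made for the check); `risk_bracket` is the bracketed term of (37), i.e. the risk of the rule d = ηS divided by σ.

```python
def risk_bracket(eta: float, n: int, c1: float, c2: float) -> float:
    """Bracketed term of Eq. (37): risk of the rule d = eta * S divided by sigma."""
    return c1 * (n * eta - 1.0) + (c1 + c2) * (1.0 + eta) ** (-n)


def eta_star(n: int, c1: float, c2: float) -> float:
    """Closed-form minimizer from Eq. (39)."""
    return (1.0 + c2 / c1) ** (1.0 / (n + 1)) - 1.0


# Illustrative values (assumptions made for this check only).
n, c1, c2 = 5, 1.0, 100.0

# Brute-force search on a grid of step 1e-4 over [0, 5].
grid = [k * 1e-4 for k in range(50001)]
eta_grid = min(grid, key=lambda e: risk_bracket(e, n, c1, c2))
eta_closed = eta_star(n, c1, c2)
```

Because the bracket is convex in η, the grid minimizer lies within one grid step of the closed-form value.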
Comparison of Decision Rules. For comparison, consider the maximum likelihood decision rule (MLDR) that may be obtained from (33),
\[ u_j^{\mathrm{MLDR}} = \hat{\sigma}_j \ln(1 + c_j^{(2)}/c_j^{(1)}) = \eta_j^{\mathrm{MLDR}} S_j, \tag{40} \]
where $\hat{\sigma}_j = S_j/n$ is the maximum likelihood estimator of σj,
\[ \eta_j^{\mathrm{MLDR}} = \ln\left[ (1 + c_j^{(2)}/c_j^{(1)})^{1/n} \right]. \tag{41} \]
Since $u_j^{\mathrm{BIDR}}$ and $u_j^{\mathrm{MLDR}}$ belong to the same class
\[ \mathcal{C} = \{ u_j : u_j = \eta_j S_j \}, \tag{42} \]
it follows from the above that $u_j^{\mathrm{MLDR}}$ is inadmissible in relation to $u_j^{\mathrm{BIDR}}$. If, say, n=1 and $c_j^{(2)}/c_j^{(1)} = 100$, we have that
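\[ \mathrm{rel.eff.}_{r_j^{+}}\{ u_j^{\mathrm{MLDR}}, u_j^{\mathrm{BIDR}}, \sigma_j \} = \frac{r_j^{+}(u_j^{\mathrm{BIDR}}, \sigma_j)}{r_j^{+}(u_j^{\mathrm{MLDR}}, \sigma_j)} = \left( n \eta_j^{*} - 1 + \frac{1 + c_j^{(2)}/c_j^{(1)}}{(1 + \eta_j^{*})^{n}} \right) \left( n \eta_j^{\mathrm{MLDR}} - 1 + \frac{1 + c_j^{(2)}/c_j^{(1)}}{(1 + \eta_j^{\mathrm{MLDR}})^{n}} \right)^{-1} = 0.838. \tag{43} \]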
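The value in (43) can be reproduced directly from (37), (39) and (41). The following Python lines are a small check with n = 1 and c2/c1 = 100, as in the text; `risk_bracket` is a hypothetical helper that returns the risk divided by σ·c1.

```python
import math


def risk_bracket(eta: float, n: int, ratio: float) -> float:
    """Risk of the rule d = eta * S divided by sigma * c1, cf. Eq. (37).

    `ratio` stands for c2 / c1.
    """
    return n * eta - 1.0 + (1.0 + ratio) / (1.0 + eta) ** n


n, ratio = 1, 100.0
eta_bidr = (1.0 + ratio) ** (1.0 / (n + 1)) - 1.0   # Eq. (39)
eta_mldr = math.log(1.0 + ratio) / n                # Eq. (41)

rel_eff = risk_bracket(eta_bidr, n, ratio) / risk_bracket(eta_mldr, n, ratio)
print(round(rel_eff, 3))  # prints 0.838
```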
Thus, in this case, the use of $u_j^{\mathrm{BIDR}}$ leads to a reduction in the risk of about 16.2 % as compared with $u_j^{\mathrm{MLDR}}$. The absolute risk will be proportional to σj and may be considerable.
Equivalent Predictive Loss Function. In order to obtain an equivalent predictive loss function $l_j^{\bullet}(d_j, S_j)$, which is independent of σj and has the same optimal invariant statistical solution given by (29), i.e.,
\[ \arg\min_{d_j} l_j^{\bullet}(d_j, S_j) = d_j^{*} \equiv u_j^{\mathrm{BIDR}}, \tag{44} \]
with a risk given by
\[ E_{s_j}\{ l_j^{\bullet}(u_j^{\mathrm{BIDR}}, S_j) \} = r_j^{+}(u_j^{\mathrm{BIDR}}, \sigma_j), \tag{45} \]
we define (on the basis of (12)) an equivalent predictive distribution of a random variable Xj as
\[ f_j^{\bullet}(x_j \mid s_j) = \frac{n+1}{s_j} \left( 1 + \frac{x_j}{s_j} \right)^{-(n+2)} \ (x_j > 0) \quad \text{or} \quad F_j^{\bullet}(x_j \mid s_j) = 1 - \left( 1 + \frac{x_j}{s_j} \right)^{-(n+1)}. \tag{46} \]
Then $l_j^{\bullet}(d_j, s_j)$ is given by
\[ l_j^{\bullet}(d_j, s_j) = E_{x_j}\{ l_j^{D}(d_j, X_j) \mid s_j \} = \int_{0}^{\infty} l_j^{D}(d_j, x_j)\, f_j^{\bullet}(x_j \mid s_j)\,dx_j = (s_j/n) \left[ c_j^{(1)} (n d_j s_j^{-1} - 1) + (c_j^{(1)} + c_j^{(2)}) (1 + d_j s_j^{-1})^{-n} \right]. \tag{47} \]
Now the equivalent predictive loss function $l_j^{\bullet}(d_j, s_j)$ can be used to obtain efficient frequentist statistical solutions under parameter uncertainty for constrained optimization problems, where the known approaches are unable to do so.
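One way to see why the predictive loss (47) leads back to the best invariant rule is to apply the fractile formula (19) to the predictive distribution (46). The Python sketch below checks this numerically; the values n = 3, s = 12, c1 = 1, c2 = 100 are illustrative assumptions, and the function names are hypothetical.

```python
def predictive_cdf(x: float, s: float, n: int) -> float:
    """Eq. (46): equivalent predictive distribution function."""
    return 1.0 - (1.0 + x / s) ** (-(n + 1))


def predictive_quantile(p: float, s: float, n: int) -> float:
    """Inverse of Eq. (46)."""
    return s * ((1.0 - p) ** (-1.0 / (n + 1)) - 1.0)


def predictive_loss(d: float, s: float, n: int, c1: float, c2: float) -> float:
    """Eq. (47): equivalent predictive loss function."""
    return (s / n) * (c1 * (n * d / s - 1.0) + (c1 + c2) * (1.0 + d / s) ** (-n))


# Illustrative values (assumptions made for this check only).
n, s, c1, c2 = 3, 12.0, 1.0, 100.0

# Fractile formula (19) applied to the predictive distribution ...
d_pred = predictive_quantile(c2 / (c1 + c2), s, n)
# ... compared with the best invariant rule (38)-(39).
d_bidr = ((1.0 + c2 / c1) ** (1.0 / (n + 1)) - 1.0) * s
```

Both routes give the same order quantity, and `predictive_loss` increases when moving away from it in either direction.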
5 Newsboy Problem with Constraints
Complete Information. Define $w_j$ (>0) as product j's per-unit requirement of a constrained resource, and $w_\Sigma$ as the maximum availability of the resource. The formulation for minimizing the total expected cost of N products subject to one capacity constraint is as follows:
Minimize
\[ \sum_{j=1}^{N} l_j^{+}(d_j, \theta_j) = \sum_{j=1}^{N} \left[ c_j^{(1)} \int_{0}^{d_j} (d_j - x_j) f_j(x_j \mid \mu_j, \sigma_j)\,dx_j + c_j^{(2)} \int_{d_j}^{\infty} (x_j - d_j) f_j(x_j \mid \mu_j, \sigma_j)\,dx_j \right] = \sum_{j=1}^{N} \left[ c_j^{(1)} \int_{-\infty}^{d_j} F_j(x_j \mid \mu_j, \sigma_j)\,dx_j + c_j^{(2)} \int_{d_j}^{\infty} [1 - F_j(x_j \mid \mu_j, \sigma_j)]\,dx_j \right] \tag{48} \]
subject to
\[ \sum_{j=1}^{N} w_j d_j \le w_\Sigma. \tag{49} \]
The above problem can be solved as follows. Compute $d_j^{*}$ for each product j with Eq. (20) and check whether $\sum_j w_j d_j^{*}$ exceeds $w_\Sigma$. If it does not, the capacity constraint is non-operative, and the optimal order quantity is $d_j^{*}$, ∀j=1(1)N. Otherwise, the constraint is set to equality and the Lagrange function is introduced.
Parametric Uncertainty. In this case, the problem is as follows: Minimize the total equivalent predictive loss function
\[ \sum_{j=1}^{N} l_j^{\bullet}(d_j, m_j, s_j) = \sum_{j=1}^{N} \left[ c_j^{(1)} \int_{0}^{d_j} (d_j - x_j) f_j^{\bullet}(x_j \mid m_j, s_j)\,dx_j + c_j^{(2)} \int_{d_j}^{\infty} (x_j - d_j) f_j^{\bullet}(x_j \mid m_j, s_j)\,dx_j \right] \quad \text{subject to} \quad \sum_{j=1}^{N} w_j d_j \le w_\Sigma. \tag{50} \]
Now we can obtain the effective statistical solutions under capacity constraint and parametric uncertainty by solving this problem in the same manner as in the case of complete information, namely:
\[ d_j^{*} = F_j^{\bullet\,-1}\!\left( [c_j^{(2)} - \lambda w_j][c_j^{(1)} + c_j^{(2)}]^{-1} \mid m_j, s_j \right), \quad \forall j = 1(1)N, \tag{51} \]
where the value of the Lagrange multiplier λ can be determined by solving the single-variable (λ) non-linear equation
\[ \sum_{j=1}^{N} w_j F_j^{\bullet\,-1}\!\left( [c_j^{(2)} - \lambda w_j][c_j^{(1)} + c_j^{(2)}]^{-1} \mid m_j, s_j \right) - w_\Sigma = 0. \tag{52} \]
Consider, for instance, the case of the numerical example of Section 4, with N = 2, $s_j = s$, $c_j^{(1)} = c_1$, $c_j^{(2)} = c_2$, $c_j^{(2)}/c_j^{(1)} = 100$, $w_j = 1$ for j∈{1, 2}. We find (with $n_1 = n_2 = 1$ and $w_\Sigma = 14s$) that in this case the use of $u_j^{\mathrm{BIDR}}$ (j = 1, 2) leads to a reduction in the risk of about 14 % as compared with $u_j^{\mathrm{MLDR}}$ (j = 1, 2).
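Equations (51) and (52) can be solved with a one-dimensional root search. The Python sketch below uses bisection on λ with the predictive distribution (46) and reproduces the two-product example with s = 1. The paper does not spell out how the 14 % figure was computed; the final lines rest on the assumption that the unconstrained MLDR quantities already satisfy the constraint (2 ln 101 ≈ 9.23 ≤ 14) and that risks are compared through the bracket of (37). All function names are hypothetical.

```python
import math


def predictive_quantile(p: float, s: float, n: int) -> float:
    """Inverse of the predictive distribution function (46)."""
    return s * ((1.0 - p) ** (-1.0 / (n + 1)) - 1.0)


def order_quantity(lam, w, s, n, c1, c2):
    """Eq. (51); the fractile is clipped at 0 so that d stays non-negative."""
    p = max(0.0, (c2 - lam * w) / (c1 + c2))
    return predictive_quantile(p, s, n)


def solve_lambda(w, s, n, c1, c2, w_total, tol=1e-12):
    """Bisection on Eq. (52); assumes the constraint is violated at lam = 0."""
    def excess(lam):
        return sum(wj * order_quantity(lam, wj, sj, nj, c1, c2)
                   for wj, sj, nj in zip(w, s, n)) - w_total

    lo, hi = 0.0, c2 / min(w)          # at hi every order quantity is 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Example of the text with s = 1: N = 2, n_j = 1, c2/c1 = 100, w_j = 1, w_total = 14 s.
w, s, n, c1, c2 = [1.0, 1.0], [1.0, 1.0], [1, 1], 1.0, 100.0
lam = solve_lambda(w, s, n, c1, c2, w_total=14.0)
d = [order_quantity(lam, wj, sj, nj, c1, c2) for wj, sj, nj in zip(w, s, n)]


def risk_bracket(eta, n_obs, ratio):
    """Risk of d = eta * S divided by sigma * c1, cf. Eq. (37)."""
    return n_obs * eta - 1.0 + (1.0 + ratio) / (1.0 + eta) ** n_obs


eta_constrained = d[0] / s[0]
eta_mldr = math.log(1.0 + c2 / c1)
reduction = 1.0 - risk_bracket(eta_constrained, 1, c2 / c1) / risk_bracket(eta_mldr, 1, c2 / c1)
```

Under these assumptions both products receive 7s units, and `reduction` comes out close to the 14 % quoted above.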
6 Conclusion
In this paper, we propose a new approach to solving constrained optimization problems under parametric uncertainty. It is especially efficient when we deal with asymmetric loss functions and small data samples. The results obtained in the paper agree with computer simulation results, which confirm the validity of the theoretical predictions of the performance of the suggested approach.
References
1. Gallego, G., Moon, I.: The Distribution Free Newsboy Problem: Review and Extensions. The Journal of the Operational Research Society 44, 825–834 (1993)
2. Weatherford, L.R., Pfeifer, P.E.: The Economic Value of Using Advance Booking of Orders. Omega 22, 105–111 (1994)
3. Silver, E.A., Pyke, D.F., Peterson, R.P.: Inventory Management and Production Planning and Scheduling. John Wiley, New York (1998)
4. Nahmias, S.: Production and Operations Management. Irwin, Boston (1996)
5. Nechval, N.A., Nechval, K.N., Vasermanis, E.K.: Optimization of Interval Estimators via Invariant Embedding Technique. IJCAS (The International Journal of Computing Anticipatory Systems) 9, 241–255 (2001)
6. Nechval, N.A., Nechval, K.N., Vasermanis, E.K.: Effective State Estimation of Stochastic Systems. Kybernetes (The International Journal of Systems & Cybernetics) 32, 666–678 (2003)
7. Nechval, N.A., Nechval, K.N., Vasermanis, E.K.: Prediction Intervals for Future Outcomes with a Minimum Length Property. Computer Modelling and New Technologies 8, 48–61 (2004)
8. Nechval, N.A., Berzins, G., Purgailis, M., Nechval, K.N.: Improved Estimation of State of Stochastic Systems via Invariant Embedding Technique. WSEAS Transactions on Mathematics 7, 141–159 (2008)
Modified Jakubowski Shape Transducer for Detecting Osteophytes and Erosions in Finger Joints
Marzena Bielecka1, Andrzej Bielecki2, Mariusz Korkosz3, Marek Skomorowski2, Wadim Wojciechowski4, and Bartosz Zieliński2
1 Department of Geoinformatics and Applied Computer Science, Faculty of Geology, Geophysics and Environmental Protection, AGH University of Science and Technology, Mickiewicza 30, 30-059 Cracow, Poland
2 Institute of Computer Science, Jagiellonian University, Łojasiewicza 6, 30-348 Cracow, Poland, {bielecki,skomorowski}@ii.uj.edu.pl
3 Division of Rheumatology, Department of Internal Medicine and Gerontology, Jagiellonian University Hospital, Śniadeckich 10, 31-531 Cracow, Poland
4 Department of Radiology, Jagiellonian University Hospital, Kopernika 19, 31-531 Cracow, Poland
Abstract. In this paper, a syntactic method of pattern recognition is applied to the interpretation of hand radiographs in order to recognize erosions and osteophytes in the finger joints. It is shown that the classical Jakubowski transducer does not distinguish contours of healthy bones from contours of affected bones. Therefore, modifications of the transducer are introduced. It is demonstrated that the modified transducer correctly recognizes the classes of bone shapes obtained from the medical classification: healthy bone class, erosion bone class and osteophyte bone class. Keywords: Syntactic method of pattern recognition, Medical imaging, Computer assisted rheumatic diagnosis.
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 147–155, 2011. © Springer-Verlag Berlin Heidelberg 2011
1 Introduction
Arthritis and musculoskeletal disorders are more prevalent and more frequent causes of disability than heart disease or cancer [11]. There are a number of inflammatory as well as non-inflammatory diseases within the scope of rheumatology and diagnostic radiology. It is essential to distinguish between inflammatory disorders, which can be fatal, and non-inflammatory disorders, which are relatively harmless and can occur in the majority of people aged around 65. To give a diagnosis, an X-ray is taken of the patient's hand, and symmetric metacarpophalangeal joint spaces and interphalangeal joint spaces are analyzed [14]. Thus, the changes in the borders of finger joint surfaces observed on hand radiographs are a crucial point in medical diagnosis and provide important information for estimating therapy efficiency. However, they are difficult to detect in an X-ray picture when examined by a human expert, due to the number of joints. On the other hand, it is extremely important to diagnose pathological changes in the early stages of a disease, which means that differences of the order of 0.5 mm between the contours of pathologically changed bones and unaffected ones need to be identified. The possibility of performing such an analysis by a computer system is a key point for diagnosis support. Therefore, studies on the implementation of such systems are the topic of numerous publications [12,13,16] (see other references in [6]). This research is part of the extensive stream of studies on the application of artificial intelligence methods in medical image understanding [15].
Fig. 1. Healthy joint (a), bones with osteophytes (b, c) and joints with erosions (d, e) radiograph
This paper is a continuation of the studies described in [2,3,4,5,6,17,18] concerning automatic analysis of hand radiographs. In the previous papers, the preprocessing and joint location algorithms were presented. Initially, the applied approach turned out to be effective in about 90% of cases [6]; the algorithm was then improved in [18], and an efficiency of 97% was achieved. Based on those locations, an algorithm identifying the borders of the upper and lower joint surfaces was proposed [5]. A preliminary analysis of such borders for erosion detection is studied in [2,4]. In this paper, a syntactic method of pattern recognition is applied to the interpretation of hand radiographs in order to recognize erosions and osteophytes in the finger joints. Examples of a healthy joint radiograph and of joints with osteophytes and erosions are shown in Fig. 1(a), Fig. 1(b,c) and Fig. 1(d,e), respectively. Possible locations of the osteophytes and erosions are shown as bold lines in Fig. 2(a) and Fig. 2(b), respectively. It is shown that the classical Jakubowski transducer [8] does not distinguish contours of healthy bones from contours of affected bones. Therefore, modifications of the transducer are introduced. It is demonstrated that the modified transducer correctly recognizes the classes of bone shapes obtained from the medical classification: healthy bone class, osteophyte bone class and erosion bone class. The paper is organized in the following way. The shape description methodology is recalled in Section 2. In Section 3, the Jakubowski transducer is used for bone contour analysis and the necessary modifications are introduced.
Fig. 2. Contours with possible locations where osteophytes (a) and erosions (b) may occur marked by bold line
2 Shape Description Methodology
Let us recall the formalism presented in [7,8,9,10], where the basic unit of the analysed pattern is one of the sixteen primitives from the set PRIM, which are line segments or quarters of a circle (see Fig.3a). It should be mentioned that the bi-indexation enumerating the primitives plays a crucial role in the contour analysis. Let us also recall the definition of a contour $k = p_1 p_2 \ldots p_m$, where $p_1, p_2, \ldots, p_m$ are successive primitives of the contour $k$. The notation $p_i p_{i+1}$ denotes that $p_i$ is connected to $p_{i+1}$ such that $hd(p_i) = tl(p_{i+1})$, where $hd(p_k)$ and $tl(p_k)$ correspond to the head and tail of the primitive $p_k$ (see Fig.3b). The characterological description of a contour $k$ is the chain of successive primitive types, defined as $char(k) = s_{i_1 j_1} s_{i_2 j_2} \ldots s_{i_m j_m}$. Moreover, $Q_o$ is defined as the set of primitives from the $o$-th quarter, for $o = 1, 2, 3, 4$: $Q_o = \{s_{ij} : (j = o) \vee (i = 1 \wedge j = o \oplus 1)\}$, where $o \oplus 1 = 1$ if $o = 4$ and $o \oplus 1 = o + 1$ otherwise.
Fig. 3. Set PRIM (a) and construction of primitive (b)
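The definition of $Q_o$ can be made concrete with a short sketch. It assumes, for illustration only, that a primitive $s_{ij}$ is represented by the index pair `(i, j)` with both indices in 1..4 (sixteen primitives); the function names `oplus` and `quarter_set` are hypothetical and do not come from [7,8,9,10].

```python
def oplus(o, k=1):
    """Cyclic addition on quarter indices 1..4, i.e. o (+) k."""
    return (o - 1 + k) % 4 + 1


def quarter_set(o):
    """Q_o: all primitives with j == o, plus the primitive s_1j with j == o (+) 1."""
    return {(i, j)
            for i in range(1, 5)
            for j in range(1, 5)
            if j == o or (i == 1 and j == oplus(o))}


for o in range(1, 5):
    print(o, sorted(quarter_set(o)))
```

Under this representation each set has five elements, and every primitive with first index 1 belongs to two neighbouring sets, which is the reason these primitives need special care in the transducer.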
A contour $k$ with $char(k) = v$ such that $v \in Q_i^{+} \wedge (length(v) > 1 \vee (length(v) = 1 \wedge v \in Q_i \setminus (Q_{i \oplus 3} \cup Q_{i \oplus 1})))$ is said to be a contour from the singular quadrant ((i)-sinquad for short). In other words, the (i)-sinquad is a contour composed of primitives from the $i$-th quadrant. Consider contours $k'$, $k''$ such that $char(k') \in Q_i^{+}$, $char(k'') \in Q_j^{+}$, and $char(first(k'')) = b \notin Q_i$, and if $j = i \oplus 2$ then $b \in Q_j \setminus (Q_{j \oplus 3} \cup Q_{j \oplus 1})$. If $k = k' k''$, we say that $k$ creates the so-called (i,j)-biquad with $char(k) = char(k') \cdot char(k'')$. The first primitive of $k''$, i.e. $first(k'')$, is called a switch, encoded by the string $ij$ named the basic mark.

Furthermore, according to Definition 10 in [8], a transducer is a 5-tuple $T = (G, \Sigma, \Delta, \delta, G_0)$, where $G$ is a finite nonempty set of states, $\Sigma$ is a finite nonempty input alphabet, $\Delta$ is a finite nonempty output alphabet, $G_0 \subset G$ is a finite nonempty set of start states, and $\delta$ is a finite subset of $G \times \Sigma^{*} \times \Delta^{*} \times G$. Intuitively, $(q, u, v, q') \in \delta$ means that if the machine is in the state $q$ and the string $u \in \Sigma^{*}$ is given as input, then the machine changes to the state $q'$ and outputs $v \in \Delta^{*}$.
3 Bone Contour Analysis
The transducer $T_m = (\{q_1, q_2, q_3, q_4\}, S, \{1, 2, 3, 4\}, \delta, \{q_1, q_2, q_3, q_4\})$, where δ is given by the graph depicted in Fig.4, was proposed by Jakubowski in [8]. If $u$ causes the transition from the state $q_i$ to $q_j$, $i \neq j$, then $u$ designates the switch of an (i, j)-biquad, which simply means that there is a switch between the $i$-th and the $j$-th quarter. Therefore, for each analysed contour, the chain of biquads is taken as the result of the transitions.

Fig. 4. δ function of the original transducer from paper [8], Fig.14b

If the transducer with the original δ function is applied to bone contours, it usually cannot distinguish healthy bone contours from contours of bones with an osteophyte or erosion. As an example, let us consider the simplified contours presented in Fig.5. The contour in Fig.5(a) shows no pathological changes. The contour in Fig.5(b) is convex, which means that it contains an osteophyte, whereas the contour in Fig.5(c) is concave, which means that it contains an erosion. However, it can be easily verified that all three contours are represented by the same biquad description 32.21, despite the fact that they represent a healthy bone, a bone with an osteophyte and a bone with an erosion, respectively. Therefore, the transducer had to be modified to differentiate those three classes of bones. For this purpose, a function δ′ was created as a modification of the original δ function. The new function behaves differently for primitives placed at the border of two quarters ($s_{11}$, $s_{12}$, $s_{13}$ and $s_{14}$). To better understand the changes, let us assume that $k$ is a fragment of a contour with characterological description $char(k) = s_j s_{1o}$, where the first primitive was already classified by the transducer to the $j$-th quarter and the second primitive is placed at the border of two quarters. Then, for the function δ, the biquad value is described by the function:
$$biquads(\delta) = \begin{cases} \text{none}, & o = j \vee o = j \oplus 1 \\ j(j \oplus 1), & o = j \oplus 2 \\ j(j \oplus 3), & o = j \oplus 3 \end{cases}$$

On the other hand, the modified function δ′ works differently in the last two cases:

$$biquads(\delta') = \begin{cases} \text{none}, & o = j \vee o = j \oplus 1 \\ j(j \oplus 2), & o = j \oplus 2 \\ j(j \oplus 2), & o = j \oplus 3 \end{cases}$$

It can be easily verified that the three contours represented by the same chain of biquads 32.21 under the δ function are represented by three different chains of biquads under the δ′ function (see Tab.1). The changes in the transducer were introduced because, in a healthy bone, the angles between successive primitives are greater than 90°, as can be observed in Fig.1a. If an angle is equal to or smaller than 90°, the bone contour contains a pathological change: an osteophyte if the acute or right angle is inside the bone, and an erosion if it is outside the bone (see Fig.5b and Fig.5c, respectively). The original δ function does not take this regularity into account and in many cases does not differentiate contours from different bone classes.

Fig. 5. Examples of a healthy contour (a), contours with osteophytes (b and d) and contours with erosions (c and e). The number near a primitive indicates the quarter to which the primitive belongs. If the first index of a primitive equals 1, there are two numbers, as such a primitive is placed between two quarters.
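The two case definitions of biquads(δ) and biquads(δ′) can be written as small functions. This is only a sketch of the case tables for a border primitive $s_{1o}$ that follows a primitive from the $j$-th quarter, not of the full transducers in Fig.4 and Fig.6; the function names and the encoding of a biquad as a two-character string are assumptions made for illustration.

```python
def oplus(o, k):
    """Cyclic addition on quarter indices 1..4, i.e. o (+) k."""
    return (o - 1 + k) % 4 + 1


def biquad_original(j, o):
    """Case table for the original delta function."""
    if o in (j, oplus(j, 1)):
        return None                       # no biquad is produced
    if o == oplus(j, 2):
        return f"{j}{oplus(j, 1)}"
    return f"{j}{oplus(j, 3)}"            # remaining case: o == j (+) 3


def biquad_modified(j, o):
    """Case table for the modified delta' function."""
    if o in (j, oplus(j, 1)):
        return None
    return f"{j}{oplus(j, 2)}"            # same result for o == j (+) 2 and o == j (+) 3


for o in range(1, 5):
    print(o, biquad_original(3, o), biquad_modified(3, o))
```

For $j = 3$ the two tables agree when no biquad is produced and differ in both remaining cases, which is exactly where the modification acts.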
Fig. 6. δ′ function of the transducer, created based on the original δ function from Fig.4

Table 1. δ biquad description, δ′ biquad description and medical assignment of contours from Fig.5

| Figure | δ biquad description | δ′ biquad description | Osteophyte or erosion |
|---|---|---|---|
| Fig.5a | 32.21 | 32.21 | none |
| Fig.5b | 32.21 | 31 | osteophyte |
| Fig.5c | 32.21 | 31.13.31 | erosion |
| Fig.5d | 31 | 31 | osteophyte |
| Fig.5e | 32.23.31 | 31.13.31 | erosion |
Moreover, it has to be stressed that the introduced δ′ function not only differentiates contours with the same δ biquad description, but also unifies some contours with different δ biquad descriptions. This unification can be observed for Fig.5b and Fig.5d, as well as for Fig.5c and Fig.5e. In both pairs, the biquad descriptions generated by the δ function differ between the two contours, but the description generated by δ′ is identical (see Tab.1). This is an advantage, because the δ′ function generates the same biquad description for contours with the same pathological change: either both have an osteophyte, or both have an erosion. Naturally, the examples in Fig.5 are quite simple, because they contain only 45°, 90° and 135° angles. In reality, the set of angles between parts of contours will be much bigger. Therefore, some kind of fuzzy representation of the angles might help improve the robustness and portability of the proposed methodology.
4 Concluding Remarks
As has been presented, the transducer introduced by Jakubowski and modified in this paper can be used to distinguish the contour of a healthy bone from the contours of bones with an erosion or osteophyte. Such differentiation is required to build an intelligent system for the diagnosis of joint diseases. In such a system, the most important part will be the analysis of the highest-level features, such as the presence and location of osteophytes, the presence and location of erosions, and joint space narrowing. The first two features can be described using a special algebraic approach described in [1], which will be the topic of the next publication. To recapitulate, the final system will be hierarchical, with the following levels (from the lowest to the highest): preprocessing [6,17,18], contour shape description and joint space width analysis [2,4], an algebraic language for coding the highest-level features in a syntactic way, and an expert system to diagnose joint diseases. It has to be noted that the system will be used as an aid in the radiological diagnosis of hand radiographs.
References

1. Bielecka, M.: Syntactic segmentation of graph function type curves. Machine Graphics and Vision 16, 39–55 (2007)
2. Bielecka, M., Bielecki, A., Korkosz, M., Skomorowski, M., Wojciechowski, W., Zieliński, B.: Application of shape description methodology to hand radiographs interpretation. In: Bolc, L., Tadeusiewicz, R., Chmielewski, L.J., Wojciechowski, K. (eds.) ICCVG 2010. LNCS, vol. 6374, pp. 11–18. Springer, Heidelberg (2010)
3. Bielecka, M., Skomorowski, M., Bielecki, A.: Fuzzy syntactic approach to pattern recognition and scene analysis. Intelligent Control Systems and Optimization, Robotics and Automation 1, 29–35 (2007)
4. Bielecka, M., Skomorowski, M., Zieliński, B.: A fuzzy shape descriptor and inference by fuzzy relaxation with application to description of bones contours at hand radiographs. In: Kolehmainen, M., Toivanen, P., Beliczynski, B. (eds.) ICANNGA 2009. LNCS, vol. 5495, pp. 469–478. Springer, Heidelberg (2009)
5. Bielecki, A., Korkosz, M., Wojciechowski, W., Zieliński, B.: Identifying the borders of the upper and lower metacarpophalangeal joint surfaces on hand radiographs. In: Rutkowski, L., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2010. LNCS, vol. 6113, pp. 589–596. Springer, Heidelberg (2010)
6. Bielecki, A., Korkosz, M., Zieliński, B.: Hand radiographs preprocessing, image representation in the finger regions and joint space width measurements for image interpretation. Pattern Recognition 41(12), 3786–3798 (2008)
7. Jakubowski, R.: Syntactic characterization of machine parts shapes. Cybernetics and Systems 13, 1–24 (1982)
8. Jakubowski, R.: Extraction of shape features for syntactic recognition of mechanical parts. IEEE Transactions on Systems, Man and Cybernetics 15(5), 642–651 (1985)
9. Jakubowski, R.: A structural representation of shape and its features. Information Sciences 39, 129–151 (1986)
10. Jakubowski, R., Bielecki, A., Chmielnicki, W.: Data structure for storing drawing being then analysed for purposes of CAD. Archiwa Informatyki Teoretycznej i Stosowanej 1, 51–70 (1993)
11. Liang, M., Esdaile, J., Klippel, J., Dieppe, P.: Impact and Cost Effectiveness of Rheumatologic Care in Rheumatology. Mosby International, London (1998)
12. Ogiela, M.R., Tadeusiewicz, R., Ogiela, L.: Image languages in intelligent radiological palm diagnostics. Pattern Recognition 39, 2157–2165 (2006)
13. Sharp, J., Gardner, J., Bennett, E.: Computer-based methods for measuring joint space and estimating erosion volume in the finger and wrist joints of patients with rheumatoid arthritis. Arthritis & Rheumatism 43(6), 1378–1386 (2000)
14. Szczeklik, A., Zimmermann-Górska, I.: Injury Disease (in Polish). Medycyna Praktyczna, Warszawa (2006)
15. Tadeusiewicz, R., Ogiela, M.R.: Medical image understanding technology. Studies in fuzziness and soft computing. Springer, Heidelberg (2004)
16. Tadeusiewicz, R., Ogiela, M.R.: Picture languages in automatic radiological palm interpretation. International Journal of Applied Mathematics and Computer Science 15(2), 305–312 (2005)
17. Zieliński, B.: A fully-automated algorithm dedicated to computing metacarpophalangeal and interphalangeal joint cavity widths. Schedae Informaticae 16, 47–67 (2007)
18. Zieliński, B.: Hand radiograph analysis and joint space location improvement for image interpretation. Schedae Informaticae 17/18, 45–61 (2009)
Using CMAC for Mobile Robot Motion Control

Kristóf Gáti and Gábor Horváth

Budapest University of Technology and Economics, Department of Measurement and Information Systems
Magyar tudósok krt. 2. Budapest, Hungary H-1117
{gatikr,horvath}@mit.bme.hu
http://www.mit.bme.hu
Abstract. The Cerebellar Model Articulation Controller (CMAC) has some attractive features: fast learning capability and the possibility of efficient digital hardware implementation. These features make it a good choice for different control applications, like the one presented in this paper. The problem is to navigate a mobile robot (e.g. a car) from an initial state to a fixed goal state. The approach applied is backpropagation through time (BPTT). Besides its attractive features, the CMAC has a serious drawback: its memory complexity may be very large. To reduce the memory requirement, different variants of the CMAC were developed. In this paper several variants are used for solving the navigation problem to see whether a network with reduced memory size can solve the problem efficiently. Only those solutions are described in detail that solve the problem at an acceptable level. All of these CMAC variants require higher-order basis functions, as BPTT requires a continuous input-output mapping of the applied neural network.

Keywords: CMAC, recurrent neural network, control, BPTT.
1 Introduction
The Cerebellar Model Articulation Controller is a special neural network architecture originally proposed by James S. Albus [1]. The network has some attractive features, such as fast convergence, local approximation capability and the possibility of efficient digital hardware implementation. Because of these features the CMAC is often used in control applications [2], among other areas like image and signal processing, pattern recognition and modeling. This paper deals with a navigation problem. It presents a solution for mobile robot motion control implemented with a CMAC network. This is a highly nonlinear problem, which is hard to solve with classical control methods. There are many articles about this problem, e.g. [11]. The question is whether the advantageous properties of the CMAC can be utilized in this complex navigation problem. To answer this question it should also be noted that, despite its attractive features, the CMAC has some drawbacks. The most serious ones are that its memory complexity may be huge and that its function approximation capability may be inferior to that of an MLP.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 156–166, 2011. © Springer-Verlag Berlin Heidelberg 2011
Many solutions have been suggested for both problems. Hash-coding [1],[3],[5], kernel CMAC [4],[5], fuzzy CMAC [6] and SOP-CMAC [8] are some ways of reducing memory complexity. Weight-smoothing [4] and higher-order CMACs [9],[10] have been proposed for improving function approximation capability.

The paper is organized as follows. In Section 2 the basic principle of BPTT is summarized, in Section 3 the classical CMAC is presented, in Section 4 the extensions and variants of the CMAC are presented, while in Section 5 the partial derivatives are determined. Section 6 describes the system and the training in detail. The results may be found in Section 7, and conclusions are drawn in Section 8.
2 Backpropagation Through Time (BPTT)
BPTT is an approach proposed for training recurrent neural networks [12]. As a recurrent net is a dynamic network, it needs special training algorithms that take the temporal behaviour of the network into consideration. The basic idea of BPTT is that the recurrent network is unfolded in time, as can be seen in Fig. 1 [12], resulting in a many-stage static network, which can basically be trained using the classical backpropagation algorithm. As the operation of the recurrent network is considered in discrete time steps, the number of stages equals the number of time steps required for the operation of the network, i.e. to determine the number of stages of the static network, the time window of the operation of the recurrent network must be fixed. One constraint must be noted. The unfolded static network has more weights than the original one, because each single weight of the original network appears as a separate copy in every stage. These copies must be modified simultaneously and by the same amount, as they represent the same physical weight in different time steps.
Fig. 1. Simple network with feedback a.) and its unfolded equivalent b.)
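The weight-sharing constraint can be illustrated with a toy example. The sketch below is not the network of this paper: it assumes a single recurrent unit $h_{t+1} = \tanh(w h_t + x_t)$ with one shared weight $w$ and a squared error on the last state, and all function names are illustrative. The gradient with respect to $w$ is obtained by summing the contributions of every unfolded copy of the weight.

```python
import math


def forward(w, h0, inputs):
    """Unfold the recurrent unit in time and return all states h_0..h_T."""
    hs = [h0]
    for x in inputs:
        hs.append(math.tanh(w * hs[-1] + x))
    return hs


def loss(w, h0, inputs, target):
    """Squared error on the final state only."""
    return 0.5 * (forward(w, h0, inputs)[-1] - target) ** 2


def bptt_gradient(w, h0, inputs, target):
    """Backpropagate through the unfolded stages, summing over the weight copies."""
    hs = forward(w, h0, inputs)
    delta = hs[-1] - target                      # dE/dh_T
    grad = 0.0
    for t in reversed(range(len(inputs))):
        dpre = delta * (1.0 - hs[t + 1] ** 2)    # through tanh of stage t
        grad += dpre * hs[t]                     # contribution of the copy of w in stage t
        delta = dpre * w                         # error passed to the previous stage
    return grad


print(bptt_gradient(0.7, 0.1, [0.5, -0.3, 0.8], 0.2))
```

Because all copies share one physical weight, a single update with the summed gradient changes every stage by the same amount, which is the constraint described above.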
3 Classical CMAC
CMAC is a basis function network where finite-support basis functions are used. The basis functions are applied in the input space in predefined positions, and the supports of the basis functions are fixed-size closed intervals or, in multidimensional cases, fixed-size hypercubes. The classical CMAC applies rectangular basis functions that take a constant value over the hypercube and zero elsewhere. The hypercube is often called the receptive field of the basis function.

The network has two layers. The first layer performs a fixed nonlinear mapping, which implements the basis functions. The network output is calculated in the second layer as a weighted sum of the basis function outputs. Only the weights are trainable in the network.

The fixed first layer creates a binary vector called the association vector, which consists of the outputs of the basis functions. If the input point $x \in \mathbb{R}^N$ is in the receptive field of a basis function, then the corresponding element of the association vector is 1, otherwise it is 0. The width of the receptive field is finite and is controlled by the generalization parameter of the CMAC, denoted by $C$. The basis functions are arranged in overlays. An overlay is a set of basis functions that covers the full input space without any gap or overlap. Hence the number of overlays equals the number of activated basis functions. In the case of an $N$-dimensional problem this is $C^N$. The number of required basis functions is $(R + C - 1)^N$, where $R$ is the size of the input space. This number can be enormous in a real-world application: for example, if $R = 1024$ and $N = 10$, then the required number of basis functions is $\sim 2^{100}$, which cannot be implemented. To reduce the number of basis functions, Albus proposed using only $C$ overlays; however, even with this reduction the network may need an extremely large weight memory that is hard or even impossible to implement [1].

The second layer of the CMAC calculates the output $y \in \mathbb{R}$ of the network as the scalar product of the association vector $a$ and the weight vector $w$:

$$y(x) = a(x)^T w = \sum_{i:\, a_i = 1} w_i \qquad (1)$$
Because the basis functions are binary, the product can be replaced by the sum of the weights corresponding to the activated basis functions. The weights can be trained using the LMS rule in Eq. 2:

$$\Delta w_i = \mu (y_d - y), \quad i : a_i = 1 \qquad (2)$$

where $y_d$ is the desired output of a training data point and $\mu$ is the learning rate.
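The one-dimensional case of Eq. 1 and Eq. 2 can be sketched in a few lines. The sketch assumes integer inputs in 0..R−1 and indexes the R + C − 1 binary basis functions so that the input x activates the C consecutive weights x, ..., x + C − 1; the class name `CMAC1D` and its parameters are illustrative and not taken from the paper.

```python
class CMAC1D:
    """Minimal 1-D binary CMAC with R input levels and generalization parameter C."""

    def __init__(self, R, C, mu):
        self.C = C
        self.mu = mu
        self.w = [0.0] * (R + C - 1)       # one weight per basis function

    def active(self, x):
        # indices i with a_i = 1 for the integer input x
        return range(x, x + self.C)

    def output(self, x):
        # Eq. 1: sum of the weights of the activated basis functions
        return sum(self.w[i] for i in self.active(x))

    def train(self, x, y_d):
        # Eq. 2: the same correction is added to every active weight
        delta = self.mu * (y_d - self.output(x))
        for i in self.active(x):
            self.w[i] += delta


net = CMAC1D(R=16, C=4, mu=0.1)
for _ in range(50):
    net.train(5, 2.0)
print(net.output(5), net.output(6), net.output(9))
```

Each update changes the output at the trained point by $\mu C (y_d - y)$, so in this sketch $\mu$ has to stay below $2/C$ for the repeated update to converge. The neighbouring input 6 shares three of its four weights with input 5 and therefore generalizes partly, while input 9 shares none.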
4 Variants of the CMAC

4.1 Higher-Order CMAC
For BPTT training the binary (rectangular) basis functions are not adequate, because BPTT training needs the derivative of the basis functions. Lane et al. proposed the CMAC with B-spline basis functions in [9]. B-splines are especially well suited for the CMAC with finite-support basis functions, as they are non-zero only in a finite, closed interval. Further advantages are the improved performance and the possibility of learning continuous functions. The main disadvantage is the loss of the multiplication-free structure, as the association vector is no longer binary. Other types of basis functions, for example Gaussian ones, can also be used, see [10].

4.2 Kernel CMAC
CMAC can be interpreted as a kernel machine [4], where instead of using the basis functions directly, we use so-called kernel functions that are easily constructed from the basis functions. In a Kernel CMAC (KCMAC) the memory complexity is upper bounded by the number of training points, independently of the dimension of the input space and the number of basis functions [4],[5]. If $M$ is the number of basis functions and $P$ is the number of training samples, then the input-output mapping of a basis function network in the basis-function representation is:

$$y(x) = \sum_{j=1}^{M} w_j \varphi_j(x) = w^T \varphi(x) \qquad (3)$$

where $\varphi(x) = [\varphi_1(x), \varphi_2(x), \ldots, \varphi_M(x)]^T$ is the vector formed from the outputs of the basis functions for the input $x$ and $w$ is the weight vector. The same mapping can be described using the kernel function representation as:

$$y(x) = \sum_{k=1}^{P} \alpha_k K(x, x(k)) \qquad (4)$$

This is also a weighted sum of nonlinear functions, where the functions $K(x, x(k)) = \varphi^T(x) \varphi(x(k))$, $k = 1, \ldots, P$, are called kernel functions, defined by scalar products, and the coefficients $\alpha_k$ serve as the weight values. In the kernel CMAC the kernel functions are defined as

$$K(x, x(k)) = a^T(x) \cdot a(x(k)) \qquad (5)$$

where $a(x)$ is the association vector for the discrete input $x$, and the response of a CMAC for a given $x$ can be written as:

$$y(x) = a^T(x) A^T (A A^T)^{-1} y_d = a^T(x) A^T \alpha = k^T(x) \alpha \qquad (6)$$

where

$$\alpha = (A A^T)^{-1} y_d \qquad (7)$$

Here $A$ is a $P \times M$ matrix constructed from the association vectors of the training points, $A = [a_1(x), \ldots, a_P(x)]^T$. In the kernel representation the components of $\alpha$ are considered as the weight values, so instead of $M$ weights only $P$ weights are used. As $P \ll M$ in multidimensional cases, this is a great reduction of memory complexity.
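A minimal sketch of Eq. 4, Eq. 5 and Eq. 7 for a one-dimensional binary CMAC is given below. It assumes integer inputs in 0..R−1, with input x activating the C consecutive basis functions x, ..., x + C − 1, and it solves the linear system with a small hand-written Gaussian elimination instead of forming the inverse explicitly. All function names are illustrative; the experiments in this paper use B-spline kernels, not the binary ones shown here.

```python
def association_vector(x, R, C):
    """Binary association vector of a 1-D CMAC for an integer x in 0..R-1."""
    a = [0.0] * (R + C - 1)
    for i in range(x, x + C):
        a[i] = 1.0
    return a


def kernel(x1, x2, R, C):
    # Eq. 5: scalar product of two association vectors
    a1, a2 = association_vector(x1, R, C), association_vector(x2, R, C)
    return sum(u * v for u, v in zip(a1, a2))


def solve(matrix, rhs):
    """Solve matrix * z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z


def fit(xs, y_d, R, C):
    # Eq. 7: alpha = (A A^T)^{-1} y_d, where (A A^T)[k][l] = K(x(k), x(l))
    gram = [[kernel(xk, xl, R, C) for xl in xs] for xk in xs]
    return solve(gram, y_d)


def predict(x, xs, alpha, R, C):
    # Eq. 4: weighted sum of kernel values, one weight per training point
    return sum(a * kernel(x, xk, R, C) for a, xk in zip(alpha, xs))


xs, y_d = [0, 3, 5, 9], [1.0, 2.0, 0.5, -1.0]
alpha = fit(xs, y_d, R=12, C=4)
print([predict(x, xs, alpha, 12, 4) for x in xs])
```

Only the four coefficients in `alpha` are stored, although the underlying CMAC has R + C − 1 = 15 basis functions; this is the memory reduction from $M$ to $P$ weights described above.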
4.3 Fuzzy CMAC
Fuzzy CMAC (FCMAC) is an extension of the CMAC very similar to the KCMAC; nevertheless, the mathematical foundations of the two versions are completely different. From the point of view of the FCMAC, the creation of the association vector can be considered as the calculation of the strength of the IF part of a fuzzy rule, and the calculation of the output can be interpreted as a COG (Center of Gravity) defuzzification method. The first difference of the FCMAC from the KCMAC is the selection of the basis function centers and of the number of basis functions, for which a great diversity of methods exists [7]. The placement of the basis functions corresponds to the formulation of the condition parts of the rules. The second difference of the FCMAC from the KCMAC is the calculation of the output. While the KCMAC uses a weighted sum, see Eq. 4, the FCMAC calculates a weighted average over the basis functions:

$$y = \frac{\sum_{j=1}^{K} w_j \cdot \varphi_j(x)}{\sum_{j=1}^{K} \varphi_j(x)} \qquad (8)$$

where $\varphi_j(x)$ is the value of the $j$-th basis function at the input point $x$, which may be interpreted as the firing strength of the $j$-th rule. The $j$-th weight is the value of the consequent part of the $j$-th rule.
5 Calculating the Partial Derivatives
Because of the BPTT training, the derivatives of the network output with respect to its inputs are required. Basically this means the following:

$$\frac{\partial y}{\partial x_i} = \frac{\partial (w^T a(x))}{\partial x_i} = w^T \cdot \frac{\partial a(x)}{\partial x_i} \qquad (9)$$

That means that to obtain the derivative of the output, the derivatives of the basis functions must be calculated. For this reason higher-order CMACs must be used, because the binary basis function is not differentiable. For B-spline basis functions the derivative may be calculated by Eq. 10 [9]:

$$\frac{d B_i^n(t)}{dt} = \frac{n}{t_{i+n} - t_i} \cdot B_i^{n-1}(t) - \frac{n}{t_{i+n+1} - t_{i+1}} \cdot B_{i+1}^{n-1}(t) \qquad (10)$$

where $B_i^n$ is the $i$-th B-spline of order $n$. If the Fuzzy CMAC is used, the calculation of the derivatives is different because of the different output calculation scheme, see Eq. 11:

$$\frac{\partial y}{\partial x_k} = \frac{\sum_{i=1}^{K} w_i \frac{\partial \varphi_i(x)}{\partial x_k} \cdot \sum_{j=1}^{K} \varphi_j(x) - \sum_{i=1}^{K} w_i \varphi_i(x) \cdot \sum_{j=1}^{K} \frac{\partial \varphi_j(x)}{\partial x_k}}{\left( \sum_{j=1}^{K} \varphi_j(x) \right)^2} \qquad (11)$$
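Eq. 10 and Eq. 11 can be checked numerically in one dimension. The sketch below assumes the Cox-de Boor recursion for the B-splines (with $n$ read as the polynomial degree), uniform knots, and an FCMAC whose $j$-th basis function is the $j$-th B-spline; these choices and all function names are assumptions for illustration, not the configuration used in the experiments.

```python
def bspline(i, n, t, knots):
    """Cox-de Boor recursion for the i-th B-spline of degree n."""
    if n == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = (t - knots[i]) / (knots[i + n] - knots[i]) * bspline(i, n - 1, t, knots)
    right = ((knots[i + n + 1] - t) / (knots[i + n + 1] - knots[i + 1])
             * bspline(i + 1, n - 1, t, knots))
    return left + right


def bspline_derivative(i, n, t, knots):
    """Eq. 10: derivative expressed with two B-splines of degree n - 1."""
    first = n / (knots[i + n] - knots[i]) * bspline(i, n - 1, t, knots)
    second = n / (knots[i + n + 1] - knots[i + 1]) * bspline(i + 1, n - 1, t, knots)
    return first - second


def fcmac_output(x, weights, n, knots):
    """Eq. 8: weighted average of the weights (x must lie inside some support)."""
    phi = [bspline(j, n, x, knots) for j in range(len(weights))]
    return sum(w * p for w, p in zip(weights, phi)) / sum(phi)


def fcmac_derivative(x, weights, n, knots):
    """Eq. 11: quotient rule applied to Eq. 8."""
    phi = [bspline(j, n, x, knots) for j in range(len(weights))]
    dphi = [bspline_derivative(j, n, x, knots) for j in range(len(weights))]
    num = sum(w * p for w, p in zip(weights, phi))
    dnum = sum(w * d for w, d in zip(weights, dphi))
    den, dden = sum(phi), sum(dphi)
    return (dnum * den - num * dden) / den ** 2


knots = [0.0, 1.0, 2.0, 3.0, 4.0]   # uniform knots, enough for two quadratic splines
weights = [1.0, 3.0]
h = 1e-6
numeric = (fcmac_output(1.2 + h, weights, 2, knots)
           - fcmac_output(1.2 - h, weights, 2, knots)) / (2 * h)
print(fcmac_derivative(1.2, weights, 2, knots), numeric)
```

The analytic value from Eq. 11 and the central finite difference printed at the end should agree closely, which gives a simple sanity check before the derivatives are used in BPTT.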
6 The Experimental System
The setup of the system can be seen in Fig. 2. The CMAC used to control the mobile robot receives the position of the robot $(x, y)$ and the angle between its direction and the $x$ axis ($\theta$). In fact, the network uses the sine and cosine of the angle, because the angle itself would cause a discontinuous function, which cannot be learned correctly by neural networks. The model of the robot is given by Eq. 12, where the steering wheel angle is generated by the CMAC.

$$\begin{aligned} R &= \mathrm{sinc}(\gamma_i / 2) \cdot v \cdot t, \quad \arg(R) = \theta_i + \gamma_i / 2 \\ x_{i+1} &= x_i + R \cdot \cos(\arg(R)), \quad y_{i+1} = y_i + R \cdot \sin(\arg(R)) \\ \theta_{i+1} &= \theta_i + \gamma_i \end{aligned} \qquad (12)$$

where $R$ is the shift vector, $\gamma_i$ is the angle of the steering wheel, $v$ is the speed of the car, and $t$ is the time delay between modifications of the steering wheel angle.
Fig. 2. The set-up of the system
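One step of the robot model in Eq. 12 can be sketched as follows. The sketch assumes that sinc denotes the unnormalized function $\sin(u)/u$, which makes $R$ the chord length of a circular arc of length $v \cdot t$; this reading is an interpretation, and the function names are illustrative.

```python
import math


def sinc(u):
    """Unnormalized sinc, sin(u)/u, with the limit value 1 at u = 0 (assumed convention)."""
    return 1.0 if u == 0.0 else math.sin(u) / u


def step(x, y, theta, gamma, v, t):
    """Advance the state (x, y, theta) by one step of Eq. 12 with steering angle gamma."""
    r = sinc(gamma / 2.0) * v * t        # length of the shift
    direction = theta + gamma / 2.0      # direction of the shift, arg(R)
    return (x + r * math.cos(direction),
            y + r * math.sin(direction),
            theta + gamma)


# Driving straight, then a quarter turn along an arc of length pi/2
print(step(0.0, 0.0, 0.0, 0.0, 1.0, 2.0))
print(step(0.0, 0.0, 0.0, math.pi / 2, 1.0, math.pi / 2))
```

With zero steering angle the robot moves straight ahead by $v \cdot t$. Under the assumed sinc convention, a steering angle of $\pi/2$ with $v \cdot t = \pi/2$ corresponds to a quarter of a unit circle, so the robot ends near (1, 1) with heading $\pi/2$.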
The generation of a given path runs as long as the error between the desired state and the actual state decreases. The case where the destination is behind the initial state had to be handled separately, because in this situation the first step increases the error, so without further instructions these routes would not be learned. Therefore a minimal number of steps was prescribed as a constraint for these cases. If training is started from many different initial states, needlessly long paths can arise because of the local approximation capability of the CMAC. For this reason another CMAC was added to the system, which controls the maximum number of steps to be taken by the robot. See Section 7.1 for the generation of the training set. The input space was divided into two regions, as can be seen in Fig. 3. The arrow specifies the destination state for the mobile robot. The creation of the No Turn Region was necessary because the paths starting from this region could not be learned by the CMAC. The goal for paths starting from this region was to reach the dashed line, see Fig. 3.

Fig. 3. The division of the input space into regions

After the robot has stopped, the weights must be updated during training. The training used backpropagation through time (BPTT). The weight updating equation for the CMAC in the $i$-th step is given in Eq. 13.

$$\Delta w_i = \epsilon_{i+1}^T v_i \cdot a(x_i, y_i, \theta_i) \qquad (13)$$

Here $\epsilon_i = \left[ \frac{\partial E}{\partial x_i}, \frac{\partial E}{\partial y_i}, \frac{\partial E}{\partial \theta_i} \right]^T$ is the derivative of the error (denoted here by $E$) with respect to the states, and $v_i = \left[ \frac{\partial x_{i+1}}{\partial \gamma_i}, \frac{\partial y_{i+1}}{\partial \gamma_i}, \frac{\partial \theta_{i+1}}{\partial \gamma_i} \right]^T$ is the derivative of the states with respect to the output of the CMAC, i.e. the angle of the steering wheel. After the value of the weight modification has been calculated, the error must be backpropagated. This is done by Eq. 14.

$$\epsilon_i = S_i \epsilon_{i+1} + \epsilon_{i+1}^T v_i \cdot q_i \qquad (14)$$

Here the matrix $S_i$ contains the derivatives of the states with respect to the previous states, see Eq. 15, and $q_i = \left[ \frac{\partial \gamma_i}{\partial x_i}, \frac{\partial \gamma_i}{\partial y_i}, \frac{\partial \gamma_i}{\partial \theta_i} \right]^T$ is the derivative of the network output with respect to the actual state.

$$S_i = \begin{pmatrix} \frac{\partial x_{i+1}}{\partial x_i} & \frac{\partial y_{i+1}}{\partial x_i} & \frac{\partial \theta_{i+1}}{\partial x_i} \\ \frac{\partial x_{i+1}}{\partial y_i} & \frac{\partial y_{i+1}}{\partial y_i} & \frac{\partial \theta_{i+1}}{\partial y_i} \\ \frac{\partial x_{i+1}}{\partial \theta_i} & \frac{\partial y_{i+1}}{\partial \theta_i} & \frac{\partial \theta_{i+1}}{\partial \theta_i} \end{pmatrix} \qquad (15)$$
Weight updating is done only if the error is backpropagated. The weights are updated using Eq. 13 and Eq. 16.

$$\Delta w = \mu \cdot \sum_{i=1}^{s_n} \Delta w_i \qquad (16)$$
Here $s_n$ denotes the number of steps taken by the car, which equals the number of times the CMAC set the angle of the steering wheel. An example of the error during training can be seen in Fig. 4. The dashed, dotted, dash-dotted and solid lines are the errors for the $x$, $y$ and $\theta$ axes and their aggregate, respectively. On the right side of the figure the gray arrows show the direction of the weight modification and the black arrows show the path of the car. The destination state is at the point (100,100), and the desired angle is $\pi$.

Fig. 4. The path of the robot during training and the errors calculated during the backpropagation
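The backward recursion of Eq. 13 and Eq. 14 can be sketched for the robot model of Eq. 12. To keep the example short, the CMAC is replaced by a hypothetical linear controller over the features $(x, y, \cos\theta, \sin\theta, 1)$, the error is assumed to be half the squared distance between the final and the goal state, the product $v \cdot t$ is fixed to 1, and sinc is read as $\sin(u)/u$. None of these choices comes from the paper; only the structure of the recursion does.

```python
import math

V_T = 1.0  # assumed product of speed v and time step t


def sinc(u):
    return 1.0 if abs(u) < 1e-12 else math.sin(u) / u


def dsinc(u):
    # derivative of sin(u)/u; a series term is used near zero
    return -u / 3.0 if abs(u) < 1e-4 else (u * math.cos(u) - math.sin(u)) / (u * u)


def features(s):
    x, y, th = s
    return [x, y, math.cos(th), math.sin(th), 1.0]


def controller(w, s):
    # stand-in for the CMAC: steering angle as a linear function of the features
    return sum(wi * ai for wi, ai in zip(w, features(s)))


def step(s, gamma):
    x, y, th = s
    r, d = sinc(gamma / 2) * V_T, th + gamma / 2
    return (x + r * math.cos(d), y + r * math.sin(d), th + gamma)


def rollout(w, s0, n):
    states, gammas = [s0], []
    for _ in range(n):
        gammas.append(controller(w, states[-1]))
        states.append(step(states[-1], gammas[-1]))
    return states, gammas


def error(w, s0, goal, n):
    final = rollout(w, s0, n)[0][-1]
    return 0.5 * sum((f - g) ** 2 for f, g in zip(final, goal))


def bptt_gradient(w, s0, goal, n):
    states, gammas = rollout(w, s0, n)
    eps = [f - g for f, g in zip(states[-1], goal)]    # dE/d(final state)
    grad = [0.0] * len(w)
    for i in reversed(range(n)):
        th, g = states[i][2], gammas[i]
        r, dr, d = sinc(g / 2) * V_T, 0.5 * dsinc(g / 2) * V_T, th + g / 2
        v = [dr * math.cos(d) - 0.5 * r * math.sin(d),   # v_i of Eq. 13
             dr * math.sin(d) + 0.5 * r * math.cos(d),
             1.0]
        scale = sum(e * vi for e, vi in zip(eps, v))     # eps_{i+1}^T v_i
        for k, a in enumerate(features(states[i])):
            grad[k] += scale * a                         # Eq. 13, summed as in Eq. 16
        q = [w[0], w[1], -w[2] * math.sin(th) + w[3] * math.cos(th)]
        s_eps = [eps[0], eps[1],                         # S_i eps_{i+1}, Eq. 15
                 -r * math.sin(d) * eps[0] + r * math.cos(d) * eps[1] + eps[2]]
        eps = [se + scale * qi for se, qi in zip(s_eps, q)]   # Eq. 14
    return grad


w = [0.01, -0.02, 0.1, 0.05, 0.03]
print(bptt_gradient(w, (0.0, 0.0, 0.3), (3.0, 1.0, 0.5), 5))
```

The function returns the gradient of the assumed error with respect to the shared weights; a descent step would subtract $\mu$ times this vector, so the sign convention relative to Eq. 16 is an assumption of the sketch.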
7 Experimental Results
The size of the input space was 300 for all dimensions, where the input components were $x$, $y$, $\cos(\theta)$, $\sin(\theta)$.

7.1 Training of the Maximum Step Number
In this case a training sample consisted of an initial state as input and the number of steps of the corresponding trained path as the desired output. Every initial state was trained until the distance between the destination and the end state became smaller than 0.3 and the error of the angle became smaller than 5°. Because of the separation of the input space, see Fig. 3, two CMACs had to be constructed, and the training samples corresponding to the different regions were trained separately. An example from the No Turn Region is shown in Fig. 5, where the maximum number of steps is plotted over $x$ and $y$ for the fixed angle $3\pi/4$.
Fig. 5. The maximum number of steps if the angle is fixed
An FCMAC was trained with $C = 70$, using the center selection method from [6]. This method selects a training point as a new center if the maximum activation value of the existing centers is smaller than a prespecified threshold. A higher limit results in more basis functions but better performance. This limit was set to 90% of the maximum value of the basis functions, which is quite strict. 3000 training samples were generated for each region and, as could be expected, almost all training samples were selected as basis function centers: exactly 2818 were selected by the FCMAC for the No Turn Region.

7.2 Training the Driving Using a KCMAC
For the driving the input space was not separated as shown in Fig. 3. The positions of the initial states were generated in polar coordinates with radius $r \in [0, 150]$ and angle $\vartheta \in [0, 2\pi]$. The angle of the initial states was generated with $\theta \in [0, 2\pi]$. 550 initial states were generated. Every actual state of the robot was a training point candidate. The training points were selected from these candidates similarly to the previous section; the limit was set to 60% in this case. The network was trained for 60 epochs, and in every epoch the actual path was generated from each initial state and the error was backpropagated. The learning rate was set for every backpropagation based on the error between the destination state and the end state of the path. The resulting 3-element vector is compared with the vectors in Table 1. If all components are smaller than those in a row of Table 1, the learning rate of that row may be used for the network. The smallest of these options must be selected. For example, if the error is $(5.2, 2.7, \pi/30)$, then the KCMAC will use $10^{-3}$ and the FCMAC will use $10^{-4}$ as learning rates, respectively. The size of the input space was 300 and the generalization parameter was set to $C = 40$ for all dimensions. The kernel functions were 6th-order B-splines. After the training 46780 training points were selected as kernel function centers. This number could be decreased by increasing the generalization parameter or decreasing the limit of the training point selection. Some sample paths can be seen in Fig. 6a.

Table 1. The learning rates corresponding to the error of the last state of a given path

| Value of errors | KCMAC | FCMAC |
|---|---|---|
| Default | $10^{-2}$ | $10^{-2}$ |
| (30, 30, π/4) | $10^{-2}$ | $10^{-3}$ |
| (12, 12, π/18) | $10^{-3}$ | $10^{-4}$ |
| (3, 3, π/45) | $10^{-4}$ | $10^{-5}$ |
| (0.6, 0.6, π/180) | $10^{-6}$ | $10^{-7}$ |
Fig. 6. Paths generated from different initial states, controlled by (a) KCMAC and (b) FCMAC
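The learning rate selection rule based on Table 1 can be sketched as a simple lookup. The thresholds and rates below are transcribed from Table 1 as read from the source layout, the comparison by absolute value is an assumption, and the function name is hypothetical.

```python
import math

# (threshold vector, KCMAC rate, FCMAC rate), transcribed from Table 1
TABLE_1 = [
    ((30.0, 30.0, math.pi / 4), 1e-2, 1e-3),
    ((12.0, 12.0, math.pi / 18), 1e-3, 1e-4),
    ((3.0, 3.0, math.pi / 45), 1e-4, 1e-5),
    ((0.6, 0.6, math.pi / 180), 1e-6, 1e-7),
]
DEFAULT = (1e-2, 1e-2)


def select_learning_rates(error):
    """Return (KCMAC rate, FCMAC rate) for an error vector (e_x, e_y, e_theta)."""
    options = [DEFAULT]
    for thresholds, k_rate, f_rate in TABLE_1:
        # a row applies only if every error component is below its threshold
        if all(abs(e) < t for e, t in zip(error, thresholds)):
            options.append((k_rate, f_rate))
    # the smallest applicable rate is used for each network
    return (min(o[0] for o in options), min(o[1] for o in options))


print(select_learning_rates((5.2, 2.7, math.pi / 30)))
```

For the error vector $(5.2, 2.7, \pi/30)$ used as the example in Section 7.2, the first two rows of the table apply and the third does not, so the rule returns the rates $10^{-3}$ and $10^{-4}$ quoted in the text.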
7.3 Training the Driving Using a FCMAC
When the FCMAC was used, the general setting was $C = 70$ for all dimensions; the learning rates are given in Table 1. The size of the input space and the basis functions were the same as for the KCMAC. The robot was trained from 550 initial states for 50 epochs. Sample paths can be seen in Fig. 6b. 35672 training points were selected as basis function centers. The comparison of the methods used in this paper is presented in Table 2.
Table 2. The comparison of the different CMACs

| Type of CMAC | KCMAC | FCMAC |
|---|---|---|
| Complexity (number of weights) | 46780 | 35672 |
| Number of epochs | 60 | 50 |
| Initial states | 550 | 550 |
| Value of C | 40 | 70 |
| Type of basis/kernel functions | 6th order B-Spline | 6th order B-Spline |
8 Conclusions
In this paper different types of CMACs were used for robot motion control. Mainly two types were investigated, the kernel and the fuzzy CMAC. The solutions obtained with the hash-coded CMAC and the SOP-CMAC were not adequate; perhaps the parameters were not tuned correctly in these cases. However, in the case of the FCMAC and the KCMAC the results were quite satisfying, although the solutions can still be improved. The number of initial states and the value of the learning rate might have been chosen better, but the overall performance could not be improved much. Robot motion control is a highly nonlinear control problem, which makes it suitable for testing neural networks. The main advantages of the CMAC, fast convergence and local approximation, were exploited: only 50 to 60 epochs were needed for the training (see Table 2), which is very fast compared to an MLP, where thousands of epochs were required [11]. The disadvantages of the network, like the huge memory requirement, made it impossible to implement the task with a classical CMAC.

This work is connected to the scientific program of the "Development of quality-oriented and cooperative R+D+I strategy and functional model at BME" project. This project is supported by the New Hungary Development Plan (Project ID: TÁMOP-4.2.1/B-09/1/KMR-2010-0002).
References
1. Albus, J.S.: A New Approach to Manipulator Control: The Cerebellar Model Articulation Controller (CMAC). Transaction of the ASME, 220–227 (September 1975)
2. Kraft, L.G., Miller, W.T., Dietz, D.: Development and application of CMAC neural network-based control. In: White, D.A., Sofge, D.A. (eds.) Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pp. 215–232. Van Nostrand, New York (1992)
3. Wang, Z.-Q., Schiano, J.L., Ginsberg, M.: Hash Coding in CMAC Neural Networks. In: Proc. of the IEEE International Conference on Neural Networks, Washington, USA, vol. 3, pp. 1698–1703 (1996)
4. Horváth, G., Szabó, T.: Kernel CMAC with improved capability. IEEE Trans. Sys. Man Cybernet. B 37(1), 124–138 (2007)
5. Horváth, G., Gáti, K.: Kernel CMAC with Reduced Memory Complexity. In: Proceedings of the 19th International Conference on Artificial Neural Networks, Limassol, vol. I, pp. 698–707 (2009)
6. Nie, J., Linkens, D.A.: FCMAC: a fuzzified cerebellar model articulation controller with self-organizing capacity. Automatica 30(4), 655–664 (1994)
7. Mohajeri, K., Zakizadeh, M., Moaveni, B., Teshnehlab, M.: Fuzzy CMAC Structures. In: Proc. IEEE 2009 Int. Conf. on Fuzzy Systems (2009)
8. Lin, C.S., Li, C.K.: A low-dimensional-CMAC-based neural network. In: Proceedings of IEEE Conference on Systems, Man and Cybernetics, vol. 2, pp. 1297–1302 (1996)
9. Lane, S.H., Handelman, D.A., Gelfand, J.J.: Theory and development of higher-order CMAC neural networks. In: IEEE Contr. Syst. Mag., pp. 23–30 (April 1992)
10. Chiang, C.-T., Lin, C.-S.: CMAC with general basis functions. Neural Networks 9(7), 1199–1211 (1996)
11. Nguyen, D.H., Widrow, B.: Neural networks for self-learning control systems. IEEE Control Systems Magazine, 18–23 (April 1990)
12. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. In: Rumelhart, D.E., McClelland, J.L. (eds.) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, ch. 8. MIT Press, Cambridge (1986)
Optimizing the Robustness of Scale-Free Networks with Simulated Annealing
Pierre Buesser, Fabio Daolio, and Marco Tomassini
Faculty of Business and Economics, Department of Information Systems, University of Lausanne, CH-1015 Lausanne, Switzerland
{pierre.buesser,fabio.daolio,marco.tomassini}@unil.ch
Abstract. We study the robustness of Barabási-Albert scale-free networks with respect to intentional attacks on highly connected nodes. Using the simulated annealing optimization heuristic, we rewire the networks such that their robustness to network fragmentation is improved without changing either the degree distribution or the connectivity of single nodes. We show that simulated annealing improves on the results previously obtained with a simple hill-climbing procedure. We also introduce a local move operator in order to facilitate actual rewiring and show numerically that the results are almost equally good.
Keywords: simulated annealing, scale-free networks, robustness, optimization.

1 Introduction
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 167–176, 2011. © Springer-Verlag Berlin Heidelberg 2011

Networks are ubiquitous in society and in the last decade their structure and the dynamics of phenomena taking place on a network substrate have been intensively studied, thanks to the availability of large data sets [10]. One extremely important aspect of a network is its capability to withstand failures and fluctuations in the functionality of its nodes and links. This is intuitively clear for transportation, communication, and power networks, but it is also essential for social, economic, and biological ones. For example, the failure of a gene activation node in a genetic regulatory network may lead to a cascade of failures and, ultimately, to death or illness. We are witnessing how the failure of some parts of complex financial and economic networks may lead to economic crises, and knowledge of the structure of a criminal network, for instance, may give authorities better means to control and destroy it. Of course, failures may occur in many different ways and to a different degree, depending on the complexity of the system under examination. Therefore, there are potentially many ways in which the functionality of a network can be impaired partly or totally. However, it is certainly useful to consider simple models of failures first. This is precisely what R. Albert et al. did in 2000 [2]. They abstracted away all other possible complexities and details and only studied the influence of the network topology on the results of the attack. They considered two types of attacks on nodes, but attacks on links may be studied analogously. Attacks can be either random, i.e.
any node may be shut down with the same uniform probability, or nodes can be attacked as a function of their connectivity, i.e. their degree. As model networks they used Erdős-Rényi random graphs [4] and scale-free random graphs of the Barabási-Albert type [1]. While in the former all nodes have a degree close to the mean, as the degree is Poisson-distributed, in the latter the distribution is right-skewed: most nodes have a small degree but there is a significant number of nodes of high degree called hubs. It turns out that under random attacks degree-inhomogeneous scale-free graphs are much more robust than random ones, and one has to remove a significant fraction of nodes before the graph falls apart and fragments into separate components. However, this tolerance against random attacks or failures comes at a price, as scale-free graphs are much more vulnerable to the removal of nodes according to their degree. In other words, if the nodes of a scale-free graph are removed in decreasing order of degree, starting with the most connected ones, then the network falls apart very quickly because those highly connected nodes are the ones that hold the network together. On the other hand, in Erdős-Rényi random graphs degree fluctuation is very limited and thus targeted attacks are similar to random ones. The striking conclusions of the pioneering investigation of Albert et al. were followed by a number of other papers dealing with the issue. The vulnerability of several real-life networks was studied as well, including portions of the Internet and biological nets [5,9,15,7]. Starting from the above topological consideration, one strand of research deals with the following issue: if we were to design a network for some task, how would we go about making it robust against some kind of failures? This is an optimal design problem which is common in engineering.
On the other hand, we could also be given a network and be asked how to make it more robust without altering its mean degree or even its degree distribution. These constraints might arise from real-life conditions such as existing nodes and connections that cannot be changed. The present paper goes in this last direction. Schneider et al. [13,12] studied this problem using a simple hill-climbing procedure for the optimization. In the present investigation we extend that work in two ways: first, we use simulated annealing as an optimization heuristic, and second, in a variant of the optimization process, we restrict network changes to the neighborhoods of the concerned nodes, keeping the changes local as much as possible. Simulated annealing is a far better optimization heuristic than hill-climbing for highly multimodal search spaces, and local changes should be easier to do in real networks. The paper is organized as follows. In the next section we summarize the attack types, the network topologies used, and the robustness measures. This is followed by Sect. 3 in which hill-climbing and simulated annealing heuristics for the optimization of robustness are presented. In Sect. 4 we discuss our findings and compare them with previous results. Finally, we give our conclusions.
2 Model Networks and Their Tolerance to Damage
We focus the present study on targeted attacks on scale-free networks. Targeted attacks are more interesting since scale-free networks are fragile under this type
of intentional damage. Moreover, scale-free networks, or at least networks having a long-tailed degree distribution, have been found among many important networks in society and in biological systems [10,3], which means that they are also important in practice. We now describe the network construction, the attack protocol, and the robustness measure used.

Scale-Free Network Construction. To construct scale-free graphs, we use the Barabási-Albert model [1], which we briefly summarize. The model is a growing one: it starts with a small clique (a completely connected graph) of $N_0$ nodes and $M_0$ edges. At each successive time step, a new node is added such that its $m \le N_0$ edges link it to $m$ nodes already in the graph. When the nodes to which the new node connects are chosen, it is assumed that the probability $\pi$ that the new node will be connected to node $i$ depends on the degree $k_i$ of $i$, such that nodes that already have many links are more likely to be chosen than those that have few. This is called preferential attachment and is an effect that has been observed in real networks, for example in references to web pages or to fundamental scientific articles, among many others. Although other subtle effects may actually enter into the picture of network growth, the "rich get richer" metaphor is a simple and rather appropriate one for a first approximation. The probability $\pi(k_i)$ for a new node to create an edge to node $i$ of degree $k_i$ already in the graph is given by
\[ \pi(k_i) = \frac{k_i}{\sum_j k_j} , \]
where the sum is over all nodes already in the graph and serves to normalize the probability value. Iterating this process, at time step $t$ the graph will have $N_t = N_0 + t$ vertices and $M_t = M_0 + mt$ edges. Because of preferential attachment, the number of nodes with comparatively high degree should intuitively increase with increasing $t$.
And indeed, Barabási and Albert have shown that such a growing graph evolves into a stationary scale-free network with a power-law probability distribution of the vertex degree $p(k) \sim k^{-\gamma}$, with $\gamma \approx 3$.

Intentional Attack Protocol. The malicious attack protocol follows [7,13]. It works in this way: at each time step, network nodes are sorted in decreasing degree order and the highest-degree node is removed together with all its links. After removing that node, the vertex degrees are re-evaluated and the list is sorted again, since some positions might have changed as a consequence of the attack, and the process continues with the first node of the new list until the network consists only of isolated nodes.

Network Robustness Measures. There are several possibilities to measure the amount of damage inflicted on the network. Two widely used ones are network fragmentation and network efficiency. As nodes and their links are removed successively, one can keep track of the relative size of the largest connected component. At the beginning, for an initially connected graph, this
number is equal to one and it decreases as soon as the fragmentation process leads to separate connected components. Here we use the measure $R$ proposed by Schneider et al. [13]:
\[ R = \frac{1}{N-1} \sum_{Q=0}^{N} s(Q) , \]
where $s(Q)$ is the fraction of nodes in the largest connected cluster after removing $Q$ nodes and $N$ is the number of nodes in the network. Clearly, the larger $R$, the more robust the network with respect to fragmentation. Another frequently used measure is based on the efficiency of the communication between nodes $i$ and $j$ in a graph $G$ introduced by Latora and Marchiori, which is defined as $e_{ij} = 1/d_{ij}$, where $d_{ij}$ is the length of the shortest path between vertices $i$ and $j$ in $G$. If nodes $i$ and $j$ are unconnected, then $d_{ij} = \infty$, which gives $e_{ij} = 0$. The average global efficiency of $G$ is then defined as [9]:
\[ E(G) = \frac{1}{N(N-1)} \sum_{i \neq j} e_{ij} = \frac{1}{N(N-1)} \sum_{i \neq j} d_{ij}^{-1} . \]
Pairs of nodes belonging to different components have $e_{ij} = 0$ and thus do not contribute to the sum. Larger values of $E$ mean that the network remains functional from the point of view of communication between nodes. In the present work we shall study robustness as expressed by the $R$ value; the effects of attacks on $E$ are being studied but results are left for a future communication.
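The three ingredients of this section (Barabási-Albert growth, the recalculated highest-degree attack, and the measure R) can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the initial clique size N0 = m + 1, the tie-breaking rule among equal-degree nodes, and the normalisation of R (read off the formula as printed in the text) are assumptions made here for illustration.

```python
import random


def barabasi_albert(n, m, seed=None):
    """Grow a Barabasi-Albert graph; returns {node: set(neighbours)}."""
    rng = random.Random(seed)
    n0 = m + 1  # assumed size of the initial clique (satisfies m <= N0)
    adj = {i: set() for i in range(n)}
    # 'targets' lists every node once per incident edge, so a uniform draw
    # from it selects node i with probability k_i / sum_j k_j.
    targets = []
    for i in range(n0):
        for j in range(i + 1, n0):
            adj[i].add(j)
            adj[j].add(i)
            targets += [i, j]
    for new in range(n0, n):
        chosen = set()
        while len(chosen) < m:  # m distinct, already existing nodes
            chosen.add(rng.choice(targets))
        for old in chosen:
            adj[new].add(old)
            adj[old].add(new)
            targets += [new, old]
    return adj


def largest_component_size(adj, alive):
    """Number of nodes in the largest connected component among 'alive'."""
    seen, best = set(), 0
    for start in alive:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            v = stack.pop()
            size += 1
            for w in adj[v]:
                if w in alive and w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best


def robustness(adj):
    """R for the highest-degree attack with degrees recomputed at each step."""
    n = len(adj)
    adj = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    alive = set(adj)
    s = [largest_component_size(adj, alive) / n]  # s(0)
    while alive:
        # Current highest-degree node; ties go to the smallest label (assumed).
        victim = max(alive, key=lambda v: (len(adj[v]), -v))
        for w in adj[victim]:
            adj[w].discard(victim)
        adj[victim] = set()
        alive.discard(victim)
        s.append(largest_component_size(adj, alive) / n)  # s(Q), Q = 1..N
    return sum(s) / (n - 1)  # normalisation as printed in the text
```

Recomputing the largest component from scratch after every removal costs O(N(N + M)) per evaluation of R, which is adequate for the network sizes considered here but is the dominant cost inside any optimization loop.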
3 The Optimization Process
As we said in the introduction, we shall optimize scale-free networks with respect to the fragmentation caused by intentional attacks on the most connected nodes while keeping the original scale-free degree distribution and thus also the same average degree. The idea follows Schneider et al. [13] and consists in systematically rewiring pairs of edges such that changes that improve the $R$ factor are kept, whereas moves that worsen $R$ are discarded. In other words, the local move operator is a 2-exchange swap operator. The swap operator is depicted in Fig. 1 using two small graphs for the sake of illustration. This leads immediately to the local search procedure of Algorithm 1, which is called first-improvement local search. In Algorithm 1, $g$ and $g^*$ are graphs belonging to the set $S$ of the possible graphs with $N$ vertices and $M$ edges, whose total number, including isomorphic graphs, is $(N(N-1)/2)!/(N(N-1)/2 - M)!$ by simple counting arguments, $M$ being determined by the degree sequence $\{k_1, k_2, \ldots, k_N\}$. This algorithm gave rather good results by using multi-starting configurations and a lot of computing time [13] but cannot avoid being trapped in local optima. It seems very likely that the fitness landscape of the optimization problem
Fig. 1. Explanation of the edge swap operation. Two edges are chosen at random in the original graph (thick lines, left image) and they are swapped as indicated in the right image, giving rise to a new graph with the same degree distribution.
Algorithm 1. First-Improvement
  Build graph g∗ ∈ S
  repeat
    Compute R(g∗)
    Choose edges e_ij, e_kl uniformly at random and swap them: g ← g∗
    Compute R(g)
    if R(g) > R(g∗) then
      g∗ ← g
    end if
  until g∗ is a local optimum
should be a highly multimodal one and thus being trapped in inferior local optima is a common occurrence. In these conditions, a better heuristic is provided by simulated annealing. Simulated annealing works in the same manner as the hill-climbing procedure above, except that inferior configurations are not discarded with probability one; rather, they are accepted with some positive probability that depends on the cost function difference $\Delta(R) = |R(g) - R(g^*)|$ between the old and the new configuration, and on a parameter $T$ called "temperature" by analogy with classical physical systems obeying the Maxwell-Boltzmann statistics [8]. The higher the difference $\Delta(R)$ and the lower the temperature $T$, the more unlikely it is that the inferior configuration $g$ replaces $g^*$. The probabilistic acceptance criterion together with a temperature lowering schedule are the devices that allow the search to escape from local optima. At the beginning, when $T$ is high, this capability is maximal, whereas towards the end, when $T \to 0$, it becomes more difficult to jump out of a local optimum and the system tends to reach equilibrium. The process may be described by the pseudo-code of Algorithm 2.
Algorithm 2. Simulated Annealing
  Build graph g∗ ∈ S
  Choose an initial temperature T
  repeat
    Nsteps = 0
    repeat
      Nsteps = Nsteps + 1
      Compute R(g∗)
      Choose edges e_ij, e_kl uniformly at random and swap them: g ← g∗
      Compute R(g)
      if R(g) > R(g∗) then
        g∗ ← g
      else
        g∗ ← g, with probability exp(−Δ(R)/T)
      end if
    until Nsteps > Nmax or g∗ is a local optimum
    Lower T
  until T < Tmin
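The degree-preserving swap and the acceptance rule of Algorithm 2 can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: the objective (for instance R) is passed in as a function, the names and default parameters are assumptions, the best configuration seen is remembered (an addition not present in Algorithm 2), and the costly "local optimum" stopping test of the inner loop is omitted.

```python
import math
import random


def swap_edges(edges, a, b, rng):
    """Return a new edge list with edges a and b rewired, or None if invalid.

    Edges (i, j) and (k, l) become (i, l) and (k, j), so every node keeps
    its degree.
    """
    (i, j), (k, l) = edges[a], edges[b]
    if rng.random() < 0.5:  # choose one of the two possible rewirings
        k, l = l, k
    if len({i, j, k, l}) < 4:  # adjacent edges: swap undefined (self-loop)
        return None
    present = {frozenset(e) for e in edges}
    if frozenset((i, l)) in present or frozenset((k, j)) in present:
        return None  # the swap would create a multi-edge
    new = list(edges)
    new[a], new[b] = (i, l), (k, j)
    return new


def anneal(edges, objective, t0, alpha=0.8, t_min=1e-4, n_max=200, seed=None):
    """Maximise objective(edges) by simulated annealing over edge swaps."""
    rng = random.Random(seed)
    current, f_cur = list(edges), objective(edges)
    best, f_best = current, f_cur
    t = t0
    while t >= t_min:
        for _ in range(n_max):  # constant-temperature search cycle
            a, b = rng.sample(range(len(current)), 2)
            cand = swap_edges(current, a, b, rng)
            if cand is None:
                continue
            f_new = objective(cand)
            delta = abs(f_new - f_cur)
            # Improvements are always kept; worse moves with prob. exp(-delta/T).
            if f_new > f_cur or rng.random() < math.exp(-delta / t):
                current, f_cur = cand, f_new
                if f_cur > f_best:
                    best, f_best = current, f_cur
        t *= alpha  # geometric cooling
    return best, f_best
```

Setting the acceptance probability of worse moves to zero recovers the first-improvement search of Algorithm 1, which makes the two heuristics easy to compare on the same swap operator.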
4 Computational Results
We used the simulated annealing heuristic described in the previous section to optimize scale-free networks of increasing sizes from $N = 100$ up to $N = 300$. We used two kinds of swap operators: one that is identical to that used in [13], and a second one where the two edges to be swapped are not chosen anywhere in the graph, but rather locally. This works as follows: first one chooses an edge $e_{ij}$ uniformly at random among all the edges of graph $g$. Then a second edge $e_{kl}$ is selected among the edges belonging to a neighbor of $i$ or $j$, checking that no inconsistencies arise, i.e. $e_{kl}$ must not be adjacent to $e_{ij}$ and there must not already be an edge between the vertices that are candidates to be connected by the swap. Finally, $e_{ij}$ and $e_{kl}$ are swapped as in the global move.

In simulated annealing it is necessary to establish an initial temperature and a temperature schedule such that the temperature is progressively decreased during the run in order for the system to reach an equilibrium at each step of decreasing temperature. A suitable initial value for $T$ is found by performing a random walk in the search space: the largest $R$ difference in absolute value found is saved, and the initial value of $T$ is usually chosen such that the acceptance rate of all moves is at least 90% when the search starts. In the present study, we used the following rule: $T_0 = -|\Delta R|_{\max} / \ln(0.8) \approx 4.5 \times |\Delta R|_{\max}$, which is roughly half the value recommended in [14] but should permit a faster convergence. As for the temperature lowering schedule, there are several possibilities, but usually a linear or geometric schedule is used. In order to save computational time, the update rule we employed here for $T$ at the $i$-th constant-temperature search cycle (see Algorithm 2) is given by $T(i) = (0.8)^i \times T_0$. For these choices we followed the rules of thumb in Chap. 15 of [14].
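The temperature rule, the geometric schedule, and the local choice of the second edge can be sketched as below. The sketch is hedged: the function names are hypothetical, "edges belonging to a neighbor of i or j" is read here as edges incident to a neighbor of i or j, and both possible rewirings of a pair are listed (an edge whose two endpoints are both neighbors is therefore listed twice, which slightly biases a uniform draw from the list).

```python
import math


def initial_temperature(max_abs_delta, acceptance=0.8):
    """T0 such that the largest observed |Delta R| is accepted at the given rate."""
    return -max_abs_delta / math.log(acceptance)


def temperature(i, t0, alpha=0.8):
    """Geometric schedule T(i) = alpha**i * T0 for the i-th search cycle."""
    return (alpha ** i) * t0


def local_swap_candidates(adj, i, j):
    """Valid local rewirings of edge (i, j) in the graph {node: set(neighbours)}.

    Each item is (old_edge, new_edge_1, new_edge_2), where old_edge = (k, l)
    is incident to a neighbour k of i or j and shares no endpoint with (i, j).
    """
    out = []
    for k in sorted((adj[i] | adj[j]) - {i, j}):
        for l in sorted(adj[k] - {i, j}):
            # Rewiring (i,j),(k,l) -> (i,l),(k,j): both new edges must be absent.
            if l not in adj[i] and k not in adj[j]:
                out.append(((k, l), (i, l), (k, j)))
            # Rewiring (i,j),(k,l) -> (i,k),(l,j): same consistency check.
            if k not in adj[i] and l not in adj[j]:
                out.append(((k, l), (i, k), (l, j)))
    return out
```

With the default acceptance of 0.8, `initial_temperature` multiplies the largest observed difference by about 4.48, which matches the factor of roughly 4.5 quoted in the text; an acceptance of 0.9 would give a factor of about 9.5.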
Numerical results for network robustness $R$ in the optimized networks are reported in Fig. 2. The best results were obtained with simulated annealing and global edge swaps. The advantage with respect to the simple hill-climbing search is particularly clear for relatively small network sizes. As $N$ increases, the results become more similar, although simulated annealing maintains the advantage. It is also clear from the figure that local edge swaps yield somewhat inferior results. The critical parameters in our simulated annealing are the initial temperature $T_0$, the number of steps performed at a given temperature (Nsteps in Algorithm 2), and the geometric cooling factor $\alpha$ such that $T(i) = \alpha^i \times T_0$. For larger network sizes these parameters should be suitably tuned for best results. However, because of time and resource limitations, we initially kept the parameters that worked well for smaller sizes in order to limit the computational expense; recent further tests with a slower cooling procedure ($\alpha = 0.9$ instead of $\alpha = 0.8$) indeed already improved on the results obtained with global edge swaps (see Fig. 2).

[Figure: R (vertical axis, 0.1 to 0.3) versus n (horizontal axis, 100 to 300); curves: Hill Climbing, Initial graph, S.A. Global, S.A. Local, S.A. Global α = 0.9]
Fig. 2. Robustness R as a function of the network size. Simulated annealing results are averaged over 100 independent and randomly generated networks for each size; hill-climbing results are redrawn from [13] and C. Schneider, personal communication.
Figures 3 and 4 show the results of the optimization process on two particular instances of size 300. The figures have been produced with the igraph [6] package in the R statistical environment [11] and depict nodes according to their coreness. The k-core of a network is the maximal subgraph in which every vertex has degree at least k within that subgraph; the coreness of a vertex is the largest k for which it belongs to the k-core. The left images of Figs. 3 and 4 are the original Barabási-Albert networks, while the right images are the result of the optimization process with global and local edge swaps respectively. We observe that, whilst the original networks have a single core, the optimized ones, in spite of maintaining an identical degree distribution by construction, are more hierarchical. For the global
Fig. 3. Robustness optimization with global rewiring. Vertex color is related to vertex coreness; vertex size is proportional to the logarithm of vertex degree; edges connecting vertices having the same degree are highlighted. The final network has R = 0.227.
Fig. 4. Robustness optimization with local rewiring. Vertex color is related to vertex coreness; vertex size is proportional to the logarithm of vertex degree; edges connecting vertices having the same degree are highlighted. The final network has R = 0.215.
rewiring, the resulting graph has three cores and shows the typical “onion-like” structure first found by Schneider et al. [13]. It thus appears that this topology is highly conducive to good robustness properties as it has been found by two rather different optimization techniques. In the local case, Fig. 4, the trend is similar but restricting swaps to the locale of a link produces less radical changes in the
final topology, which could be advantageous in some real-life situations, although the robustness is slightly lower.
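The coreness used to color the vertices in Figs. 3 and 4 was obtained by the authors with igraph; purely as an illustration of the concept, it can also be computed by repeatedly peeling off a vertex of minimum remaining degree. The following Python sketch assumes a graph stored as {node: set(neighbours)} and is a simple quadratic-time version, not the routine used for the figures.

```python
def coreness(adj):
    """Core number of every vertex, by peeling minimum-degree vertices."""
    degree = {v: len(nb) for v, nb in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.get)  # vertex of minimum remaining degree
        k = max(k, degree[v])  # the core index never decreases while peeling
        core[v] = k
        remaining.discard(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core
```

The number of distinct values in the returned dictionary gives the number of nested cores, which is one way to quantify the "onion-like" layering discussed above.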
5 Summary and Conclusions
In this work we have performed an investigation of the robustness of Barabási-Albert scale-free networks under attacks targeted at highly connected nodes. Following previous work by Schneider et al. [13], we have optimized the networks against this kind of perturbation without changing the degree sequence. To that effect, the move operator is a 2-swap of non-adjacent edges. Two versions of the swap were used: one in which the swap is global as in [13], and a second one in which the swapped edges belong to the neighborhood of the concerned nodes. Since the problem is a computationally hard one, we have used the simulated annealing heuristic to perform the optimization. The results are promising. Although we could only study networks up to size N = 300 because of time limitations, simulated annealing gave better results than the straightforward hill-climbing optimization used in [13]. The gain is larger when global edge swaps are allowed but, even with local swaps, the results are encouraging. In the latter case the resulting networks require less actual rewiring to be produced from the original ones. Work is ongoing on larger networks, up to N = 500 at least, with a better adapted choice of the simulated annealing parameters. As future work, we believe that it would be interesting to study other types of attacks, as well as other network robustness measures such as network efficiency.
References
1. Albert, R., Barabási, A.L.: Statistical mechanics of complex networks. Reviews of Modern Physics 74, 47–97 (2002)
2. Albert, R., Jeong, H., Barabási, A.L.: Error and attack tolerance of complex networks. Nature 406, 378–382 (2000)
3. Amaral, L.A.N., Scala, A., Barthélemy, M., Stanley, H.E.: Classes of small-world networks. Proc. Natl. Acad. Sci. USA 97, 11149–11152 (2000)
4. Bollobás, B.: Modern Graph Theory. Springer, Heidelberg (1998)
5. Cohen, R., Erez, K., Avraham, D.B., Havlin, S.: Breakdown of the Internet under intentional attack. Phys. Rev. Lett. 86, 3682–3685 (2001)
6. Csardi, G., Nepusz, T.: The igraph software package for complex network research. InterJournal Complex Systems, 1695 (2006)
7. Holme, P., Kim, B.J., Yoon, C.N., Han, S.K.: Attack vulnerability of complex networks. Phys. Rev. E 65, 056109 (2002)
8. Kirkpatrick, S., Gelatt, C.D., Vecchi, P.: Optimization by simulated annealing. Science 220, 671–680 (1983)
9. Latora, V., Marchiori, M.: Efficient behavior of small-world networks. Phys. Rev. Lett. 87, 198701 (2001)
10. Newman, M.E.J.: Networks: An Introduction. Oxford University Press, Oxford (2010)
11. R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2010)
12. Schneider, C.M., Andrade, J.S., Shinbrot, T., Herrmann, H.J.: Protein interaction networks are fragile against random attacks and robust against malicious attacks. Tech. rep. (2010)
13. Schneider, C.M., Moreira, A., Andrade, J.S., Havlin, S., Herrmann, H.J.: Onion-like network topology enhances robustness against malicious attacks. J. Stat. Mech. (2010) (to appear)
14. Schneider, J.J., Kirkpatrick, S.: Stochastic Optimization. Springer, Berlin (2006)
15. Valente, A.X.C.N., Sarkar, A., Stone, H.: 2-peak and 3-peak optimal complex networks. Phys. Rev. Lett. 92, 118702 (2004)
Numerically Efficient Analytical MPC Algorithm Based on Fuzzy Hammerstein Models
Piotr M. Marusak
Institute of Control and Computation Engineering, Warsaw University of Technology, ul. Nowowiejska 15/19, 00–665 Warszawa, Poland
[email protected]
Abstract. A numerically efficient analytical MPC (Model Predictive Control) algorithm based on fuzzy Hammerstein models is proposed in the paper. Thanks to the form of the model, the prediction can be described by analytical formulas and the proposed algorithm is numerically efficient. It is shown that, thanks to a clever tuning of the controller, most of the calculations needed to derive the control value can be performed off-line. Thus, the proposed algorithm has the advantage reserved so far for analytical MPC algorithms based on linear models. At the same time, the algorithm offers practically the same performance as the MPC algorithm in which a nonlinear optimization problem must be solved at each iteration. The efficiency of the algorithm is demonstrated in the control system of a nonlinear control plant with delay.
Keywords: fuzzy control, fuzzy systems, predictive control, nonlinear control, constrained control.
1 Introduction
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 177–185, 2011. © Springer-Verlag Berlin Heidelberg 2011

The MPC algorithms use a model of the control plant to predict the behavior of the control system during generation of the control signals. Therefore, the MPC algorithms can be successfully used in control systems of processes with difficult dynamics (e.g. with large delays) and constraints [2,6,12,14]. In the standard MPC algorithms linear control plant models are used. However, such an approach applied to a nonlinear control plant may give unsatisfactory results, especially if the control system should be able to work at different operating points. In such a case operation of the control system may be improved by using an MPC algorithm in which the prediction is based on a nonlinear model. Straightforward utilization of a nonlinear process model, however, makes it necessary to solve a nonlinear (often non-convex) optimization problem at each iteration of the algorithm; see e.g. [1,3]. Such an optimization problem is usually hard to solve (numerical problems often occur). Moreover, the time needed to find the solution is difficult to predict. These are the reasons why MPC algorithms with linear approximations of the control plant models, obtained at each iteration, are often used [5,7,8,9,10,14]. Among many types of nonlinear models, Hammerstein models have interesting properties. They are composed of a linear dynamic block which follows
Fig. 1. Structure of the Hammerstein model; u – inputs, y – outputs, z – outputs of the nonlinear static block
a static nonlinearity (Fig. 1). Such models can be used for many processes, for example distillation columns or chemical reactors [4]. The static nonlinearity can be modeled in different ways. However, as fuzzy models offer many advantages [11,13], e.g. relative ease of model identification and simple derivation of a linear approximation, Hammerstein models with a fuzzy static part are considered in the paper. An efficient method of prediction generation using a fuzzy Hammerstein model and its linear approximation was proposed in [9]. An efficient numerical fuzzy MPC algorithm, formulated as a standard quadratic programming problem, was also proposed there. In this paper it is shown that the discussed prediction can be used to formulate a numerically efficient analytical fuzzy MPC algorithm. This algorithm is formulated in such a way that the main part of the calculations needed to derive the control value is performed off-line. Therefore, even solving the quadratic programming problem is avoided and the algorithm can be applied to fast control plants. In the next section the standard analytical MPC algorithm based on linear models is described. In Sect. 3 the proposed analytical fuzzy MPC algorithm is detailed. Example results illustrating the excellent performance offered by the proposed algorithm are presented in Sect. 4. The paper is summarized in Sect. 5.
2 Analytical MPC Algorithm Based on Linear Models (LMPC)
Control signals are generated in the Model Predictive Control (MPC) algorithms using prediction of the future behavior of the control plant many sampling instants ahead. The prediction is obtained using a process model. The values of control variables are calculated in such a way that the prediction fulfills assumed criteria. Usually, minimization of the following performance function is demanded [2,6,12,14]:
\[ \min_{\Delta u} J_{\mathrm{MPC}} = \sum_{i=1}^{p} \left( \overline{y}_k - y_{k+i|k} \right)^2 + \lambda \cdot \sum_{i=0}^{s-1} \left( \Delta u_{k+i|k} \right)^2 , \tag{1} \]
where $\overline{y}_k$ is a set-point value, $y_{k+i|k}$ is a value of the output for the $(k+i)$th sampling instant, predicted at the $k$th sampling instant, $\Delta u_{k+i|k}$ are future changes
in the manipulated variable, $\Delta u = [\Delta u_{k|k}, \ldots, \Delta u_{k+s-1|k}]^T$, $\lambda \ge 0$ is a weighting coefficient; $p$ and $s$ denote prediction and control horizons, respectively. The predicted values of the output variable $y_{k+i|k}$ are derived using a dynamic control plant model. If this model is linear then the superposition principle can be applied and the vector of predicted output values $y$ is described by the following formula:
\[ y = \tilde{y} + A \cdot \Delta u , \tag{2} \]
where $y = [y_{k+1|k}, \ldots, y_{k+p|k}]^T$; $\tilde{y} = [\tilde{y}_{k+1|k}, \ldots, \tilde{y}_{k+p|k}]^T$ is a free response of the plant, which contains future values of the output variable calculated assuming that the control signal does not change in the prediction horizon; $A \cdot \Delta u$ is the forced response, which depends only on future changes of the control signal;
\[ A = \begin{bmatrix} a_1 & 0 & \ldots & 0 & 0 \\ a_2 & a_1 & \ldots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_p & a_{p-1} & \ldots & a_{p-s+2} & a_{p-s+1} \end{bmatrix} \tag{3} \]
is a matrix composed of the coefficients $a_i$ of the control plant step response. It is called the dynamic matrix. Introduce the vector $\overline{y} = [\overline{y}_k, \ldots, \overline{y}_k]^T$ of length $p$. The performance function from (1) rewritten in the matrix-vector form is as follows:
\[ J_{\mathrm{MPC}} = (\overline{y} - y)^T \cdot (\overline{y} - y) + \Delta u^T \cdot \Lambda \cdot \Delta u , \tag{4} \]
where $\Lambda = \lambda \cdot I$ is the $s \times s$ matrix. After application of the prediction (2) to the performance function (4) one obtains:
\[ J_{\mathrm{LMPC}} = (\overline{y} - \tilde{y} - A \cdot \Delta u)^T \cdot (\overline{y} - \tilde{y} - A \cdot \Delta u) + \Delta u^T \cdot \Lambda \cdot \Delta u , \tag{5} \]
which depends quadratically on the decision variables $\Delta u$. Thus, if the problem without constraints is considered, the vector minimizing the performance function (5) is described by the following formula:
\[ \Delta u = \left( A^T \cdot A + \lambda \cdot I \right)^{-1} \cdot A^T \cdot (\overline{y} - \tilde{y}) . \tag{6} \]
The matrix $K = \left( A^T \cdot A + \lambda \cdot I \right)^{-1} \cdot A^T$ depends on the matrix $A$, which is constant. Thus the most complex part of the calculations can be performed off-line.

Remark. In the analytical MPC algorithms the control constraints are taken into consideration by using a mechanism of control projection on the constraint set; see e.g. [14]. The mechanism is simple, as it consists in application of the following rules of modification of increments of the manipulated variable:
• for changes of the manipulated variable:
  - if $\Delta u_{k|k} < \Delta u_{\min}$, then $\Delta u_{k|k} = \Delta u_{\min}$,
  - if $\Delta u_{k|k} > \Delta u_{\max}$, then $\Delta u_{k|k} = \Delta u_{\max}$;
• for values of the manipulated variable:
  - if $u_{k-1} + \Delta u_{k|k} < u_{\min}$, then $\Delta u_{k|k} = u_{\min} - u_{k-1}$,
  - if $u_{k-1} + \Delta u_{k|k} > u_{\max}$, then $\Delta u_{k|k} = u_{\max} - u_{k-1}$.
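The unconstrained control law and the projection rules of the Remark can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: all function names are hypothetical, the step-response coefficients are passed as a list a with a[0] = a_1, and only the first row of K is computed, since only the first increment is applied to the plant.

```python
def dynamic_matrix(a, p, s):
    """p x s dynamic matrix from step-response coefficients (a[0] = a_1)."""
    return [[a[r - c] if r >= c else 0.0 for c in range(s)] for r in range(p)]


def solve(m, b):
    """Solve m x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [row[:] + [b[i]] for i, row in enumerate(m)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def lmpc_gain(a, p, s, lam):
    """First row of K = (A^T A + lam I)^-1 A^T; can be computed off-line."""
    A = dynamic_matrix(a, p, s)
    M = [[sum(A[r][i] * A[r][j] for r in range(p)) + (lam if i == j else 0.0)
          for j in range(s)] for i in range(s)]
    # M is symmetric, so the first row of M^-1 equals the solution of M w = e_1.
    w = solve(M, [1.0] + [0.0] * (s - 1))
    return [sum(A[r][j] * w[j] for j in range(s)) for r in range(p)]


def lmpc_increment(gain, setpoint, free_response):
    """On-line part: scalar product of the gain row with (set-point - free response)."""
    return sum(g * (setpoint - yf) for g, yf in zip(gain, free_response))


def project(du, u_prev, du_min, du_max, u_min, u_max):
    """Projection of the increment on the constraint set (rules of the Remark)."""
    du = min(max(du, du_min), du_max)                  # rate constraints
    return min(max(du, u_min - u_prev), u_max - u_prev)  # amplitude constraints
```

Splitting the computation into `lmpc_gain` (done once) and `lmpc_increment` (done at every sampling instant) mirrors the observation that the most complex part of the calculations can be performed off-line.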
3 Analytical MPC Algorithm Based on Fuzzy Hammerstein Models (FMPC)
It is assumed that the process model is of the Hammerstein structure (i.e. the nonlinear static block is followed by the linear dynamic block) with fuzzy Takagi–Sugeno static part:

$$ z_k = f(u_k) = \sum_{j=1}^{l} w_j(u_k) \cdot z_k^j = \sum_{j=1}^{l} w_j(u_k) \cdot (b_j \cdot u_k + c_j) , \tag{7} $$
where $z_k$ is the output of the static block, $w_j(u_k)$ are weights obtained using fuzzy reasoning, $z_k^j$ are outputs of local models in the fuzzy static model, $l$ is the number of fuzzy rules in the model, $b_j$ and $c_j$ are parameters of the local models in the fuzzy static part of the model. It is also assumed that the dynamic part of the model has the form of the step response:

$$ \hat{y}_k = \sum_{n=1}^{p_d - 1} a_n \cdot \Delta z_{k-n} + a_{p_d} \cdot z_{k-p_d} , \tag{8} $$
where $\hat{y}_k$ is the output of the fuzzy Hammerstein model, $a_i$ are coefficients of the step response, $p_d$ is the horizon of the process dynamics (equal to the number of sampling instants after which the step response can be considered as settled). The proposed algorithm is based on prediction obtained in a way described in [9]. The idea of this prediction is to use the Hammerstein model (8) to obtain the free response. It is expressed by the following analytical formula:

$$ \tilde{y}_{k+i|k} = \sum_{n=i+1}^{p_d - 1} a_n \cdot \Delta z_{k-n+i} + a_{p_d} \cdot z_{k-p_d+i} + d_k , \tag{9} $$
where $d_k = y_k - \hat{y}_k$ is the DMC–type disturbance model, i.e. it is assumed the same for all instants in the prediction horizon. Next, the dynamic matrix, needed to predict the influence of the future control changes, is derived using at each algorithm iteration a linear approximation of the fuzzy Hammerstein model (8):

$$ \hat{y}_k^L = dz_k \cdot \left( \sum_{n=1}^{p_d - 1} a_n \cdot \Delta u_{k-n} + a_{p_d} \cdot u_{k-p_d} \right) , \tag{10} $$
where $dz_k$ is a slope of the static characteristic near the $z_k$. It can be calculated numerically using the formula

$$ dz_k = \sum_{j=1}^{l} \left( w_j(u_k + du) \cdot (b_j \cdot (u_k + du) + c_j) - w_j(u_k) \cdot (b_j \cdot u_k + c_j) \right) / du , \tag{11} $$

where $du$ is a small number. Thus, finally, the following dynamic matrix is obtained:

$$ A_k = dz_k \cdot \begin{bmatrix} a_1 & 0 & \ldots & 0 & 0 \\ a_2 & a_1 & \ldots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_p & a_{p-1} & \ldots & a_{p-s+2} & a_{p-s+1} \end{bmatrix} \tag{12} $$

The prediction is therefore described by:

$$ y = \tilde{y} + A_k \cdot \Delta u . \tag{13} $$
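The chain (7), (11), (12) can be illustrated with a short Python sketch. The local-model parameters $b_j$, $c_j$ are those reported for the distillation column in Section 4.1; everything else is an assumption made for illustration: the triangular membership partition and its breakpoints (`CENTERS`) are hypothetical and are not read from Fig. 3, and the step-response coefficients passed to `scaled_dynamic_matrix` are made up.

```python
# Local-model parameters b_j, c_j as reported for the distillation column
# in Section 4.1; the membership breakpoints below are hypothetical.
B = (-2222.4, -1083.2, -534.4)
C = (9486.0, 4709.3, 2408.7)
CENTERS = (4.10, 4.20, 4.35)  # assumed, not taken from Fig. 3


def weights(u):
    """Normalised weights w_j(u) for a triangular partition (assumption)."""
    c1, c2, c3 = CENTERS
    if u <= c1:
        return (1.0, 0.0, 0.0)
    if u >= c3:
        return (0.0, 0.0, 1.0)
    if u <= c2:
        w2 = (u - c1) / (c2 - c1)
        return (1.0 - w2, w2, 0.0)
    w3 = (u - c2) / (c3 - c2)
    return (0.0, 1.0 - w3, w3)


def static_output(u):
    """z = sum_j w_j(u) * (b_j * u + c_j), equation (7)."""
    return sum(w * (b * u + c) for w, b, c in zip(weights(u), B, C))


def slope(u, du=1e-6):
    """Forward-difference slope dz_k of the static characteristic, eq. (11)."""
    return (static_output(u + du) - static_output(u)) / du


def scaled_dynamic_matrix(a, p, s, dz):
    """A_k = dz * A, equation (12); a[0] holds a_1, a[1] holds a_2, ..."""
    return [[dz * a[i - j] if i >= j else 0.0 for j in range(s)]
            for i in range(p)]


dz = slope(4.05)  # inside the region where only rule 1 fires
A_k = scaled_dynamic_matrix([0.2, 0.5, 0.8, 1.0], p=4, s=2, dz=dz)
```

Where a single rule is active the numerical slope coincides (up to rounding) with that rule's $b_j$, which gives a quick sanity check of the implementation.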
After application of prediction (13) to the performance function (4) one obtains:

$$ J_{FMPC} = (\bar{y} - \tilde{y} - A_k \cdot \Delta u)^T \cdot (\bar{y} - \tilde{y} - A_k \cdot \Delta u) + \Delta u^T \cdot \Lambda \cdot \Delta u . \tag{14} $$
The performance function (14) depends quadratically on decision variables $\Delta u$. Thus, if constraints are not taken into consideration, the vector minimizing the performance function (14) at each iteration is described by the following formula:

$$ \Delta u = \left( A_k^T \cdot A_k + \lambda \cdot I \right)^{-1} \cdot A_k^T \cdot (\bar{y} - \tilde{y}) . \tag{15} $$

This time, however, in contrast to the MPC based on a linear model, the dynamic matrix changes at each iteration. Fortunately, thanks to the form of the dynamic matrix obtained from the Hammerstein model and a suitable tuning of the controller, the number of on–line calculations can be reduced considerably, as in the case of the LMPC algorithm. Assume that the weighting coefficient can be changed at each iteration; then

$$ \Delta u = \left( A_k^T \cdot A_k + \lambda_k \cdot I \right)^{-1} \cdot A_k^T \cdot (\bar{y} - \tilde{y}) . \tag{16} $$

Assume also that $\lambda_k = dz_k^2 \cdot \lambda$; in such a case, after using (12), one obtains:

$$ \Delta u = \left( dz_k^2 \cdot A^T \cdot A + dz_k^2 \cdot \lambda \cdot I \right)^{-1} \cdot dz_k \cdot A^T \cdot (\bar{y} - \tilde{y}) , \tag{17} $$

and finally:

$$ \Delta u = \frac{1}{dz_k} \cdot \left( A^T \cdot A + \lambda \cdot I \right)^{-1} \cdot A^T \cdot (\bar{y} - \tilde{y}) , \tag{18} $$

which can be written as:

$$ \Delta u = \frac{1}{dz_k} \cdot K \cdot (\bar{y} - \tilde{y}) . \tag{19} $$

Thus, as in the case of the LMPC algorithm, the main part of calculations can be performed off–line, as the matrix $K = \left( A^T \cdot A + \lambda \cdot I \right)^{-1} \cdot A^T$ does not change. Therefore, although the nonlinear fuzzy Hammerstein model is used, the main advantage of the analytical LMPC algorithm is retained.
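The equivalence of (16) and (19) can be checked numerically. The sketch below is an illustration in pure Python, not the author's implementation: the step response, horizons, weighting coefficient, error vector and slope are all hypothetical, and the helper names (`dynamic_matrix`, `gain_matrix`, `mat_vec`) are invented for this example. It assumes $dz_k \neq 0$.

```python
def dynamic_matrix(a, p, s):
    """p x s dynamic matrix (3); a[0] holds a_1, a[1] holds a_2, ..."""
    return [[a[i - j] if i >= j else 0.0 for j in range(s)] for i in range(p)]


def gain_matrix(A, lam):
    """K = (A^T A + lam * I)^{-1} A^T via Gauss-Jordan elimination."""
    p, s = len(A), len(A[0])
    # Augmented system [A^T A + lam * I | A^T]; reducing the left block to
    # the identity leaves K in the right block.
    aug = []
    for r in range(s):
        left = [sum(A[i][r] * A[i][c] for i in range(p)) + (lam if r == c else 0.0)
                for c in range(s)]
        right = [A[i][r] for i in range(p)]
        aug.append(left + right)
    for col in range(s):
        pivot = max(range(col, s), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        d = aug[col][col]
        aug[col] = [v / d for v in aug[col]]
        for r in range(s):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[s:] for row in aug]


def mat_vec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Hypothetical step response, horizons, weight, error vector and slope.
a = [0.2, 0.5, 0.8, 1.0, 1.0]
p, s, lam = 5, 2, 0.5
error = [1.0, 0.9, 0.8, 0.7, 0.6]  # stands in for (set-point - free response)
dz = -3.0                          # slope dz_k at the current instant

A = dynamic_matrix(a, p, s)
K = gain_matrix(A, lam)            # computed once, off-line

# On-line FMPC law (19): one matrix-vector product and a division.
du_fast = [x / dz for x in mat_vec(K, error)]

# Reference: law (16) with A_k = dz * A and lambda_k = dz^2 * lambda.
A_k = [[dz * v for v in row] for row in A]
du_direct = mat_vec(gain_matrix(A_k, dz * dz * lam), error)

print(all(abs(x - y) < 1e-9 for x, y in zip(du_fast, du_direct)))
```

The on-line cost of the fast variant is a single product with the constant matrix `K`, whereas the reference variant repeats the elimination at every sampling instant.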
4 Simulation Experiments

4.1 Control Plant
The experiments were made in the control system of the ethylene distillation column used for tests in [9]. The control plant is highly nonlinear and has a large time delay. The output y is the impurity of the product. The manipulated variable u is the reflux. During the experiments it was assumed that the reflux is constrained to 4.05 ≤ u ≤ 4.4. The Hammerstein model of the plant is depicted in Fig. 2a and its steady–state characteristic in Fig. 2b.
Fig. 2. a) Hammerstein model of the distillation column; b) Steady–state characteristic of the plant
Fig. 3. Membership functions of the static part of the Hammerstein model
The static part of the fuzzy Hammerstein model has the form of the Takagi–Sugeno model with three local models of the form: $z_k^j = b_j \cdot u_k + c_j$, where $b_1 = -2222.4$, $b_2 = -1083.2$, $b_3 = -534.4$, $c_1 = 9486$, $c_2 = 4709.3$, $c_3 = 2408.7$. The assumed membership functions are shown in Fig. 3.

4.2 Results
In order to test the properties of the proposed approach, three MPC algorithms were designed:
• an NMPC one (with nonlinear optimization),
• an LMPC one (analytical, based on a linear model) and
• an FMPC one (the proposed analytical algorithm based on the fuzzy Hammerstein model).

The sampling period was assumed equal to $T_s = 20$ min; the tuning parameters of all three algorithms were as follows: prediction horizon $p = 44$, control horizon $s = 20$. During the experiments the performance of the control systems with all three algorithms was compared. Example responses of the control systems to changes of the set–point value are shown in Fig. 4. The weighting coefficient was set to $\lambda = 2 \times 10^6$, because for $\lambda = 10^6$ numerical problems occurred in the NMPC algorithm. In contrast to the NMPC algorithm, the proposed FMPC algorithm did not have any problems with control calculation for $\lambda = 10^6$ (solid lines in Fig. 4) and generated the fastest responses. Slightly slower responses were obtained with the FMPC algorithm for $\lambda = 2 \times 10^6$ (dash–dotted lines in Fig. 4). It is also worth noting that in both cases the control signal in response to the set–point change to $\bar{y}_3 = 400$ ppm hits the constraint. Despite that, the responses are practically the same as in the case of the NMPC algorithm (dotted lines in Fig. 4), in which a constrained optimization problem is solved at each iteration. Both the FMPC and NMPC algorithms give satisfactory responses. The overshoot is very small and the character of these responses is the same for different set–points. Unfortunately, the LMPC algorithm (dashed lines in Fig. 4) operates almost as well as the other algorithms only for the set–point change to $\bar{y}_1 = 200$ ppm. When the set–point changes to $\bar{y}_2 = 300$ ppm or to $\bar{y}_3 = 400$ ppm, the LMPC algorithm performs unacceptably badly, which is caused by the significant nonlinearity of the control plant.

Fig. 4. Responses of the control systems to the change of the set–point values to $\bar{y}_1 = 200$ ppm, $\bar{y}_2 = 300$ ppm and $\bar{y}_3 = 400$ ppm; NMPC – dotted lines, LMPC – dashed lines, FMPC with $\lambda = 2 \times 10^6$ – dash–dotted lines, FMPC with $\lambda = 10^6$ – solid lines
5 Summary
An efficient analytical MPC algorithm based on the fuzzy Hammerstein model was proposed in the paper. The nonlinear model is used to derive the free response of the control plant, and its linear approximation is used to calculate the influence of future control actions. Thanks to this approach and to the choice $\lambda_k = dz_k^2 \cdot \lambda$ of the weighting coefficient, the most computationally demanding part of the calculations needed to derive the control value is performed off–line. Despite its simplicity, the algorithm outperforms its counterpart based on linear process models and offers control performance comparable with that of the algorithms with nonlinear optimization. Moreover, the proposed algorithm is more computationally robust than the algorithm with nonlinear optimization.

Acknowledgment. This work was supported by the Polish national budget funds for science 2009–2011 as a research project.
References
1. Babuska, R., te Braake, H.A.B., van Can, H.J.L., Krijgsman, A.J., Verbruggen, H.B.: Comparison of intelligent control schemes for real–time pressure control. Control Engineering Practice 4, 1585–1592 (1996)
2. Camacho, E.F., Bordons, C.: Model Predictive Control. Springer, Heidelberg (1999)
3. Fink, A., Fischer, M., Nelles, O., Isermann, R.: Supervision of nonlinear adaptive controllers based on fuzzy models. Control Engineering Practice 8, 1093–1105 (2000)
4. Janczak, A.: Identification of nonlinear systems using neural networks and polynomial models: a block–oriented approach. Springer, Heidelberg (2005)
5. Lawrynczuk, M.: A family of model predictive control algorithms with artificial neural networks. International Journal of Applied Mathematics and Computer Science 17, 217–232 (2007)
6. Maciejowski, J.M.: Predictive control with constraints. Prentice Hall, Harlow (2002)
7. Marusak, P.: Advantages of an easy to design fuzzy predictive algorithm in control systems of nonlinear chemical reactors. Applied Soft Computing 9, 1111–1125 (2009)
8. Marusak, P.: Efficient model predictive control algorithm with fuzzy approximations of nonlinear models. In: Kolehmainen, M., Toivanen, P., Beliczynski, B. (eds.) ICANNGA 2009. LNCS, vol. 5495, pp. 448–457. Springer, Heidelberg (2009)
9. Marusak, P.: On prediction generation in efficient MPC algorithms based on fuzzy Hammerstein models. In: Rutkowski, L., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2010. LNCS (LNAI), vol. 6113, pp. 136–143. Springer, Heidelberg (2010)
10. Morari, M., Lee, J.H.: Model predictive control: past, present and future. Computers and Chemical Engineering 23, 667–682 (1999)
11. Piegat, A.: Fuzzy Modeling and Control. Physica-Verlag, Berlin (2001)
12. Rossiter, J.A.: Model-Based Predictive Control. CRC Press, Boca Raton (2003)
13. Takagi, T., Sugeno, M.: Fuzzy identification of systems and its application to modeling and control. IEEE Trans. Systems, Man and Cybernetics 15, 116–132 (1985)
14. Tatjewski, P.: Advanced Control of Industrial Processes; Structures and Algorithms. Springer, London (2007)
Online Adaptation of Path Formation in UAV Search-and-Identify Missions

Willem H. van Willigen (1,2), Martijn C. Schut (1), A.E. Eiben (1), and Leon J.H.M. Kester (2)

(1) VU University Amsterdam (NL), {willem,mc.schut,ae.eiben}@few.vu.nl
(2) TNO Defence, Security and Safety, The Hague (NL), [email protected]
Abstract. In this paper, we propose a technique for optimisation and online adaptation of search paths of unmanned aerial vehicles (UAVs) in search-and-identify missions. In these missions, a UAV has the objective to search for targets and to identify those. We extend earlier work that was restricted to offline generation of search paths by enabling the UAVs to adapt the search path online (i.e., at runtime). We let the UAV start with a pre-planned search path, generated by a Particle Swarm Optimiser, and adapt it at runtime based on expected value of information that can be acquired in the remainder of the mission. We show experimental results from 3 different types of UAV agents: two benchmark agents (one without any online adaptation that we call ‘naive’ and one with predefined online behaviour that we call ‘exhaustive’) and one with adaptive online behaviour, that we call ‘adaptive’. Our results show that the adaptive UAV agent outperforms both the benchmarks, in terms of jointly optimising the search and identify objectives.

Keywords: adaptive algorithm, design and engineering for self-adaptive systems, unmanned aerial vehicles, search and identify.
1 Introduction
One of the most prevalent and important issues in reconnaissance, surveillance, and target acquisition (RSTA) flight missions is the ability to adapt one’s flight path based on acquired information. In such (often military) missions, planes acquire information about a specific territory by first exploring it, followed by surveilling and finally obtaining information about possible targets in the area. While some information about the territory may be available beforehand (making a priori planning possible), it is increasingly important to do the planning during the mission itself because of the very dynamic nature of RSTA missions at present day (e.g., unknown territory, rapidly moving targets). The possibility of such automated adaptability during the mission becomes very important when we take the human out of the loop, as we employ unmanned aerial vehicles (UAVs) in RSTA missions.

The problem that we address in this paper concerns the programming of such UAVs in situations where some information is available beforehand (for example, some knowledge about possible target locations throughout the territory), but where substantial performance may be gained by equipping the UAVs with online (in-flight) adaptation of the flight path based on collected real-time information. We employ a machine-learning approach to accomplish this.

Machine learning has been used to deal with different issues in UAV research and development. For example, Berger et al. [2] use a co-evolutionary algorithm for information gathering in UAV teams; Allaire et al. [1] have used genetic algorithms for UAV real-time path planning; and Sauter et al. [7,5] have used a swarming approach (for which a ground sensor network for coordination purposes is needed).

Recently, Pitre et al. [6] introduced a new measurement for (UAV) search and track missions. The introduced metric jointly optimises the objectives to 1) detect new targets, and 2) track previously detected targets. This particular metric has some desirable properties with respect to search-and-tracking: jointly optimises detection and tracking; easily compares different solutions; promotes early detection; encourages repeated observations of the same targets; and it is useful for resource management. However, this approach does not yet allow for online adaptation of the search path during the flight. In this paper, we provide a method for doing this. We build further on the work of Pitre et al. with two important differences: 1) we use the metric and calculations also for in-flight coordination and adaptation (whereas the original metric has reportedly only been used for off-line generation of paths), and 2) in our case study, the second objective (besides search) is to identify targets rather than tracking these.

This paper is structured as follows. In Section 2, we present the details of our adaptive algorithm. We report on the conducted simulation study in Section 3. Finally, Section 4 concludes and provides some pointers for future work.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 186–195, 2011. © Springer-Verlag Berlin Heidelberg 2011
2 Model
In this section, we describe the model that we used in terms of (1) the problem setting (i.e., search-and-identification of targets in some terrain with UAVs), and (2) our solution approach (i.e., objective function and adaptive behaviour of the UAV). We describe both these aspects in detail below.

Our solution approach enables a UAV to jointly optimise the objectives of searching and identification by a UAV in a given terrain. Although we have no exact knowledge on where targets are in the terrain (because that would render the search-aspect of the mission pointless), we have some a priori knowledge in terms of probability distributions over the terrain cells on whether a target could be there. Before the mission, we compute an optimal flight path for the UAV. When the UAV is in-flight, it is possible to adapt this path. The before-mission calculation of the optimal search path as well as the in-flight decision to-adapt-or-not is based on a number of value functions that are described in detail below.
2.1 Problem Setting
Terrain. The terrain to be searched is 60 by 60 nautical miles (nmi). It consists of a mountainous area, a desert, a small forest and some roads. Figures 1a and 1b show two maps of the terrain: the different types of terrain and the altitudes (which range from 856 m to 2833 m), respectively.¹ In both figures, the straight lines depict roads in the terrain.
Fig. 1. Scenario maps (taken from [6]): (a) terrain types; (b) altitude map
A UAV that flies over the terrain cannot detect targets equally well in all types of terrain. We represent the ability-to-detect by means of a detection probability, denoted by $p_{dot}$, where dot means detection-on-terrain. In Table 1a, the detection probabilities for the different types of terrain are shown. The right column of this table shows that the detection probabilities increase when targets are on a road.

Table 1. Scenario assumptions

(a) Detection probabilities for different types of terrain

    Terrain     p_dot    p_dot on road
    Desert      0.90     0.95
    Mountain    0.5      0.75
    Forest      0.10     0.50

(b) Percentage of targets per terrain type

    Terrain type    % Targets
    Mountain        90
    Desert          7
    Road            2
    Forest          1
Targets. In this scenario, targets are stationary (i.e., non-mobile) objects located throughout the searched terrain. We consider all targets to be equally important (i.e., not prioritising with respect to a specific aim of a mission).² Targets can be identified better when they are observed longer. We represent this gradually improving identification by means of a single scalar value, which increases as a UAV observes the object longer.

¹ These maps are the same as those used in [6].
² In [6], extensions are introduced that allow for varying the target importance.
UAVs. The UAVs in our model are planes that fly with a constant speed of 100 knots (kt) at a constant altitude of 3,000 meters above sea level. As previously mentioned, the UAV flies a particular search path that was determined beforehand. The adaptability of the UAV is that upon observation of a target, it may decide to fly a circle over the target, enabling better identification. This decision depends on the objective function presented later in this section. After finishing the circle, it continues its original search path. A UAV has only limited resources (e.g., fuel), thus when it decides to fly a circle, this means that the path shortens in the tail (details follow below).

How much a UAV can see on the ground depends on the altitude of the terrain. The detection range (in nmi) is defined as $range(alt) = -6.5 \cdot 10^{-4} \cdot alt + 1.96$, where $alt$ is the altitude of the terrain (in meters). We assume a viewing angle of about 51 deg in every direction. In the lowest regions of the terrain, the detection range is 1.4 nmi, while in the higher regions, this number drops to just 0.1 nmi. The probability that a UAV detects a target on the ground, denoted by $p_{det}()$, is determined by the detection range:

$$ p_{det}(cell) = \begin{cases} p_{dot} & \text{if within } range(alt) \\ 0 & \text{otherwise} \end{cases} \tag{1} $$

where $cell$ is a single location in the terrain. The UAV sensor automatically takes a picture every 30 seconds. In our scenario, a mission takes 2 hours, thus resulting in a total of 240 pictures taken and analysed. Finally, the maximum turning rate of the UAV is 2 degrees per second, which means that if the UAV wants to fly a circle above a certain object, this takes 3 minutes, or 6 pictures. Flying a circle above a target also means that the end of the search path is shortened by 3 minutes, or 6 pictures.

2.2 Solution Approach
We evaluate search paths by means of an objective function, based on (expected) value functions. This evaluation is needed for 1) the a priori calculations for determining optimal search paths, as well as for 2) in-flight adaptation of a search path. For the former (a priori search process), we provide more details in the following section. For the latter (in-flight adaptation), we provide details in this section after explaining the used value functions.

We employ two different functions for evaluation: first, the value function, that computes the total value of a path after flying; and second, the expected value, that estimates the value of a (partial) path before flying and, in case of the adaptive agent, during the flight. The value function is defined as:

$$ V = \sum_{t=1}^{T} \sum_{n=1}^{N} \mathrm{utilityGain}(n, t), $$

where $T$ is the number of discrete time intervals during the mission, $N$ is the number of detected targets at time $t$, and $\mathrm{utilityGain}(n, t)$ is the gain in utility of information for target $n$ at time $t$.

The utility gain function $\mathrm{utilityGain}(n, t)$ can be interpreted as the number of points scored for observing a target. Upon first observation of a target, the utility gain is 1. This increases linearly with time for the duration of observation of this target with a maximum utility gain of 6 per target. The reason for this maximum is that identification cannot improve after 6 detections. However, after 6 consecutive non-detections (when a target seen before is now undetected), known information about that target is reset, which means that when the UAV encounters that target after that time, new information can be gained yet again for that target.

We define the expected value function of a UAV search path as:

$$ E(V) = \sum_{t=1}^{T} \sum_{c=1}^{C} p_{det}(c) \, p_{target}(c), $$

where $T$ is the number of discrete time intervals during the mission, $C$ is the number of cells within the detection range of the UAV at time $t$, $p_{det}(c)$ is the probability of detection, which depends on the type of terrain at cell $c$, and $p_{target}(c)$ is the probability of a target being present at cell $c$. We assume this information to be available and, because of the high resolution of the terrain, we also assume that no more than one target can be present at each cell. This formula thus estimates the number of targets that will be detected during the length of the mission based on the probabilities of 1) the presence of a target and 2) detection by the UAV.

2.3 UAV Adaptive Agent
The UAV agent determines the behaviour of the UAV in terms of adapting the flight path or not. The online adaptive agent will decide on flying a circle above a detected target based on the expected value of the remaining search path. Pseudocode for this agent is depicted in Algorithm 1, which runs at each timestep of the flight, when a picture has been taken.

    /* The UAV starts flying the predetermined search path. At each timestep t
       when a picture is taken and analysed, the following code is executed: */
    if the UAV detects a target that has not been seen before then
        /* Determine the expected value of the rest of the search path
           (from current timestep t until the final timestep T) */
        expValueWithout = E(V)_{t,T};
        /* Determine the expected value of the search path when a circle is
           made. To do this, the expected value gets 6 points for the circle
           (unless the expected value of the circle is greater than 6), and
           the rest of the path has been made 6 steps shorter. */
        expValueWith = E(V)_{t+6,T} + max(6, E(V)_{t,t+5});
        if expValueWith > expValueWithout then
            flyCircle();
        else
            keepFollowingOriginalPath();
        end
    end

Algorithm 1. The algorithm for the online adaptive UAV agent
When the UAV is currently not flying a circle (because otherwise the UAV could start flying circles within circles and this would increase the complexity of getting back on the original path significantly), and a new target has been observed, two values are computed: the expected value of the rest of the search path without flying a circle (expValueWithout), and the expected value of the rest of the search path with the certainty of observing a certain target during the circle (expValueWith).
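The decision rule of Algorithm 1 can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: it takes the subscripts of $E(V)$ literally as timestep ranges, represents the path by a hypothetical list `step_values` holding the expected value contributed by each remaining picture, and uses invented function names throughout. Only the range formula and the constant 6 come from the text.

```python
def detection_range_nmi(alt_m):
    """Detection range in nautical miles for a terrain altitude in metres."""
    return -6.5e-4 * alt_m + 1.96


def step_expected_value(cells):
    """Sum of p_det(c) * p_target(c) over the cells in range at one timestep.

    cells is an iterable of (p_det, p_target) pairs.
    """
    return sum(p_det * p_target for p_det, p_target in cells)


def expected_value(step_values, start, end):
    """E(V) over timesteps start..end inclusive (0-based indices)."""
    return sum(step_values[start:end + 1])


CIRCLE_STEPS = 6   # a circle takes 3 minutes, i.e. 6 pictures
CIRCLE_GAIN = 6.0  # points credited for the circle in Algorithm 1


def should_fly_circle(step_values, t):
    """Decision rule of Algorithm 1 at timestep t, for a newly seen target."""
    last = len(step_values) - 1
    without = expected_value(step_values, t, last)
    circle = max(CIRCLE_GAIN, expected_value(step_values, t, t + CIRCLE_STEPS - 1))
    with_circle = expected_value(step_values, t + CIRCLE_STEPS, last) + circle
    return with_circle > without


# Hypothetical paths of 240 pictures with a constant expected value per picture.
sparse_path = [0.4] * 240   # few targets expected ahead
dense_path = [1.5] * 240    # many targets expected ahead
print(should_fly_circle(sparse_path, 10), should_fly_circle(dense_path, 10))
```

Under this literal reading, the comparison reduces to whether the expected value of the next six pictures is below 6: the circle is flown over sparse stretches of the path and skipped where the planned path already promises more.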
3 Experiments
In this section, we describe the experimental design and setup, the results we obtained and an analysis of these results.

3.1 Design and Setup
The main objective of this research is to investigate whether our online adaptive UAV agent improves the value of a predefined search path. To this end, we compare our agent, as described in Section 2, to two benchmark agents. With the Naive Agent, the UAV has a predefined search path and simply follows this path without any adaptation. The Exhaustive Agent is the other benchmark and has predefined online behaviour: the UAV starts flying the predefined search path and, each time it detects a target, it always decides to fly a circle around that target before continuing its path. This agent is necessary in our experiments because, if we want to show that it is beneficial for the value to sometimes fly a circle, we also need to show that it is not a good idea to always fly a circle.

Our experimental design has 3 independent variables that we systematically vary to investigate their effects: 1) target distribution, 2) search path, and 3) agent type.
– Target distributions: We have generated 10 different target distributions, each consisting of 1,000 targets, placed in the terrain using the distribution shown in Table 1b. For each type of terrain, the targets are normally distributed.
– Search paths: We run the experiments on 10 different search paths. We generated search paths by hand and we ran a simple Particle Swarm Optimisation (PSO) technique [3] to optimise these search paths based on their expected value. This work closely resembles the work described in [6]. After we ran the PSO algorithm for a fixed amount of time, we picked the 10 best paths for use in our experiments.
– Type of agent: As explained above, there are three types of agents: the naive agent (without any online adaptation), the exhaustive agent (that will always fly a circle upon detection of a new target) and the adaptive online agent (that will base its decision of flying a circle on expected value calculations).
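The size of this design can be made concrete with a small Python sketch. It assumes that the percentages of Table 1b are applied exactly to the 1,000 targets of a distribution (the text does not say whether the split is exact or sampled), and the function names are hypothetical.

```python
from itertools import product

# Percentage of targets per terrain type, copied from Table 1b.
TARGET_SHARE = {"mountain": 90, "desert": 7, "road": 2, "forest": 1}
AGENTS = ("naive", "exhaustive", "adaptive")


def targets_per_terrain(total=1000):
    """Number of targets per terrain type for one distribution (exact split assumed)."""
    return {terrain: total * pct // 100 for terrain, pct in TARGET_SHARE.items()}


def experiment_grid(n_distributions=10, n_paths=10):
    """All (distribution, path, agent) combinations to be simulated."""
    return list(product(range(n_distributions), range(n_paths), AGENTS))


counts = targets_per_terrain()
runs = experiment_grid()
print(counts, len(runs))
```

Each of the 100 path and distribution pairs is thus flown once per agent, which is what allows the paired comparisons between agents in the analysis.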
The main measurable is the obtained value of a search path given a type of agent. The higher the value of a search path, the better. For each combination of a search path and a target distribution, we measure the value of the paths that are generated by the three different agents. We hypothesise that the utilities of the paths generated by the adaptive agents are better than the utilities of the paths generated by the naive and the exhaustive agents. We also measure the number of detected targets and the total number of detections. Using these two metrics, we can see to what extent the different agents are better in searching, identification, or both.
The different types of terrain and the detection probabilities of the different types of terrain were explained above in Section 2. The UAV starts flying in the bottom right corner of the world.

3.2 Results
Before we present the results of our simulations, we give some illustrative screenshots of the simulation, showing different kinds of search paths (albeit somewhat simplified for reasons of clarity). Here, the UAV starts in the bottom right corner of the terrain, and each green dot is a location at which the UAV takes a picture, which is then analysed using one of the three agents. An example flight is shown in Figure 2.

Fig. 2. (a) shows an example naive path, without online adaptation; (b) shows an example exhaustive path, with many circles during the flight; and (c) shows an example adaptive path, with some circles here and there

In Figures 3a and 3b, the results for every run are shown in terms of value differences between the adaptive and the naive/exhaustive agents, respectively. On the x-axis of these charts are the 10 different target distributions. For each of these 10 target distributions, the results for the 10 different search paths that we used are shown. On the y-axis, the difference in value is shown.

Fig. 3. Differences between the adaptive agent and the benchmark agents. (a) V(adaptive) − V(naive). Positive values mean that the adaptive agent has outperformed the naive agent. (b) V(adaptive) − V(exhaustive). Positive values mean that the adaptive agent has outperformed the exhaustive agent.

Figures 4a and 4b are two histograms of the data from Figures 3a and 3b. From these histograms, it becomes clear that the data is not normally distributed, but slightly positively skewed. In the next section, we analyse this skewness. We have also included an example graph of the value accumulated during a single run in Figure 5. The figure shows for each timestep the value of the search path up until that point. All lines are non-descending, since value will only increase over time. In Table 2, the mean values for the total number of detections per run of the different agents are shown, as well as the mean number of uniquely detected targets per run. The ratio between these two values, which gives an indication of how well the identification objective is executed, is also included in this table.

3.3 Analysis
From Figures 3a and 3b, we can see that the adaptive agent generally performs better than the naive method, and much better than the exhaustive method. Some exceptions occur, for instance distribution 7. We analysed these exceptions and found that these UAV paths do not encounter as many targets as expected. The difference between the exhaustive and the adaptive agent is much larger. When many circles are flown in a short period of time, many targets will be detected many more than 6 times, which yields no further utility gain. The histograms in Figure 4 are positively skewed. Using the Wilcoxon Signed-Rank test, we found that the adaptive agent is significantly better than the naive and exhaustive agents at a significance level of p = 0.05, which validates our hypothesis. Figure 5 depicts an example run. In this figure, we observe that the naive agent does not significantly differ from the expected value. The exhaustive agent starts out well, but is outperformed by the other agents after some time. Note that Figure 5 is an example of one single run. Plots of other runs look different. This can also be derived from the other plots; sometimes the naive or exhaustive agents are better. But generally, the plots follow this pattern.
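The saturation effect mentioned above (detections beyond the sixth add nothing) can be illustrated with a small Python sketch. It encodes one possible reading of the utility gain of Section 2.2, in which every detection of a target is worth one point until 6 points have been collected, and 6 consecutive non-detections reset that counter; the paper does not give the function explicitly, so both the reading and the name `path_value` are assumptions.

```python
MAX_GAIN = 6       # identification cannot improve after 6 detections
RESET_AFTER = 6    # consecutive non-detections before information is reset


def path_value(detections):
    """Total value for one target given a per-timestep detection sequence."""
    gained = 0      # points accumulated since the last reset
    misses = 0      # current run of consecutive non-detections
    total = 0
    for seen in detections:
        if seen:
            misses = 0
            if gained < MAX_GAIN:
                gained += 1
                total += 1
        else:
            misses += 1
            if misses >= RESET_AFTER:
                gained = 0
    return total


# Ten consecutive detections are worth no more than six.
print(path_value([True] * 10), path_value([True] * 6))
```

Under this reading, an agent that keeps circling over already identified targets spends pictures without gaining value, which is consistent with the behaviour reported for the exhaustive agent.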
Fig. 4. Histograms of the differences between the adaptive agent and the benchmark agents: (a) histogram of Value(adaptive) − Value(naive); (b) histogram of Value(adaptive) − Value(exhaustive)
Fig. 5. Value increase over time (example). [Plot of (expected) utility, from 0 to 300, against time, from 0 to 200; curves: Expected Utility, Naive Controller, Exhaustive Controller, Adaptive Controller]

Table 2. The mean values for the number of uniquely detected targets, the total number of detections and the ratio between these values

                            Naive     Exhaustive    Adaptive
    # targets               171.04    68.44         148.99
    # detections            315.04    362.22        347.09
    detections / targets    1.84      5.29          2.33
Our second metric, i.e., the number of detections versus the number of uniquely detected targets, is depicted in Table 2. Using the numbers from this table, we can say something about strengths and weaknesses of each agent. We expected the naive agent to be the best in searching, the exhaustive agent to be the best in identification and the adaptive agent to be the best in jointly optimising these objectives. The naive agent has the highest mean number of uniquely detected targets, while the exhaustive agent has the highest ratio between the number of detections and the number of targets. The adaptive agent is best in jointly optimising these objectives.
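The ratios in Table 2 can be recomputed from the two rows of means. The Python sketch below does only that; note that it takes the ratio of the mean values, which need not equal the mean of per-run ratios, and the dictionary layout is merely a convenient assumption.

```python
# Mean values per run, copied from Table 2.
TABLE_2 = {
    "naive":      {"targets": 171.04, "detections": 315.04},
    "exhaustive": {"targets": 68.44,  "detections": 362.22},
    "adaptive":   {"targets": 148.99, "detections": 347.09},
}

ratios = {agent: round(row["detections"] / row["targets"], 2)
          for agent, row in TABLE_2.items()}
best_search = max(TABLE_2, key=lambda agent: TABLE_2[agent]["targets"])
best_identify = max(ratios, key=ratios.get)
print(ratios, best_search, best_identify)
```

The recomputed ratios match the last row of the table, and the two `max` calls reproduce the statement that the naive agent finds the most unique targets while the exhaustive agent has the highest number of detections per target.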
4 Conclusions
In this paper, we propose a UAV agent that online adapts its predefined search path according to actual observations during the mission. The adaptive agent flies a circle above a detected target when it expects that this will improve the total value of the search path. Our results show that our agent significantly outperforms both a naive and an exhaustive agent. However, not in every instance the adaptive agent outperforms the other two; in some cases one of the benchmarks is better. This result can be attributed to unexpected situations during the flight. We also conclude that each agent has its own strength. It depends on the user’s goal which agent is best. In our scenario, we want to jointly optimise
search and identification objectives. Using these objectives jointly, our adaptive agent outperforms the benchmarks. But if searching were the only objective, the naive agent would be better; likewise, if identification were the only objective, the exhaustive agent would be the better one. As a future research path, we will generalise the model further by introducing different kinds of vehicles with different kinds of capabilities (e.g., helicopters, ground vehicles, underwater vehicles). We will investigate how to model different capabilities and how the different vehicles in the field can make use of other vehicles’ capabilities. Related work in this direction has been done by Kester et al. [4] to find a unifying way of designing Networked Adaptive Interactive Hybrid Systems.
References
1. Allaire, F.C.J., Tarbouchi, M., Labonte, G., Fusina, G.: FPGA implementation of genetic algorithm for UAV real-time path planning. Intelligent Robot Systems 54, 495–510 (2009)
2. Berger, J., Happe, J., Gagne, C., Lau, M.: Co-evolutionary information gathering for a cooperative unmanned aerial vehicle team. In: 12th International Conference on Information Fusion, FUSION 2009, pp. 347–354 (2009)
3. Eberhart, R., Kennedy, J.: Particle swarm optimization. In: Proceeding of IEEE International Conference on Neural Networks, vol. 4 (1995)
4. Kester, L.J.M.H.: Designing networked adaptive interactive hybrid systems. In: Proceedings of IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, August 20-22, pp. 516–521 (2008)
5. Legras, F., Glad, A., Simonin, O., Charpillet, F.: Authority Sharing in a Swarm of UAVs: Simulation and Experiments with Operators. In: Carpin, S., Noda, I., Pagello, E., Reggiani, M., von Stryk, O. (eds.) SIMPAR 2008. LNCS (LNAI), vol. 5325, pp. 293–304. Springer, Heidelberg (2008)
6. Pitre, R.R., Li, X.R., DelBalzo, D.: A new performance metric for search and track missions 2: Design and application to UAV search. In: Proceedings of the 12th International Conference on Information Fusion, pp. 1108–1114 (2009)
7. Sauter, J.A., Matthews, R., Van Dyke Parunak, H., Brueckner, S.A.: Performance of digital pheromones for swarming vehicle control. In: AAMAS 2005: Proceedings of the Fourth International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 903–910. ACM, New York (2005)
Reconstruction of Causal Networks by Set Covering
Nick Fyson1,2, Tijl De Bie1, and Nello Cristianini1
1 Intelligent Systems Laboratory, Bristol University, Merchant Venturers Building, Woodland Road, Bristol, BS8 1UB, UK
2 Bristol Centre for Complexity Sciences, Bristol University, Queen’s Building, University Walk, Bristol, BS8 1TR, UK
http://patterns.enm.bris.ac.uk
Abstract. We present a method for the reconstruction of networks, based on the order of nodes visited by a stochastic branching process. Our algorithm reconstructs a network of minimal size that ensures consistency with the data. Crucially, we show that global consistency with the data can be achieved through purely local considerations, inferring the neighbourhood of each node in turn. The optimisation problem solved for each individual node can be reduced to a set covering problem, which is known to be NP-hard but can be approximated well in practice. We then extend our approach to account for noisy data, based on the Minimum Description Length principle. We demonstrate our algorithms on synthetic data, generated by an SIR-like epidemiological model.
Keywords: machine learning, network inference, data mining, complex systems, minimum description length.
1 Introduction
There has been increasing interest over recent years in the problem of reconstructing complex networks from the streams of dynamic data they produce. Such problems can be found in a highly diverse range of fields, for example in determining Gene Regulatory Networks (GRNs) from expression measurements [1], or the connectivity of neuronal systems from spike train data [2]. All share the similar challenge of extracting the causal structure of a complex dynamical system from streams of temporal data. We here address the distinct challenge of reconstructing networks from data corresponding to stochastic branching processes, occurring on directed networks and where a discrete ‘infection’ is propagated from node to node. The clearest analogy lies in the field of epidemiology, where instances of infection begin at particular nodes, before propagating stochastically along edges until the infection dies out. Recent work has seen a new approach to this problem based on a Maximum Likelihood framework, in which the approach was applied to meme data extracted from the international media and blogs [3,4]. We here outline our own approach to this problem, developed in parallel to that of Rodriguez et al., but a direct comparison is beyond the scope of this paper.
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 196–205, 2011. © Springer-Verlag Berlin Heidelberg 2011
2 Network Reconstruction
In this paper we consider two networks over the same set of nodes, the true underlying network $G_T = (V, E_T)$ and the reconstructed network $G_R = (V, E_R)$. We assume a dynamic branching process occurs on the network $G_T$, in which the transfer of ‘markers’ occurs. Markers originate at a particular node in the network, and then propagate stochastically from node to adjacent node, ‘traversing’ along only those edges that exist in the set $E_T$. Due to the range of potential applications our description of the method remains intentionally abstract, but with analogy to epidemiology we use terms such as infection, infectious and carrier. Each marker that is propagated through the network generates a ‘marker trace’ $M^i$, and the set of all marker traces is denoted by $\mathcal{M} = \{M^i\}$. The marker trace is represented by an ordered set of the nodes that carried that marker, in the order in which they became infected. We will use subscripts to refer to individual nodes in a marker trace. We formally define the notion of a marker trace as follows.

Definition 1 (Marker Trace, $M^i$). A Marker Trace $M^i$ is an ordered set of $n_i$ distinct nodes $w^i_j \in V$, and we denote it as $M^i = (w^i_1, w^i_2, \ldots, w^i_{n_i})$.

Each marker trace defines a total order over the reporting nodes, and we use the notation $v_i <_{M^i} v_j$ to state that the node $v_i$ appears before node $v_j$ in the marker trace $M^i$. For clarity in future definitions we also formally define a path from one node to another within a network.

Definition 2 (Path in a network $G = (V, E)$). A sequence $U = u_1, \ldots, u_k$ of nodes $u_i \in V$ is a path in $G = (V, E)$ if $\forall\, 1 \le i < k$, $(u_i, u_{i+1}) \in E$.

2.1 Problem Formulation and Global Consistency
Problem 1 (Informal Description). Given a set $\mathcal{M}$ of Marker Traces construct a network $G_R$, approximating the true network $G_T$ that generated $\mathcal{M}$.

Intuitively, it makes sense to choose $G_R$ such that it is capable of generating $\mathcal{M}$. Given our assumptions on the mechanism of data generation, this requires that for each marker a path exists from the originator to all other carrier nodes, passing only through nodes that have been previously infected. We will refer to this as ‘global consistency’ and formalise the intuitive notion as follows.

Definition 3 (Globally Consistent, GC). $G_R$ is GC with $M^i \iff \forall\, w^i_j$ with $j > 1$ $\exists$ a path $w^i_1, \ldots, w^i_j$ in $G_R$.

Trivially, it is clear that a completely connected network is consistent with all possible data, and hence we aim to reconstruct a consistent set $E_R$ of minimal size. Combining the above allows us to formalise our goal in terms of an optimisation problem.

Problem 2. $\operatorname{argmin}_{E_R} |E_R|$ subject to $G_R = (V, E_R)$ is GC with $M^i$ $\forall\, M^i \in \mathcal{M}$.
2.2 Local Consistency
For a reconstruction to make intuitive sense we require global consistency between network and data, but this is computationally impractical. Below, we demonstrate the equivalence of global consistency with ‘local consistency’, an alternative that allows us to consider the immediate neighbourhood of each node in turn. Local consistency requires that for each node reporting a particular marker, the node must have at least one incoming edge from a node that has reported the marker at an earlier time. This concept is formalised as follows.

Definition 4 (Locally Consistent, LC). $G_R$ is LC with $M^i \iff \forall\, w^i_j$ with $j > 1$ $\exists\, w^i_k$ with $k < j$ : $(w^i_k, w^i_j) \in E_R$.
Theorem 1 (LC ⇐⇒ GC). Demonstrating local consistency between $G_R$ and $M^i$ is necessary and sufficient to ensure global consistency.

Proof. Sufficiency may be quite intuitively demonstrated by induction. Where k is the number of nodes in the reporting sub-network, for the case k = 1, we have only the originator node, hence trivially there is a path from originator to all other nodes. For the case k = 2, we add a node with an incoming edge from the only other node. Again, trivially there is a path from the originator to every other node. For the case k = n + 1 we take the network for k = n, and add a node with an incoming edge from one of the existing nodes. If there is a path from originator to all nodes in the k = n network, there will be a path from originator to the new node in the case k = n + 1. Hence if the claim is true for k = n it is also true for k = n + 1, and LC implies GC. Conversely, if $G_R$ is GC then the final edge of the path reaching $w^i_j$ leaves a node that reported the marker earlier, which is exactly the incoming edge required by LC. Therefore LC ⇐⇒ GC.

This allows us to formulate an alternative but equivalent optimisation problem, using the concept of local consistency.

Problem 3. $\operatorname{argmin}_{E_R} |E_R|$ subject to $G_R = (V, E_R)$ is LC with $M^i$ $\forall\, M^i \in \mathcal{M}$.
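The equivalence in Theorem 1 can be checked on small examples. The following is a minimal sketch, not the authors' implementation: the function names and the representation of a trace as a tuple of node labels and of $E_R$ as a set of (parent, child) pairs are assumptions made for illustration.

```python
from collections import deque


def is_locally_consistent(edges, trace):
    """LC: every node after the first has an incoming edge from an earlier node."""
    for j in range(1, len(trace)):
        if not any((u, trace[j]) in edges for u in trace[:j]):
            return False
    return True


def is_globally_consistent(edges, trace):
    """GC: every node after the first is reachable from the originator by a
    path that only visits nodes reported earlier in the same trace."""
    for j in range(1, len(trace)):
        allowed = set(trace[:j + 1])          # earlier carriers plus the target
        seen, queue = {trace[0]}, deque([trace[0]])
        while queue:                          # breadth-first search from the originator
            u = queue.popleft()
            for (a, b) in edges:
                if a == u and b in allowed and b not in seen:
                    seen.add(b)
                    queue.append(b)
        if trace[j] not in seen:
            return False
    return True


# Hypothetical toy network and traces.
toy_edges = {("a", "b"), ("b", "c"), ("a", "d")}
for toy_trace in [("a", "b", "c"), ("a", "c", "b"), ("a", "d", "b", "c")]:
    print(toy_trace,
          is_locally_consistent(toy_edges, toy_trace),
          is_globally_consistent(toy_edges, toy_trace))
```

The local check only inspects incoming edges of one node at a time, which is what makes the node-by-node treatment possible.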
Crucially, to establish local consistency we need only consider the immediate neighbourhood of each node in turn. This explanation of reports is necessarily performed for each node individually, and in each of these subproblems we establish the minimal set of incoming edges required to explain all the reported markers. From now on, unless otherwise specified, we describe approaches as applied to discovering the parents of a particular node, which can then be applied to each node in turn.

2.3 Formulation in Terms of Set Covering
We treat the reconstruction on a node-by-node basis and denote the node under consideration as v. As specified by local consistency, in considering the incoming edges for a particular node we must include at least one edge from a node that has reported each marker at an earlier time. Each edge therefore ‘explains’ the
presence of a subset of the reported markers, and if the set of all incoming edges together explains all the reported markers, we ensure local consistency. This problem of ‘explaining’ marker reports may be neatly expressed as a set covering problem. We first formally state the set covering optimisation problem: Given a universe $A$ and a family $B$ of subsets of $A$, the task is to find the smallest subfamily $C \subseteq B$ such that $\bigcup C = A$. This subfamily $C$ is then the ‘minimal cover’ of $A$. Given this formal framework, we now define how these sets relate to our reconstruction problem.

Definition 5 (Universe, $A^v$). The set of all markers that have been reported by the node $v$, given by $A^v = \{i : v \in M^i\}$.

The node $v$ can have an incoming edge from any other node, and hence the space of potential incoming edges is $F^v = (V \setminus \{v\}) \times \{v\}$. As stated above, each potential incoming edge will ‘explain’ a subset of the markers reported by $v$, and therefore every edge $f^v_j \in F^v$ corresponds to one element $B^v_j$ in the family of subsets $B^v$.

Definition 6 (Family of subsets, $B^v = \{B^v_j\}$). Each subset $B^v_j$ is defined by a potential incoming edge $(v_j, v) = f^v_j \in F^v$, where $i$ is in $B^v_j$ if and only if $v_j$ appears earlier than $v$ in the marker trace $M^i$, given by $B^v_j = \{i : v_j <_{M^i} v\}$.

The set covering problem then requires us to find a subfamily $C^v \subseteq B^v$ such that $\bigcup C^v = A^v$, and this subfamily $C^v$ directly corresponds to a set of incoming edges for the node $v$.

Definition 7 (Reconstructed Incoming Edges, $E^v_R$). The set of reconstructed edges, $E^v_R$, consists of the set of all elements in $F^v$ that correspond to elements of $C^v$, given by $E^v_R = \{f^v_j \in F^v : B^v_j \in C^v\}$.
This then allows us to make a final definition of our optimisation problem, this time in terms of the set covering problem. The following problem is defined for each node $v \in V$.

Problem 4. $\operatorname{argmin}_{E^v_R} |E^v_R|$ subject to $A^v = \bigcup C^v$, where $C^v = \{B^v_j : f^v_j \in E^v_R\}$ and $E^v_R \subseteq F^v$.
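As an illustration of Definitions 5 and 6, the set covering instance for a single node can be assembled directly from the traces. This is a sketch under stated assumptions: the function name and data layout are hypothetical, and traces in which $v$ is the originator are skipped, since no incoming edge can explain them (an assumption the definitions above do not state explicitly).

```python
def set_cover_instance(traces, v):
    """Build (A_v, B_v) for node v from a list of marker traces.

    A_v: indices i of traces in which v appears after at least one other node.
    B_v: dict mapping each candidate parent u to {i : u appears before v in trace i}.
    """
    universe = set()
    family = {}
    for i, trace in enumerate(traces):
        if v not in trace:
            continue
        pos = trace.index(v)
        if pos == 0:
            continue  # assumption: v originated this marker, nothing to explain
        universe.add(i)
        for u in trace[:pos]:
            family.setdefault(u, set()).add(i)
    return universe, family


# Hypothetical toy traces over nodes a, b, c.
toy_traces = [("a", "b", "c"), ("b", "c"), ("c", "a")]
print(set_cover_instance(toy_traces, "c"))
```

Each key of the returned dictionary stands for one potential incoming edge $(u, v)$, so a cover of the universe by keys is a set of parents for $v$.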
Finally, repeating this optimisation for all nodes in the network, we get $E_R = \bigcup_v E^v_R$, allowing us to reconstruct the entire network through only local considerations.

2.4 Greedy Approximation to Set Covering
The set covering problem is known to be NP-hard, but in practice may be well approximated using a greedy approach. This algorithm is well documented for set covering, including bounds on worst-case performance [5,6], but here we briefly outline the approach. We wish to cover the set A by selecting from the family of subsets B. We first select the subset Bj ∈ B that covers the greatest number of
elements in $A$, i.e. such as to maximise $|B_j|$. The corresponding edge $f_j$ is then added to the set of reconstructed edges $E^v_R$. A subset of $A$ has now been covered, and hence these elements are removed both from $A$ and all subsets in the family $B$. This process is repeated until $A = \emptyset$.
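The greedy procedure just described can be sketched as follows. The function name and the dictionary representation of the family are assumptions for illustration; ties are broken by sorted label order purely to make the sketch deterministic, which the text does not prescribe.

```python
def greedy_set_cover(universe, family):
    """Greedy approximation to set covering.

    family maps a label (e.g. a candidate parent node) to a set of elements.
    Returns a list of (label, newly_covered_count) in selection order.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best_label, best_gain = None, 0
        for label in sorted(family):              # sorted only for deterministic ties
            gain = len(family[label] & uncovered)
            if gain > best_gain:
                best_label, best_gain = label, gain
        if best_label is None:
            raise ValueError("family does not cover the universe")
        chosen.append((best_label, best_gain))
        uncovered -= family[best_label]           # remove the covered elements
    return chosen


# Hypothetical instance: markers 0..4 and four candidate parents.
toy_universe = {0, 1, 2, 3, 4}
toy_family = {"a": {0, 1, 2}, "b": {2, 3}, "c": {3, 4}, "d": {0, 4}}
print(greedy_set_cover(toy_universe, toy_family))
```

Recording the number of newly covered elements alongside each selection is convenient if the selections are later ranked across nodes.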
3 Reconstruction from Noisy Data
The assumption thus far of noise-free data is unrealistic, and hence we define an extension of our approach. The basic approach assumes that every report of a marker is due to direct infection from a previously infected node, and hence requires that every report be explained. When noise is present these assumptions do not hold, and each missing marker report may imply false-positive edges in the reconstruction, a problem that will only be exacerbated by using more data. In executing the greedy approximation to set covering, we first select those subsets that cover the greatest number of remaining elements. While the noise level remains low, therefore, we will first select true edges, since they will tend to explain relatively more reports. We can therefore expect that the noise-induced false positives will be added toward the end. This is demonstrated empirically in Fig. 2a, and motivates our defining a criterion to halt the covering early.

3.1 Minimum Description Length
In selecting the optimal point to halt the set covering we appeal to the Minimum Description Length (MDL) principle [7]. This states that in model selection one should prefer models that are able to communicate the data in the lowest number of bits. This is in principle equivalent to considering Maximum Likelihood Estimation [8], but our case lends itself particularly well to the use of MDL. For the network, all edges are explicitly assigned 0 or 1 meaning the description is of fixed length, and hence does not in itself favour sparsity in the reconstructed network. The Description Length (DL) is dependent only on how efficiently the set of markers can be expressed in light of the network structure.

Coding Scheme. To describe a marker trace we must specify in order all those nodes that are members of the set $M^i$. A simple ordered list requires $\log N$ bits per node, where $N$ is the number of nodes in the network. This is straightforward, but using the framework of the underlying network may allow us to describe this same information in a compressed form. Instead of simply listing the reporting nodes, we describe the progression of the marker through the network. To allow coding of markers that are not consistent with the network, we also define a ‘supernode’ in addition to the standard network, which is the originator of all markers, and by definition a parent of every other node. Each report in a trace requires us to first identify the infecting parent, and then which of its children is the new carrier. All markers start at the ‘supernode’, so the cost of describing the first report is $\log 1 = 0$ to identify the parent and $\log d_{p_1} = \log N$ to specify the child (since the ‘supernode’ has an out-degree of $N$). For the second report there are now two potential parents, so the cost is
$(\log 2 + \log d_{p_2})$, where $d_{p_2}$ is the out-degree of the cheapest possible infecting parent (which where possible will be a standard node, otherwise the more costly ‘supernode’). Similarly, the cost for the third report is $(\log 3 + \log d_{p_3})$ bits, the fourth $(\log 4 + \log d_{p_4})$ and so on.

3.2 MDL as Stopping Criterion
While there is no explicit cost to defining edges, nodes of higher degree are more expensive to use as the parent of a report. Adding a new outgoing edge to a node may make the description of one report cheaper, but will increase the cost for all other reports using that node as a parent. In general, therefore, the network that allows the shortest description of all marker traces will lie at some point between completely disconnected and completely connected. To use MDL as a stopping criterion requires the final addition of edges to be considered globally. We still perform greedy set covering for each node in turn, but make a note of each edge and the number of additional elements covered when it is selected. After doing this for all nodes we have a list of edges across the whole network, along with their explanatory power within the greedy set covering framework. We then rank them all by the elements covered, and follow this order in adding edges to the network. We can therefore calculate the change in DL as edges are added, and subsequently select the network ER that gives the lowest total description length.
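The coding scheme of Sect. 3.1 and the stopping rule above can be sketched together. This is an illustrative reading of the text rather than the authors' code: it assumes base-2 logarithms, takes the already ranked edge list as an input (in practice it would come from the per-node greedy covering), and all function names are hypothetical.

```python
import math


def description_length(traces, edges, n_nodes):
    """Bits needed to encode all traces given an edge set.

    The t-th report of a trace costs log2(t) to name the parent among the t
    candidates (supernode plus t-1 earlier carriers) and log2(d) to name the
    child, where d is the out-degree of the cheapest valid parent.
    """
    out_degree = {}
    for (u, _) in edges:
        out_degree[u] = out_degree.get(u, 0) + 1
    total = 0.0
    for trace in traces:
        for t, node in enumerate(trace, start=1):
            degrees = [out_degree[u] for u in trace[:t - 1] if (u, node) in edges]
            d_parent = min(degrees) if degrees else n_nodes  # fall back to the supernode
            total += math.log2(t) + math.log2(d_parent)
    return total


def mdl_prefix(traces, ranked_edges, n_nodes):
    """Add edges in ranked order; return the prefix with the lowest description length."""
    current = set()
    best_edges, best_dl = set(), description_length(traces, current, n_nodes)
    for edge in ranked_edges:
        current.add(edge)
        dl = description_length(traces, current, n_nodes)
        if dl < best_dl:
            best_edges, best_dl = set(current), dl
    return best_edges, best_dl


# Hypothetical example: four nodes, three traces, two ranked candidate edges.
toy_traces = [("a", "b"), ("a", "b"), ("c", "d")]
toy_ranked = [("a", "b"), ("a", "d")]
print(mdl_prefix(toy_traces, toy_ranked, n_nodes=4))
```

In this toy case the second edge explains no report but raises the out-degree of node a, so the description length goes up again after the first edge, which is the trade-off the paragraph above describes.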
4 Empirical Evaluation
To assess the quality of our reconstruction we use both precision-recall curves and the Jaccard Distance (JD) between the set of true and reconstructed edges. For identical sets this has a value zero, and a value of one if the two sets have no elements in common at all. The JD is given by

$$JD = \frac{|E_T \cup E_R| - |E_T \cap E_R|}{|E_T \cup E_R|}. \qquad (1)$$

4.1 Generation of Synthetic Data
We use two models for directed networks of $N$ nodes. The first is an Erdős-Rényi model in which each edge exists with probability $p = 2/N$, resulting in a sparse network that is likely to be dominated by a single weakly-connected giant component. The second is generated through a preferential attachment model, in which each new node forms incoming edges from two existing nodes [9]. This results in a network of the same average degree as the ER model, but with a scale-free topology. We generate markers with a process based on the SIR epidemiological model. Each node can be in one of three states; Susceptible (S), Infected (I) or Recovered (R), and all nodes begin in state S. One randomly selected node is set to state ‘I’, and the state of each node in all subsequent time steps is determined
stochastically from its current state and that of all its parents. In each time step there is a probability $p_I = 0.1$ that the infection will propagate along any edge leading from an infected to susceptible node, and a probability $p_R = 0.1$ that any given infected node will recover. Parameters were selected to produce marker traces of reasonable length given network size. Additionally, we include a noise parameter, where $p_{loss} = 0.05$ means 5% of all reports are lost.

4.2 Naive Approaches
We define two naive algorithms for network reconstruction, to which it will be instructive to compare our set covering approach. The most immediately obvious explanation for the creation of a marker trace is that each node became infected by the node immediately preceding it in time. Simply taking the union of all these edges results in a network that is consistent with the data, and is given by

$$E_{N1} = \bigcup_{M^i \in \mathcal{M}} \{(w^i_n, w^i_{n+1}) \ \forall n\}. \qquad (2)$$

In the second naive approach we consider only the first two reports from each trace. This does not use all the available information, but in the noise-free case guarantees no false positives, and is given by

$$E_{N2} = \bigcup_{M^i \in \mathcal{M}} \{(w^i_1, w^i_2)\}. \qquad (3)$$
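To make Eqs. (1)-(3) concrete, the following sketch generates noise-free traces with a simple SIR-like process on a small Erdős-Rényi network and scores both naive reconstructions with the Jaccard Distance. It is a toy under stated assumptions: the synchronous update order, the network size of 20 nodes, the 200 traces, the seed and all function names are illustrative choices, not the experimental setup of the paper.

```python
import random


def sir_trace(edges, nodes, p_inf=0.1, p_rec=0.1, rng=random):
    """One noise-free marker trace: nodes in the order they become infected."""
    children = {u: [] for u in nodes}
    for (u, w) in edges:
        children[u].append(w)
    origin = rng.choice(sorted(nodes))
    infected, reported, trace = {origin}, {origin}, [origin]
    while infected:
        newly = []
        for u in sorted(infected):                 # sorted for reproducibility
            for w in children[u]:
                if w not in reported and rng.random() < p_inf:
                    reported.add(w)
                    newly.append(w)
        recovered = {u for u in sorted(infected) if rng.random() < p_rec}
        infected = (infected - recovered) | set(newly)
        trace.extend(newly)
    return tuple(trace)


def naive_1(traces):
    """Eq. (2): an edge between every pair of consecutive reports."""
    return {(t[n], t[n + 1]) for t in traces for n in range(len(t) - 1)}


def naive_2(traces):
    """Eq. (3): only the edge between the first two reports of each trace."""
    return {(t[0], t[1]) for t in traces if len(t) > 1}


def jaccard_distance(e_true, e_rec):
    """Eq. (1); defined as 0 when both edge sets are empty."""
    union = e_true | e_rec
    if not union:
        return 0.0
    return (len(union) - len(e_true & e_rec)) / len(union)


rng = random.Random(0)
nodes = range(20)
true_edges = {(u, w) for u in nodes for w in nodes
              if u != w and rng.random() < 2 / 20}
traces = [sir_trace(true_edges, nodes, rng=rng) for _ in range(200)]
print("JD naive 1:", jaccard_distance(true_edges, naive_1(traces)))
print("JD naive 2:", jaccard_distance(true_edges, naive_2(traces)))
```

In this noise-free setting the second reporter can only have been infected by the originator, so every edge returned by `naive_2` is a true edge, in line with the no-false-positives remark above.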
[Figure 1: panels (a)-(c) plotting Jaccard Distance, True +ve rate and False +ve rate against # markers (0 to 800), with curves for naive 1, naive 2, set covering and set covering bound.]
Fig. 1. Performance of set covering reconstruction, relative to naive approaches and theoretical bounds. For TPR, the data for naive 2 and set covering bound coincide. FPR for naive 2 is always zero, and hence not shown. Results are shown for Erdős-Rényi networks of 100 nodes.
4.3 Reconstruction from Noise-Free Data
Figure 1 shows the results of network reconstruction using our set covering algorithm. Our algorithm clearly exceeds the worst-case bounds and the naive approaches, indicating that the greedy heuristic performs well, and that we make
efficient use of the information. Figures 1b and 1c clearly show that false positives are the cause of the poor performance of the first naive approach. This is entirely expected, but illustrates that it is not sufficient to simply find any network that is consistent with the data. In contrast, given infinite data the set covering approach tends toward perfect performance, and at a faster rate than the second naive approach.

4.4 Reconstruction from Noisy Data
Figure 2a gives the empirical motivation for using a halting criterion, showing that for noisy data the closest match to the true network is obtained before the set covering is complete. The circles plotted in Fig. 2 indicate the point at which the minimum description length was obtained. Figures 2c and 2d demonstrate that halting using MDL includes the majority of true positives, but limits the inclusion of false edges. This is demonstrated even more clearly in Fig. 3, where results are shown both with and without the use of MDL stopping. Figure 3c shows that MDL halting bounds the rate of false positives as the amount of data increases, in stark contrast to results for the basic set covering approach.
[Figure 2: panels (a)-(d) plotting Jaccard Distance, Description Length, True +ve rate and False +ve rate against # edges added, with one line each for ploss = 0.00, 0.05 and 0.10.]
Fig. 2. Plots showing variation of JD, DL, TPR and FPR with the progress of set covering. Circles indicate the point at which the MDL criterion would have halted the covering. Each line shows reconstruction of a 100 node Erdős-Rényi network, using 1000 markers.
Finally, in Figure 4 we demonstrate the performance of our approach on much larger networks, using 3000 markers to reconstruct ER and scale-free networks of size 1000 nodes. As one would expect, the performance becomes progressively worse with higher noise levels, but as indicated by the symbols in Fig. 4a, our MDL stopping allows us to maintain good precision even with relatively high levels of noise. In Fig. 4b we see that the set covering reconstruction can achieve high performance for scale-free networks, and is in fact more robust to noise, but it is important to note that the MDL stopping does not work for this network topology. This is likely due to the presence of hubs in scale-free networks, and will require an alternative coding scheme to account for this.
[Figure 3: panels (a)-(c) plotting Jaccard Distance, True +ve rate and False +ve rate against # markers (0 to 2000).]
Fig. 3. Plots showing clear benefit of MDL stopping for noisy data. Solid line is with MDL stopping, dotted line without. Network was Erdős-Rényi model of 100 nodes and $p_{loss} = 0.05$.

[Figure 4: two panels plotting precision against recall, with legend entries for ploss = 0.00, 0.01, 0.05, 0.10 and 0.20.]
Fig. 4. Precision-recall curves for reconstruction under various noise conditions, with symbols indicating the MDL halting point. Subfig (a) shows results for an Erdős-Rényi network, while (b) shows results for a scale-free network. Networks are of size 1000 nodes, reconstructed from 3000 markers.
5 Conclusions
Our work demonstrates a novel approach to the reconstruction of causal networks underlying stochastic branching processes, such as from data representing information flow or the spread of an epidemic on a network. Using the intuitive notion of consistency between a network and such data, we demonstrated that the entire network can be reconstructed node by node, using only local considerations. In this way, we were able to reformulate the problem in terms of the set covering problem, which is NP-hard but can be approximated well using an efficient greedy algorithm. The extension of our approach using Minimum Description Length as a criterion for halting the covering allows reconstruction to be performed on noisy data, and we demonstrated this for large networks of different topologies and with various noise levels. In further work we plan to investigate direct optimisation of the MDL cost function, as well as alternative coding schemes to account for different network topologies. Another avenue for extending the approach is the use of exact time information, rather than considering only the order of reports. Finally, we intend
to apply our methods to various real-life data sets, such as the propagation of memes on the media network [10,11], and fault propagation data [12]. Acknowledgments. Nick Fyson is supported by the Bristol Centre for Complexity Sciences (EPSRC grant EP/5011214) and Nello Cristianini is supported by a Royal Society Wolfson Merit Award. This work is partially supported by EPSRC grant EP/G056447/1 and by the European Commission through the PASCAL2 Network of Excellence (FP7-216866).
References
1. Sprinzak, D., Elowitz, M.B.: Reconstruction of genetic circuits. Nature 438(7067), 443–448 (2005)
2. Brown, E.N., Kass, R.E., Mitra, P.P.: Multiple neural spike train data analysis: state-of-the-art and future challenges. Nature Neuroscience 7(5), 456–461 (2004)
3. Leskovec, J., McGlohon, M., Faloutsos, C., Glance, N., Hurst, M.: Cascading behavior in large blog graphs. In: SDM 2007 (2007)
4. Rodriguez, M., Leskovec, J., Krause, A.: Inferring networks of diffusion and influence. In: KDD 2010 (2010)
5. Chvatal, V.: A greedy heuristic for the set-covering problem. Mathematics of Operations Research 4(3), 233–235 (1979)
6. Slavík, P.: Improved performance of the greedy algorithm for partial cover. Information Processing Letters 64(5), 251–254 (1997)
7. Wallace, C.S., Boulton, D.M.: An information measure for classification. Computer Journal 11(2), 185–194 (1968)
8. MacKay, D.J.C.: Information Theory, Inference & Learning Algorithms, 1st edn. Cambridge University Press, Cambridge (2002)
9. Barabási, A.-L., Albert, R.: Emergence of scaling in random networks. Science 286(5439), 509–512 (1999)
10. Leskovec, J., Backstrom, L., Kleinberg, J.: Meme-tracking and the dynamics of the news cycle. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 497–506 (2009)
11. Snowsill, T., Nicart, F., Stefani, M., De Bie, T., Cristianini, N.: Finding surprising patterns in textual data streams. In: Proceedings of Cognitive Information Processing 2010 (April 2010)
12. Nageswara Rao, S., Viswanadham, N.: Fault diagnosis in dynamical systems: a graph theoretic approach. International Journal of Systems Science 18(4), 687–695 (1987)
The Noise Identification Method Based on Divergence Analysis in Ensemble Methods Context
Ryszard Szupiluk1,2, Piotr Wojewnik1,2, and Tomasz Zabkowski1,3
1 Polska Telefonia Cyfrowa Ltd., Al. Jerozolimskie 181, 02-222 Warsaw, Poland
2 Warsaw School of Economics, Al. Niepodleglosci 162, 02-554 Warsaw, Poland
3 Warsaw University of Life Sciences, ul. Nowoursynowska 159/34, Warsaw, Poland
{rszupiluk,pwojewnik,tzabkowski}@era.pl
Abstract. In this paper we propose a divergence-based method for noise detection in the ensemble method context, where the prediction results from different models are treated as a multidimensional variable that contains constructive and destructive latent components. The crucial stage is the proper classification of destructive and constructive components. We propose to calculate the noisiness of a particular latent component as its divergence from a chosen reference noise. This allows us to identify a wide range of noises besides the typical signals with a closed analytical form, such as Gaussian or uniform. A real-data experiment with load energy prediction confirms the presented methodology.
Keywords: Ensemble methods, Noise detection, Statistical divergences.
1 Introduction
Random noise detection is one of the fundamental tasks in signal and data processing [19]. It can be found in the context of ensemble methods based on signal separation. In this approach, the prediction results from different models are treated as a multidimensional variable containing hidden constructive and destructive components [17]. The destructive components result from errors in data (measurement, input, editing etc.), model misspecification or suboptimal model training. On the other hand, the constructive components are the elements of the predicted value reproduced by particular models. These latent components are identified using the methods of blind signal separation [4,5,9,15]. In this case, the method of model building is not crucial, because we do not formulate any assumptions in this regard. From this point of view, the presented approach differs from other ensemble methods, where the model results are averaged and the ensemble might be successful only if the assumptions on the structure of the aggregated models are met [3,18]. One of the key issues in this method is the correct classification of, and distinction between, destructive and constructive components. The latent components, both destructive and constructive, may have many different forms and characteristics. They can be described in terms of regularity, variability, sparseness or randomness, and some of these properties appear jointly. We will focus here only on the destructive components which can be treated as random noises, taking into account some of the above-mentioned properties.
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 206–214, 2011. © Springer-Verlag Berlin Heidelberg 2011
2 Data Decomposition for Prediction Ensemble
The ensemble method via blind signal separation is stated as follows. After the learning process we collect the particular primary model results $x_i$ together in one multivariate variable $X = [x_1, x_2, \ldots, x_n]^T$, $X \in \mathbb{R}^{n \times N}$, where $N$ is the number of observations. We assume that those prediction results contain certain latent components $S = [s_1, s_2, \ldots, s_n]^T$. A latent component can be constructive or destructive for the prediction results. The constructive components $s_j = \hat{s}_j$ are associated with the true predicted value, whereas the destructive components $s_l = \tilde{s}_l$ are responsible for errors. Next we assume that each prediction result is a linear mixture of the latent components, where the relation can be described in matrix form as

$$X = AS, \qquad (1)$$

where $S = [s_1, s_2, \ldots, s_n]^T = [\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_k, \tilde{s}_{k+1}, \ldots, \tilde{s}_n]^T$, $S \in \mathbb{R}^{n \times N}$, and the matrix $A \in \mathbb{R}^{n \times n}$ represents the mixing system. Relation (1) means factorization of the matrix $X$ by the latent components matrix $S$ and the mixing matrix $A$. Our aim is to find the latent components, reject the destructive ones (replace them with zero) and then mix the constructive components back to obtain improved prediction results as

$$\hat{X} = A\hat{S} = A[\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_k, 0_{k+1}, \ldots, 0_n]^T. \qquad (2)$$

The above ensemble methodology presents us with two fundamental problems: how to estimate the components $S$ (or equivalently the matrix $A$), and how to choose the destructive ones. Some solutions to the first problem are proposed by blind signal separation (BSS) methods, which aim at identification of unknown signals mixed in an unknown system [4,5,9]. The process of estimating the matrix $A$ with BSS methods can, under several conditions, exploit different properties of the data such as independence, decorrelation, sparsity, smoothness and non-negativity. Many of those approaches have been tested and successfully applied in practice for the model aggregation problem via BSS [17].
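Relations (1) and (2) can be illustrated on a toy example. The sketch below assumes that the mixing matrix $A$ and the components $S$ are already known, whereas in practice they would be estimated by a BSS method; the numerical values, the function names and the choice of which component is destructive are all hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def reject_components(a, s, destructive):
    """Eq. (2): replace the rows of S listed in `destructive` by zeros, then remix with A."""
    s_hat = [[0.0] * len(row) if idx in destructive else list(row)
             for idx, row in enumerate(s)]
    return matmul(a, s_hat)


# Hypothetical latent components: row 0 is constructive, row 1 is destructive.
S = [[1.0, 2.0, 3.0, 4.0],
     [0.5, -0.5, 0.5, -0.5]]
A = [[1.0, 1.0],
     [1.0, -1.0]]

X = matmul(A, S)                        # Eq. (1): predictions of two models
X_hat = reject_components(A, S, {1})    # Eq. (2): destructive component removed
print(X)
print(X_hat)
```

Here both rows of `X_hat` coincide with the constructive component, while the rows of `X` differ from it by the mixed-in destructive component.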
For the present consideration we can assume that the matrix A and the set of latent components S are obtained from an ICA or PCA algorithm [4,9]. It is still an open question, however, how to indicate the destructive components properly. This can be a difficult task in general, because the obtained components might not be purely constructive or destructive, for reasons such as an improper linear transformation assumption or statistical characteristics other than those exploited by the chosen BSS method. Consequently, it is possible that some component has a constructive impact on one model and a destructive impact on another. There may also exist components that are destructive on their own but constructive in a group. As a simple solution we might eliminate each signal subset of S and check
the impact on the final results. This procedure is quite simple and often works well, but for a higher number of components it becomes computationally expensive. Therefore, we look for another method that can be used to classify the basis latent components. One way is to associate the destructive components with random noises. This leads us to the noise detection problem.
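The exhaustive procedure described above can be sketched as follows, which also shows why its cost grows quickly: for n components there are 2^n - 1 candidate subsets to evaluate (every subset except the one that removes everything). The function `best_elimination`, the callback `error_after_removing` and the toy error model are hypothetical; in practice the callback would remix the remaining components and compute a criterion such as MAPE.

```python
from itertools import combinations


def best_elimination(n, error_after_removing):
    """Try every subset of the n components (except removing all of them)
    and return the subset whose removal gives the lowest error."""
    best_subset = ()
    best_error = error_after_removing(best_subset)   # baseline: keep all
    for size in range(1, n):
        for subset in combinations(range(n), size):
            error = error_after_removing(subset)
            if error < best_error:
                best_subset, best_error = subset, error
    return best_subset, best_error


# Hypothetical stand-in for "remix without these components and measure the
# prediction error": each removed component shifts the error by a fixed amount.
EFFECT = {0: 0.20, 1: 0.01, 2: 0.01, 3: -0.10, 4: 0.01, 5: -0.05}
calls = []


def toy_error(subset):
    calls.append(subset)
    return 2.0 + sum(EFFECT[i] for i in subset)


subset, error = best_elimination(6, toy_error)
```

With six components the search already needs 63 error evaluations, and each additional component roughly doubles that number.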
3
Random Noise Detection Problem
The term random has wide-ranging physical, epistemological, philosophical and mathematical meanings, and deeper insight into randomness typically leads us to notions such as uncertainty, complexity, predictability and even logic [14]. Consequently, random noise can be defined, described and analyzed in many ways, but the probabilistic approach seems to be the most natural [18,19]. In this approach the typical noise model is described by the Gaussian distribution, which is strongly supported from the theoretical point of view by the Central Limit Theorem [10]. Of course, there are many other noise models based on the uniform distribution, mixtures of Gaussians, alpha-stable distributions, 1/f noises, etc. [11,12,16]. One of their common features is the closed analytical form of the noise distribution, expressed by the probability distribution function, the cumulative distribution function or the characteristic function. Starting from such theoretical assumptions we can perform real empirical research such as correlation analysis or R/S analysis [10,13,15]. The problem with noise is that its mathematical definition and interpretation can be far from physical reality, especially when "physical" refers to economic, social or medical phenomena [13,16]. A well-known example is the mathematical white noise, defined as a sequence of independent and identically distributed (IID) random variables [6,18,19]. In technological problems such as signal filtration or system identification it is very convenient to model the random distortions as white IID noises, especially when the useful signals are regular and easily recognizable [7]. In other words, we have some a priori information about what the noise and the desired signal look like. Thus, the white noise description might be a fine approximation of the signal disruptions, even if they are colored noises. The situation is more complicated if the mathematical description does not reflect the usefulness of the signal.
We can observe this problem on the financial markets, where the logarithmic returns are described by stationary stochastic processes [16]. In practice, this means assuming that the rate of return is generated from a series of identically distributed random variables. Moreover, the variables are uncorrelated, or are transformed with the Box-Jenkins methodology to obtain this property [1,2,16]. As a consequence, a signal fulfilling the mathematical definition of white noise might be both a disruption (in technological applications) and useful information (in financial time series). We can conclude that the mathematical definition of random noise does not necessarily imply its positive or negative role in the context of the analyzed problem. The situation becomes more complicated if both
the useful and the disruptive signals are similar. We can observe this situation in the case of model aggregation via BSS methods. The constructive and destructive components are mixtures of random and deterministic signals with unknown distributions. We can distinguish the noise only if we define its properties a priori. But how can we formulate such a priori information, especially if the data distributions are far from closed analytical distributions such as the Gaussian or the uniform one?
4
Reference Noise
To solve the above problem we can find some reference noise, or a set of reference noises, and compare them with the analyzed signal. The reference noise can easily be obtained from random number generators if we have a priori information. This seems especially adequate for non-Gaussian or multimodal distributions, for which it can be more effective to obtain some reference noise signal than to assume and check its statistical properties theoretically. The most interesting case is the reference noise identified from the data of the analyzed problem. In the prediction task the signal might be found from a decomposition of the target value or of the input data. There are many ways to acquire a reference noise from the target, e.g. signal decomposition, filtration or differentiation. In our approach, to find a reference noise $z$ from the target $d$ we propose a simple formula:

$$z(t) = d(t) - \mathrm{MA}_d(n), \tag{3}$$

where $\mathrm{MA}_d(n)$ denotes the $n$-point moving average of $d$. In this way we formulate the problem of random noise detection as the question of whether a certain signal is similar to the reference random noise. The main question now is how to measure the similarity between a particular signal and the reference noise. For this task we apply a divergence approach.
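Formula (3) can be sketched in a few lines. The text does not say how the moving-average window is aligned or how the first samples are handled, so the trailing window and the shortened window at the start of the series are assumptions of this sketch; the function names are hypothetical.

```python
def moving_average(d, n):
    """Trailing n-point moving average; the window is shorter for the
    first n - 1 samples (an assumption, the alignment is not specified)."""
    out = []
    for t in range(len(d)):
        window = d[max(0, t - n + 1):t + 1]
        out.append(sum(window) / len(window))
    return out


def reference_noise(d, n):
    """Residual z(t) = d(t) - MA_d(n) from formula (3)."""
    return [x - m for x, m in zip(d, moving_average(d, n))]
```

For hourly data with a weekly cycle, a call such as `reference_noise(target, 168)` would correspond to a 7 × 24 hour window.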
5
Similarity Measure and Divergences
The similarity measure between two nonnegative sequences or patterns can be taken as a divergence function [5]. A divergence function $D(y\|z)$ between two variables $z$ and $y$ fulfills the following conditions: $D(y\|z) \ge 0$, and $D(y\|z) = 0$ only if $y = z$. The triangle inequality $D(y\|z) \le D(y\|x) + D(x\|z)$, however, is not a necessary condition for a divergence. Such divergence functions are usually used on a manifold of probability distributions but can also be used for general data analysis. The most popular types of divergences are the following [5]:

1. Kullback-Leibler divergence

$$D_{KL}(z\|y) = \sum_{i=1}^{N} z_i \ln\frac{z_i}{y_i}, \tag{4}$$

2. Csiszar divergence, defined as

$$D_C(z\|y) = \sum_{i=1}^{N} \varphi\!\left(\frac{y(t)}{z(t)}\right), \tag{5}$$
where $z(t) \ge 0$, $y(t) \ge 0$ and the function $\varphi: [0, \infty) \to (-\infty, \infty)$ is convex on $(0, \infty)$ and continuous at zero. The divergence function (5), under the additional restrictions that $\varphi(1) = 0$ and that $\varphi$ is strictly convex at 1, can be used as a distance measure with several special cases, such as:
(a) if $\varphi(u) = (u - 1)^2$, then we obtain Pearson's distance,
(b) if $\varphi(u) = \dfrac{u(u^{\delta - 1} - 1)}{\delta^2 - \delta} + \dfrac{1 - u}{\delta}$, where $\delta = \dfrac{1 + \alpha}{2}$, then we obtain a family of Amari's alpha divergences.
In the general case, before applying a divergence measure we should preprocess the data into nonnegative vectors summing up to one. Because we look for similarity in the signal shapes, this preprocessing is not a significant limitation. Now, we assume that the primary prediction results $X$ are decomposed into basis components $S$ by one of the BSS methods. The basis components are constructive or destructive. To find the destructive ones we make a comparison with the reference noise $z$ from Section 4. The resulting disruptions are compared by the chosen divergence to all the signals $s_i$. If the divergence is symmetric or close to symmetric, $D(z\|s_i) = D(s_i\|z)$, then the signal $s_i$ is similar to the noise $z$ extracted from the target variable and $s_i$ should be removed. As the practical similarity measure we can take the symmetry factor $q$:

$$q = \left| \log \frac{D(z\|s_i)}{D(s_i\|z)} \right|. \tag{6}$$

The rest of the signals are mixed back with the system inverse to the BSS decomposition. The resulting values are the improved predictions (see Fig. 1). The full ensemble method with noise extraction can be presented as the following algorithm:
1. Collect the results of the predictive models into the multivariate variable $X$.
2. Decompose the matrix $X$ with BSS methods into signals $S = [s_i]$, $i = 1, \dots, n$.
3. Extract the random noise characterization $z$ from the target value, e.g. with filtration methods, or derive it from a random generator.
4. Measure the divergence values $D(z\|s_i)$, $D(s_i\|z)$, $i = 1, \dots, n$.
5. Classify the components as destructive if their divergence to the noise is symmetric or close to symmetric, $\tilde{s}_m \in \{ s_i : D(z\|s_i) = D(s_i\|z),\ i = 1, \dots, n \}$, and obtain $S = [s_1, s_2, \dots, s_n]^T = [\hat{s}_1, \hat{s}_2, \dots, \hat{s}_k, \tilde{s}_{k+1}, \dots, \tilde{s}_n]^T$, $S \in \mathbb{R}^{n \times N}$.
6. Eliminate the destructive signals and mix the remaining ones back with the system inverse to the decomposition, $\hat{X} = A\hat{S}$.
7. The resulting values $\hat{X} \in \mathbb{R}^{n \times N}$ are the improved version of the predictions.
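Steps 4 and 5 of the algorithm can be sketched as below. Several choices here are assumptions rather than the authors' procedure: the shift-and-scale preprocessing in `to_distribution`, the chi-square form used for Pearson's distance, and all function names. The text gives no numerical threshold for "close to symmetric", so the sketch only computes q from (6) and leaves the decision rule open.

```python
import math


def to_distribution(x, eps=1e-12):
    """Shift a signal to positive values and scale it to sum to one
    (one possible preprocessing; the exact method is not specified)."""
    lo = min(x)
    shifted = [v - lo + eps for v in x]
    total = sum(shifted)
    return [v / total for v in shifted]


def kl_divergence(z, y):
    """Kullback-Leibler divergence (4) for positive sequences."""
    return sum(zi * math.log(zi / yi) for zi, yi in zip(z, y))


def pearson_divergence(z, y):
    """Pearson's distance in its usual chi-square form, sum (y - z)^2 / z."""
    return sum((yi - zi) ** 2 / zi for zi, yi in zip(z, y))


def symmetry_factor(z, s, divergence=kl_divergence):
    """Symmetry factor q from (6); q = 0 means a perfectly symmetric divergence."""
    d_zs, d_sz = divergence(z, s), divergence(s, z)
    if d_zs == 0.0 and d_sz == 0.0:
        return 0.0                     # identical signals
    return abs(math.log(d_zs / d_sz))
```

A component would then be flagged as destructive when `symmetry_factor(z, s_i)` is small compared with the values obtained for the other components.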
[Figure 1: block diagram with the blocks Input data; Models M1, M2, ..., Mn; Results X1, X2, ..., Xn; BSS stage; Basic components S1, S2, ..., Sn; Decision system; Reference noise extracted from input data; Remixing (BSS^-1); Improved results X̂1, X̂2, ..., X̂n]
Fig. 1. Ensemble method with random noise identification and removal
6
Practical Experiment
The validation test of the proposed concept with noise detection is performed on the real problem of load prediction in the Polish power system. Our task is to forecast the hourly energy consumption in 24 hours based on the energy demand from the last 24 hours and calendar variables: month, day of the month, day of the week, and holiday indicator. We train six MLP neural networks with one hidden layer (with 12, 18, 24, 27, 30 and 33 neurons, respectively). The quality of the results is measured with the MAPE criterion for the following neural networks: M1: MLP12, M2: MLP18, M3: MLP24, M4: MLP27, M5: MLP30, M6: MLP33. For these primary models we build their ensemble with BSS methods. Table 1 presents the final results. The best prediction improvement is obtained after elimination of the components (s4, s5, s6) for PCA and the components (s4, s6) for ICA. Fig. 2a shows the basis components after PCA decomposition. Histograms and autocorrelation functions for the PCA components are presented in Fig. 2b. In the same way, Fig. 3a and Fig. 3b present the components and their histograms and autocorrelation functions after ICA separation.

Table 1. Prediction results for primary models and after BSS methods

                   Models
                   M1      M2      M3      M4      M5      M6
Primary results    2.392   2.365   2.374   2.402   2.409   2.361
PCA                2.304   2.256   2.283   2.274   2.255   2.234
ICA                2.410   2.248   2.395   2.401   2.423   2.384

As we can see, the visual recognition of the noises seems to be a difficult task. In this experiment we want to examine whether it is possible to identify the destructive components with the reference noise and divergence symmetry method. The reference noise was the residuals obtained after decomposition (smoothing) of the target with a moving average (k = 168 = 7 days × 24 hours). In Fig. 4 we can observe the histograms and the autocorrelation functions of the target and of the residuals from the smoothing.

Fig. 2. Basis latent components after PCA (a) and their histograms and autocorrelation functions (b)

Fig. 3. Basis latent components after ICA (a) and their histograms and autocorrelation functions (b)
Fig. 4. Histograms and autocorrelation functions of the target and reference noise

Table 2. Similarity factor q for PCA components and reference noises by KL and Pearson divergence

               PCA component
Noise model    S1      S2      S3      S4      S5      S6
KL             0.261   0.149   0.175   0.411   0.212   0.547
Pearson        1.889   0.579   0.430   2.008   1.178   2.667

Table 3. Similarity factor q for ICA components and reference noises by KL and Pearson divergence

               ICA component
Noise model    S1      S2      S3      S4      S5      S6
KL             0.431   0.308   0.489   0.178   0.182   0.116
Pearson        4.827   3.111   1.165   4.591   3.387   4.420
We applied the Kullback-Leibler and Pearson's divergences to measure the similarity between the signals. The symmetry is measured with the factor q. The final results are presented in Tables 2 and 3. A low value of the q factor indicates high similarity of the signals. The Pearson measure is ineffective for both PCA and ICA, which can be explained by the strongly non-Gaussian signals. However, we can observe results of relatively good quality for both the PCA and the ICA aggregations with the Kullback-Leibler divergence. This result seems reasonable, because BSS methods, and especially ICA, are intended mainly for non-Gaussian signals.
7
Conclusions
The destructive components detection using divergence between latent component and the reference noise can be applied in the context of predictive models
aggregation. The approach is based on the assumption that it is difficult to find a closed analytical form for the distributions of the destructive components. In particular, a Gaussian white noise characteristic might not be reasonable for the ICA decomposition, which is intended in general for non-Gaussian signals (with at most one Gaussian signal). The experiments confirmed the validity of the proposed solutions. However, a number of research and methodological issues are still open. The most important include the proper identification and estimation of the reference noise. Another question is the choice of divergence, where for the moment the Kullback-Leibler divergence, the most popular in statistical analysis, seems adequate.
References
1. Bollerslev, T., Chou, R.Y., Kroner, K.: ARCH Modeling in Finance. Journal of Econometrics 52, 5–59 (1992)
2. Box, G.E.P., Muller, M.E., Jenkins, G.M.: Time Series Analysis Forecasting and Control, 2nd edn. Holden Day, San Francisco (1976)
3. Breiman, L.: Bagging predictors. Machine Learning 24, 123–140 (1996)
4. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing. John Wiley, Chichester (2002)
5. Cichocki, A., Zdunek, R., Phan, A.-H., Amari, S.: Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis. John Wiley, Chichester (2009)
6. Hamilton, J.D.: Time series analysis. Princeton University Press, Princeton (1994)
7. Haykin, S.: Adaptive filter theory, 3rd edn. Prentice-Hall, Upper Saddle River (1996)
8. Hoeting, J., Madigan, D., Raftery, A., Volinsky, C.: Bayesian model averaging: a tutorial. Statistical Science 14, 382–417 (1999)
9. Hyvarinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley, Chichester (2001)
10. MacDonough, R.N., Whalen, A.D.: Detection of signals in noise, 2nd edn. Academic Press, San Diego (1995)
11. Mandelbrot, B.: Multifractals and 1/f noise. Springer, Heidelberg (1997)
12. Nikias, C.L., Shao, M.: Signal Processing with Alpha-Stable Distributions and Applications. John Wiley and Son, Chichester (1995)
13. Peters, E.: Fractal market analysis. John Wiley and Son, Chichester (1996)
14. Popper, K.R.: The Logic of Scientific Discovery. Hutchinson, London (1959)
15. Samorodnitskij, G., Taqqu, M.S.: Stable non-Gaussian random processes: stochastic models with infinite variance. Chapman and Hall, N.York (1994)
16. Shiryaev, A.N.: Essentials of stochastic finance: facts, models, theory. World Scientific, Singapore (1999)
17. Szupiluk, R., Wojewnik, P., Zabkowski, T.: Prediction improvement via smooth component analysis and neural network mixing. In: Kollias, S.D., Stafylopatis, A., Duch, W., Oja, E. (eds.) ICANN 2006. LNCS, vol. 4132, pp. 133–140. Springer, Heidelberg (2006)
18. Therrien, C.W.: Discrete Random Signals and Statistical Signal Processing. Prentice Hall, New Jersey (1992)
19. Vaseghi, S.V.: Advanced signal processing and digital noise reduction. John Wiley and Sons, Stuttgart, Chichester (1997)
Efficient Predictive Control and Set–Point Optimization Based on a Single Fuzzy Model

Piotr M. Marusak

Institute of Control and Computation Engineering, Warsaw University of Technology, ul. Nowowiejska 15/19, 00–665 Warszawa, Poland
Abstract. The idea proposed in the paper consists in a significant simplification of the control structure with a predictive control algorithm and a steady–state target optimization. This is done by applying only one fuzzy (nonlinear) dynamic control plant model for both predictive control and set–point calculation. The approach exploits the possibilities offered by a fuzzy model used by the predictive control algorithm. The fuzzy model is of Takagi–Sugeno type with step responses used as the local models. Such a model can be obtained relatively easily and tuned well using neural networks. The proposed approach, despite the simplification of the control system, offers very good control performance. This is demonstrated using an example of a control system of a nonlinear chemical reactor with inverse response.

Keywords: fuzzy control, fuzzy models, predictive control, nonlinear control, constraints.
1
Introduction
The classical, multilayer control structure in which set–point values for predictive control algorithms are generated by the optimization layer solving a nonlinear optimization problem may be inefficient when the variability of disturbances is comparable with the dynamics of the control plant; see e.g. [1,4,5,12,17,18]. This is caused by the relatively low frequency of intervention of the optimization layer, which results from the computational burden of the nonlinear optimization problem. This drawback of the classical control structure can be eliminated by its appropriate modification. There are, in general, three approaches to do so. The first approach consists in supplementing the classical, multilayer control structure with the steady–state target optimization (SSTO), which is a linear programming problem based on linear approximations of a nonlinear steady–state control plant model; it is executed as often as the predictive control algorithm and recalculates the set–point values; see e.g. [1,4,5,12,17,18]. In the second approach, the optimization problem solved by the predictive control algorithm is integrated with the set–point optimization; see e.g. [5,17,18,19,20]. The third approach consists in using the predictive set–point optimizer; see e.g. [7,16,18]. This paper is focused on the SSTO approach.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 215–224, 2011. © Springer-Verlag Berlin Heidelberg 2011
The basic version of the SSTO is based on the steady–state linear model derived from the dynamic linear model used in the predictive controller [1,4,12]. In the second version, the nonlinear steady–state process model is linearized at each iteration [5,12,17,18]. The second version of the SSTO usually gives better results than the first one. In the paper, it is shown that it can be simplified without harm to the performance of the control system. The idea of the proposed approach consists in a clever use of a dynamic fuzzy (Takagi–Sugeno) control plant model. This model can be tuned, using e.g. neural networks, to contain information about the steady–state properties of the control plant. Then, using the tuned fuzzy model, linear approximations of both the dynamic and the steady–state process models are obtained and used for both control generation and set–point optimization. Thanks to such an approach the control system is simplified without loss of performance, and may work faster than in the case when a nonlinear steady–state model is linearized at each iteration. In the next section the idea of predictive control algorithms is briefly recalled. The Takagi–Sugeno (TS) fuzzy models used in the proposed approach are briefly described in Sect. 3. The set–point optimization problem solved in the optimization layer of the classical multilayer control structure is recalled in Sect. 4. Section 5 is dedicated to the formulation of the SSTO problem. In Sect. 6 example results of the experiments performed in a control system of a nonlinear control plant with inverse response (a CSTR with the van de Vusse reaction) are described, illustrating the efficiency of the proposed approach. The paper is summarized in the last section.
2
Predictive Control Algorithms
Model Predictive Control (MPC) algorithms use, during control generation, information about the predicted behavior of the control system many sampling instants ahead. Moreover, constraints existing in the control system can also be taken into consideration. Thanks to such an approach it is possible to use all available knowledge about the process and the conditions of its operation during calculation of the control action. Predictive control algorithms are usually formulated as the following optimization problem [2,6,14,17]:

$$\min_{\Delta u} \left\{ \sum_{j=1}^{n_y} \sum_{i=1}^{p} \kappa_j \left( \bar{y}_k^j - y_{k+i|k}^j \right)^2 + \sum_{m=1}^{n_u} \sum_{i=0}^{s-1} \lambda_m \left( \Delta u_{k+i|k}^m \right)^2 \right\} \tag{1}$$

subject to:

$$\Delta u_{\min} \le \Delta u \le \Delta u_{\max}, \tag{2}$$

$$u_{\min} \le u \le u_{\max}, \tag{3}$$

$$y_{\min} \le y \le y_{\max}, \tag{4}$$

where $\bar{y}_k^j$ is a set–point value for the $j$-th output, $y_{k+i|k}^j$ is a value of the $j$-th output for the $(k+i)$-th sampling instant predicted at the $k$-th sampling instant
using a control plant model, $\Delta u_{k+i|k}^m$ are future changes in the manipulated variables, $\kappa_j \ge 0$ and $\lambda_m \ge 0$ are weighting coefficients for the predicted control errors of the $j$-th output and for the changes of the $m$-th manipulated variable, respectively, $p$ and $s$ denote the prediction and control horizons, respectively, and $n_y$, $n_u$ denote the numbers of output and manipulated variables, respectively; $y$ is a vector of $(p \cdot n_y)$ elements, composed of the predicted output values $y_{k+i|k}^j$, $\Delta u$ is a vector of $(s \cdot n_u)$ elements, composed of the future increments of manipulated variables $\Delta u_{k+i|k}^m$, $u$ is a vector of $(s \cdot n_u)$ elements, composed of future values of manipulated variables $u_{k+i|k}^m$, and $\Delta u_{\min}$, $\Delta u_{\max}$, $u_{\min}$, $u_{\max}$, $y_{\min}$, $y_{\max}$ are vectors of lower and upper bounds of the changes and values of the control signals and of the values of the output variables, respectively. As a solution to the optimization problem (1–4) the optimal vector of changes in the manipulated variables is obtained. From this vector, the $\Delta u_{k|k}^m$ elements are applied in the control system. Then, the optimization problem is solved again at the next sampling instant. When the future output values $y$ are predicted using a linear model, the optimization problem (1–4) is a well-known, easy to solve quadratic optimization problem. Application of an algorithm based on a linear model to a nonlinear control plant may, however, give unsatisfactory results; operation of the control system may then be improved by using a nonlinear model to predict the behavior of the control plant. Unfortunately, after direct use of a nonlinear control plant model the problem (1–4) becomes a nonlinear, usually non–convex, optimization problem, which is difficult and time-consuming to solve and gives no guarantee of finding the global optimum. Taking these facts into consideration, one may apply approaches that consist in obtaining, at each algorithm iteration, a linear approximation of the nonlinear control plant model. This approximation is then used to formulate a quadratic programming problem; see e.g. [8,9,11,17].
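To make the structure of problem (1–4) concrete, the following sketch solves a heavily reduced case: one output, one manipulated variable, κ = 1, control horizon s = 1, a step-response prediction model, and no output constraints (4). In that scalar case the quadratic cost has a closed-form minimizer, and clipping it to the bounds (2) and (3) gives the constrained optimum. The function name and arguments are hypothetical, and this is only an illustration, not the algorithm used in the paper.

```python
def single_move_mpc(step_response, free_response, set_point, lam,
                    u_prev, du_bounds, u_bounds):
    """Control increment minimizing
        sum_i (set_point - y0_i - a_i * du)^2 + lam * du^2
    for a single move du, then clipped to the bounds on du and on u.
    step_response: a_1..a_p, free_response: y0_1..y0_p (no further moves)."""
    errors = [set_point - y0 for y0 in free_response]
    numerator = sum(a * e for a, e in zip(step_response, errors))
    denominator = sum(a * a for a in step_response) + lam
    du = numerator / denominator            # unconstrained minimizer
    # Bounds on u translate into bounds on du around the previous value.
    lo = max(du_bounds[0], u_bounds[0] - u_prev)
    hi = min(du_bounds[1], u_bounds[1] - u_prev)
    return min(max(du, lo), hi)
```

With longer control horizons, several variables or output constraints, the problem no longer has this scalar closed form and a quadratic programming solver is needed, as stated in the text.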
3
Takagi–Sugeno Models for Efficient Predictive Control and Set–Point Optimization
Obtaining a linear approximation of a dynamic control plant model is especially easy when fuzzy TS models are used; see e.g. [8,10]. The approach proposed in the paper exploits a TS fuzzy model with step responses used as the local models:

Rule $f$:
if $y_k^{j_y}$ is $B_1^{f,j_y}$ and $\dots$ and $y_{k-n+1}^{j_y}$ is $B_n^{f,j_y}$ and $u_k^{j_u}$ is $C_1^{f,j_u}$ and $\dots$ and $u_{k-m+1}^{j_u}$ is $C_m^{f,j_u}$
then

$$y_k^{j,f} = \sum_{m=1}^{n_u} \left( \sum_{n=1}^{p_d - 1} a_n^{j,m,f} \cdot \Delta u_{k-n}^m + a_{p_d}^{j,m,f} \cdot u_{k-p_d}^m \right), \tag{5}$$

where $y_k^{j_y}$ is the $j_y$-th output variable value at the $k$-th sampling instant, $u_k^{j_u}$ is the $j_u$-th manipulated variable value at the $k$-th sampling instant, $B_1^{f,j_y}, \dots, B_n^{f,j_y}$,
$C_1^{f,j_u}, \dots, C_m^{f,j_u}$ are fuzzy sets, $a_n^{j,m,f}$ are the coefficients of the step responses in the $f$-th local model, $j_y = 1, \dots, n_y$, $j_u = 1, \dots, n_u$, $f = 1, \dots, l$, and $l$ is the number of rules. The output of the fuzzy model is calculated using the following formula:

$$y_k^j = \sum_{m=1}^{n_u} \left( \sum_{n=1}^{p_d - 1} a_n^{j,m} \cdot \Delta u_{k-n}^m + a_{p_d}^{j,m} \cdot u_{k-p_d}^m \right), \tag{6}$$

where $a_n^{j,m} = \sum_{f=1}^{l} w_f \cdot a_n^{j,m,f}$ and $w_f$ are the normalized weights calculated using fuzzy reasoning for the current values of the inputs and outputs of the control plant; see e.g. [15]. The model (5) can be obtained in a relatively simple way. It is sufficient to collect a few step responses of the control plant for a few operating points. The premise part of the model can be designed using expert knowledge, simulation experiments, fuzzy neural networks or all the mentioned techniques combined. Note that the model (6) may be interpreted as the step response of the control plant describing its behavior near the current operating point. This model may be used to predict the behavior of the control plant in the same way as in the DMC algorithm [2,6,14,17]. Such an approach leads to the formulation of the optimization problem (1–4), solved by the algorithm at each iteration, as a quadratic optimization problem [8].
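The blending of local step responses that leads from (5) to (6) can be sketched as follows for a single input and a single output. The Gaussian membership functions, the single premise variable `x` and all names are assumptions made for illustration; the paper tunes its membership functions with a fuzzy neural network and does not prescribe their shape here.

```python
import math


def normalized_weights(x, centers, width):
    """Normalized rule activations w_f for a scalar premise variable x,
    using Gaussian membership functions (an assumed shape)."""
    raw = [math.exp(-((x - c) / width) ** 2) for c in centers]
    total = sum(raw)
    return [r / total for r in raw]


def blend_step_responses(weights, local_responses):
    """Coefficients a_n = sum_f w_f * a_n^f of the model (6)."""
    horizon = len(local_responses[0])
    return [sum(w * resp[n] for w, resp in zip(weights, local_responses))
            for n in range(horizon)]
```

The last coefficient of the blended response is the local gain, which is the quantity reused later for the steady–state approximation.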
4
Optimization Layer
In the classical multilayer control system the desired set–points are calculated by the optimization layer. The optimization problem solved in this layer usually has the following form (having economic meaning) [17]:

$$\min_{y,u} J_E(y, u) = c_u^T u - c_y^T y \tag{7}$$

subject to:

$$u_{\min} \le u \le u_{\max}, \tag{8}$$

$$y_{\min} \le y \le y_{\max}, \tag{9}$$

$$y = F(u, d), \tag{10}$$

where $F: \mathbb{R}^{n_u} \times \mathbb{R}^{n_d} \to \mathbb{R}^{n_y}$, $F \in C^1$, is a nonlinear, steady–state control plant model, $c_u \in \mathbb{R}^{n_u}$ and $c_y \in \mathbb{R}^{n_y}$ are prices, $n_d$ is the number of disturbances affecting the control plant, $y$ is a vector of length $n_y$ of the set–point values, $u$ is a vector of length $n_u$ of control values corresponding to the set–point values $y$, calculated using the steady–state plant model, $d$ is an estimate of the disturbances, $u_{\min}$, $u_{\max}$ are vectors of lower and upper bounds of the manipulated variables, $y_{\min}$, $y_{\max}$ are vectors of lower and upper bounds of the output values, and $J_E(y, u)$ is a performance function. The optimal solution of the optimization problem (7–10) is passed to the control algorithms as the desired set–point values.
The problem presented above is usually nonlinear, so solving it is usually time-consuming. Therefore, it is solved less often than the controllers act.
5
Steady–State Target Optimization
When the variability of disturbances is comparable with the dynamics of the control plant, application of the classical multilayer control structure, with its low frequency of intervention of the optimization layer, may bring results far from optimal. One of the solutions to this problem is to supplement the control structure with a steady–state target optimization (SSTO) [1,4,5,12,17]. The first version of the SSTO was based on a linear approximation of the steady–state process model obtained from the linear dynamic control plant model used by the predictive control algorithm [1,4,12]:

$$\min_{y,u} J_E(y, u) = c_u^T u - c_y^T y \tag{11}$$

subject to:

$$u_{\min} \le u \le u_{\max}, \tag{12}$$

$$y_{\min} \le y \le y_{\max}, \tag{13}$$

$$y = Hu + C(k), \tag{14}$$

where $C(k) = y^0(k+N|k) - Hu(k-1)$ and $y^0(k+N|k)$ is the value of the predicted free output trajectory at the end of the prediction horizon. It should be noticed that the matrix $H$ in the linear approximation of the steady–state process model (14) is constant in time. It is thus intuitive that for highly nonlinear processes the performance of the control system may be improved using a linear approximation of the steady–state process model obtained by linearization of the original, nonlinear steady–state process model [5,17]:

$$y = H(k)u + C(k), \tag{15}$$

where the matrix $H(k) = \left[ \dfrac{\partial F(u, d)}{\partial u_1} \ \dots \ \dfrac{\partial F(u, d)}{\partial u_{n_u}} \right]$ is updated at each SSTO iteration and $C(k) = F(u(k-1), d) - H(k)u(k-1)$. The derivatives that are the elements of the matrix $H(k)$ are usually computed numerically. Another version of the SSTO is proposed in the paper. It is based on a linear approximation of the steady–state process model obtained at each iteration from a dynamic fuzzy TS model of the control plant. With this approach the numerical linearization (calculation of the derivatives) of the nonlinear steady–state process model is avoided. Thanks to the use of the fuzzy model with local models in the form of step responses, derivation of the $H(k)$ matrix is especially easy. This is because
elements of the $H(k)$ matrix are the last elements $a_{p_d}^{j,m}$ of the step responses, which are in fact gain coefficients of the control plant. The matrix of gains is described by the following formula:

$$H(k) = \begin{bmatrix} a_{p_d}^{1,1} & \dots & a_{p_d}^{1,n_u+n_d} \\ \vdots & \ddots & \vdots \\ a_{p_d}^{n_y,1} & \dots & a_{p_d}^{n_y,n_u+n_d} \end{bmatrix}. \tag{16}$$

The matrix $H(k)$ is obtained as a result of calculating a linear approximation of the dynamic process model for the predictive control algorithm. In order to obtain proper values of the gains (a good approximation of the steady–state properties of the process), the dynamic fuzzy model should be tuned as well as possible, using e.g. fuzzy neural networks, as was done in the current research and is discussed in the next section.
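For one manipulated variable and one output, the linear SSTO problem (11)–(13) with the model (15) reduces to choosing a point on an interval, which the sketch below solves directly. The function name, its arguments and the numbers in the usage note are hypothetical; a general multivariable case would be passed to a linear programming solver instead.

```python
def ssto_siso(h, c, cu, cy, u_bounds, y_bounds):
    """Solve  min cu*u - cy*y  s.t.  y = h*u + c  and box bounds on u and y
    (scalar case). Returns (u, y), or None if the constraints are infeasible."""
    u_lo, u_hi = u_bounds
    y_lo, y_hi = y_bounds
    # Translate the output bounds into bounds on u through y = h*u + c.
    if h > 0:
        u_lo, u_hi = max(u_lo, (y_lo - c) / h), min(u_hi, (y_hi - c) / h)
    elif h < 0:
        u_lo, u_hi = max(u_lo, (y_hi - c) / h), min(u_hi, (y_lo - c) / h)
    elif not (y_lo <= c <= y_hi):
        return None
    if u_lo > u_hi:
        return None
    slope = cu - cy * h          # the objective equals slope*u + constant
    u = u_lo if slope >= 0 else u_hi
    return u, h * u + c
```

For example, with an assumed gain h = 0.005, offset c = 0.9485, prices cu = -1 and cy = 0, and bounds 0 ≤ u ≤ 60 and 1.1 ≤ y ≤ 1.2, the optimum sits on the upper output bound, at u = 50.3 and y = 1.2.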
6
Simulation Experiments
The control plant under consideration is an isothermal continuous stirred tank reactor (CSTR) in which a van de Vusse reaction is carried out (Fig. 1a) [3]. Steady–state characteristics of the control plant are shown in Fig. 1b. The process model of the reactor contains two composition balance equations:

$$\frac{dC_A}{dt} = -k_1 \cdot C_A - k_3 \cdot C_A^2 + \frac{F}{V}\,(C_{Af} - C_A), \qquad \frac{dC_B}{dt} = k_1 \cdot C_A - k_2 \cdot C_B - \frac{F}{V}\,C_B, \tag{17}$$

where $C_A$, $C_B$ are the concentrations of components A and B, respectively, $F$ is the inlet flow rate (equal to the outlet flow rate), $V$ is the volume in which the reaction takes place (it is assumed constant and $V = 1$ l), and $C_{Af}$ is the concentration
Fig. 1. a) Diagram of an isothermal CSTR with van de Vusse reaction; b) Steady–state characteristics of the control plant
of component A in the inlet flow stream (it is assumed that $C_{Af} = 10$ mol/l). The values of the parameters are: $k_1 = 50$ 1/h, $k_2 = 100$ 1/h, $k_3 = 10$ l/(h · mol). The output variable is the concentration $C_B$ of substance B, the manipulated variable is the inlet flow rate $F$ of the raw substance, and the concentration $C_{Af}$ is the disturbance variable; it is assumed that it changes according to the formula

$$C_{Af} = C_{Af0} - \sin\left(\frac{2\pi}{0.4}\,t\right), \tag{18}$$

where $C_{Af0} = 10$ mol/l. The following performance index of the set–point optimization problem was assumed:

$$J_E = -F. \tag{19}$$

The manipulated variable is constrained:

$$0\ \text{l/h} \le F \le 60\ \text{l/h}. \tag{20}$$

It was also assumed that the product should fulfill the following purity criterion:

$$1.1\ \text{mol/l} \le C_B \le 1.2\ \text{mol/l}. \tag{21}$$
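The steady–state behaviour implied by the balance equations (17) can be checked with a short sketch. Setting both derivatives to zero gives a quadratic equation for $C_A$ and then a direct expression for $C_B$; the parameter values are those given in the text, while the function names and the finite-difference step are assumptions of this example.

```python
import math

K1, K2, K3 = 50.0, 100.0, 10.0   # 1/h, 1/h, l/(h*mol), from the text
V = 1.0                          # reactor volume in l


def steady_state(F, CAf=10.0):
    """Steady-state concentrations (CA, CB) of model (17) for flow rate F."""
    q = F / V
    # 0 = -K1*CA - K3*CA**2 + q*(CAf - CA): take the nonnegative root.
    b = K1 + q
    CA = (-b + math.sqrt(b * b + 4.0 * K3 * q * CAf)) / (2.0 * K3)
    # 0 = K1*CA - K2*CB - q*CB
    CB = K1 * CA / (K2 + q)
    return CA, CB


def steady_state_gain(F, CAf=10.0, h=1e-4):
    """Central-difference estimate of dCB/dF, i.e. a numerically computed
    entry of H(k) as used when the steady-state model is linearized."""
    return (steady_state(F + h, CAf)[1] - steady_state(F - h, CAf)[1]) / (2.0 * h)
```

For F = 20, 34.3 and 50 l/h this gives values close to the operating points P1 to P3 quoted in this section, for instance roughly $C_A \approx 3.0$ mol/l and $C_B \approx 1.12$ mol/l at F = 34.3 l/h.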
For the presented control plant a fuzzy predictive algorithm was designed. The sampling period was assumed equal to $T_s = 3.6$ s; the tuning parameters were as follows: $p = 70$, $s = 35$, $\lambda = 0.001$. The fuzzy TS process model is based on step responses obtained near the following operating points:
P1) $C_{B0} = 0.91$ mol/l, $C_{A0} = 2.18$ mol/l, $F = 20$ l/h;
P2) $C_{B0} = 1.12$ mol/l, $C_{A0} = 3$ mol/l, $F = 34.3$ l/h;
P3) $C_{B0} = 1.22$ mol/l, $C_{A0} = 3.66$ mol/l, $F = 50$ l/h.
The fuzzy TS model was tuned using a fuzzy neural network (FNN). This was done in such a way that the obtained dynamic fuzzy model can also be used to obtain a good linear approximation of the nonlinear steady–state process model. The premises were modeled in a standard way; see e.g. [13]. The consequents were simply the gain coefficients of the obtained step responses. The structure of the applied FNN is thus very simple. The idea is to tune the membership functions taking into consideration the steady–state characteristic of the control plant. The value of the performance index $J = (y^{FNN} - y^{SSM})^T \cdot (y^{FNN} - y^{SSM})$ (where $y^{FNN}$ and $y^{SSM}$ are the vectors of 400 elements containing the output values obtained from the FNN and from the steady–state process model, respectively), calculated after the tuning, was equal to $J = 0.1456$. The obtained, normalized membership functions are shown in Fig. 2. The experiments were performed in three control systems:
1) with the MPC algorithm and SSTO based on a linear model (LSSTO),
2) with the fuzzy model used in the MPC algorithm and with successive linearization of the nonlinear steady–state model (NSSTO),
3) with the fuzzy TS model used in both the MPC algorithm and SSTO, exploiting the approach proposed in the paper (FSSTO). The obtained results are shown in Figs. 3 and 4. Using the steady–state model approximation obtained from the fuzzy dynamic control plant model gave results very close to those obtained with linearization of the nonlinear steady–state process model. The difference between the responses generated by the control systems with NSSTO and FSSTO is small. The values of the performance index (calculated as the sum of the instantaneous values of the assumed economic performance index) obtained in these two control systems are equal to −9582.71 and −9589.99, respectively (the smaller the value, the better). Moreover, the calculation of set–points in the structure based only on the fuzzy process model is significantly simplified. The control system with LSSTO (the standard SSTO) and the MPC algorithm based on a linear model offers
Fig. 2. Normalized membership functions of the fuzzy model
Fig. 3. Results of an example experiment obtained in the control system with: LSSTO – dotted line, NSSTO – dashed line, FSSTO – solid line; left – output variable CB , right – manipulated variable F
Fig. 4. Results of an example experiment obtained in the control system with: LSSTO – dotted line, NSSTO – dashed line, FSSTO – solid line; left – output variable reference signals $C_B$, right – disturbance variable $C_{Af}$
the worst control performance (the value of the performance index is equal to −9404.17 in this case). Thus, using a well-tuned fuzzy model is very important for the performance of the control system.
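The quantities that define the example can be written down compactly. The following is only a minimal sketch: it reads the disturbance formula (18) as a sine with period 0.4, assumes that time is measured in hours (the text gives rates in 1/h and l/h but does not state the time unit of (18)), and uses function names that are purely illustrative.

```python
import math

C_AF0 = 10.0          # mol/l, nominal inlet concentration (from the text)
PERIOD_H = 0.4        # period of the sinusoidal term in (18); hours are an assumption
TS_H = 3.6 / 3600.0   # sampling period Ts = 3.6 s expressed in hours


def c_af(t_hours):
    """Disturbance (18), read here as C_Af = C_Af0 - sin(2*pi*t / 0.4)."""
    return C_AF0 - math.sin(2.0 * math.pi * t_hours / PERIOD_H)


def is_feasible(flow, c_b):
    """Check the input constraint (20) and the purity constraint (21)."""
    return 0.0 <= flow <= 60.0 and 1.1 <= c_b <= 1.2


def accumulated_index(flows):
    """Sum of the instantaneous economic index J_E = -F along a trajectory."""
    return -sum(flows)


# Number of sampling instants in one disturbance period (under the hour assumption).
samples_per_period = round(PERIOD_H / TS_H)
```

Under these assumptions, `is_feasible(34.3, 1.12)` holds for operating point P2, while P3 (`F = 50`, `C_B = 1.22`) lies outside the purity band (21).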
7 Summary
The problem of simplifying control systems with SSTO was addressed in the paper. The proposed solution consists in using the dynamic fuzzy control plant model for both control generation and set–point optimization. The linear approximation of the steady–state process model is obtained at each iteration using the dynamic fuzzy TS control plant model. Using an example control system of a control plant with difficult dynamics, it is shown that the proposed approach may generate solutions very close to those obtained in control systems in which, at each SSTO iteration, the original nonlinear steady–state process model is linearized. The proposed approach exploits fuzzy Takagi–Sugeno models with step responses used as the local models. Such models can be obtained relatively easily and tuned using fuzzy neural networks, which makes it very easy to obtain a linear approximation of the steady–state process model. Thus, the proposed approach may be applied relatively easily in existing control systems to improve their performance.
Acknowledgment. This work was supported by the Polish national budget funds for science 2009–2011 as a research project.
References
1. Blevins, T.L., McMillan, G.K., Wojsznis, W.K., Brown, M.W.: Advanced Control Unleashed. ISA (2003)
2. Camacho, E.F., Bordons, C.: Model Predictive Control. Springer, Heidelberg (1999)
3. Doyle, F., Ogunnaike, B.A., Pearson, R.K.: Nonlinear model–based control using second–order Volterra models. Automatica 31, 697–714 (1995)
4. Kassmann, D.E., Badgwell, T.A., Hawkins, R.B.: Robust Steady-State Target Calculation for Model Predictive Control. AIChE Journal 46, 1007–1024 (2000)
5. Lawrynczuk, M., Marusak, P., Tatjewski, P.: Cooperation of model predictive control with steady–state economic optimisation. Control and Cybernetics 37, 133–158 (2008)
6. Maciejowski, J.M.: Predictive control with constraints. Prentice Hall, Harlow (2002)
7. Marusak, P., Tatjewski, P.: Actuator fault tolerance in control systems with predictive constrained set-point optimizers. International Journal of Applied Mathematics and Computer Science 18, 539–551 (2008)
8. Marusak, P.: Advantages of an easy to design fuzzy predictive algorithm in control systems of nonlinear chemical reactors. Applied Soft Computing 9, 1111–1125 (2009)
9. Marusak, P., Tatjewski, P.: Effective dual–mode fuzzy DMC algorithms with online quadratic optimization and guaranteed stability. International Journal of Applied Mathematics and Computer Science 19, 127–141 (2009)
10. Marusak, P.: Efficient model predictive control algorithm with fuzzy approximations of nonlinear models. In: Kolehmainen, M., Toivanen, P., Beliczynski, B. (eds.) ICANNGA 2009. LNCS, vol. 5495, pp. 448–457. Springer, Heidelberg (2009)
11. Morari, M., Lee, J.H.: Model predictive control: past, present and future. Computers and Chemical Engineering 23, 667–682 (1999)
12. Qin, S.J., Badgwell, T.: A survey of industrial model predictive control technology. Control Engineering Practice 11, 733–764 (2003)
13. Piegat, A.: Fuzzy modelling and control. Physica-Verlag, Heidelberg (2001)
14. Rossiter, J.A.: Model–Based Predictive Control. CRC Press, Boca Raton (2003)
15. Takagi, T., Sugeno, M.: Fuzzy identification of systems and its application to modeling and control. IEEE Trans. Systems, Man and Cybernetics 15, 116–132 (1985)
16. Saez, D., Cipriano, A., Ordys, A.W.: Optimisation of Industrial Processes at Supervisory Level: Application to Control of Thermal Power Plants. Springer, London (2002)
17. Tatjewski, P.: Advanced Control of Industrial Processes; Structures and Algorithms. Springer, London (2007)
18. Tatjewski, P.: Supervisory predictive control and on–line set–point optimization. International Journal of Applied Mathematics and Computer Science 20, 483–495 (2010)
19. Zanin, A., Tvrzska de Gouvea, M., Odloak, D.: Industrial implementation of a real–time optimization strategy for maximizing production of LPG in a FCC unit. Computers and Chemical Engineering 24, 525–531 (2000)
20. Zanin, A., Tvrzska de Gouvea, M., Odloak, D.: Integrating real–time optimization into model predictive controller of the FCC system. Computers and Chemical Engineering 10, 819–831 (2002)
Wind Turbines States Classification by a Fuzzy-ART Neural Network with a Stereographic Projection as a Signal Normalization
Tomasz Barszcz¹, Marzena Bielecka², Andrzej Bielecki³, and Mateusz Wójcik⁴
1 Chair of Robotics and Mechatronics, Faculty of Mechanical Engineering and Robotics, AGH University of Science and Technology, Al. Mickiewicza 30, 30-059 Kraków, Poland
2 Chair of Geoinformatics and Applied Computer Science, Faculty of Geology, Geophysics and Environmental Protection, AGH University of Science and Technology, Al. Mickiewicza 30, 30-059 Kraków, Poland
3 Institute of Computer Science, Faculty of Mathematics and Computer Science, Jagiellonian University, Nawojki 11, 30-072 Kraków, Poland
4 Department of Computer Design and Graphics, Faculty of Physics, Astronomy and Applied Computer Science, Jagiellonian University, Reymonta 4, 30-059 Kraków, Poland
[email protected],
[email protected],
[email protected],
[email protected]
Abstract. In this paper the classification of wind turbine operational states is considered. The fuzzy-ART neural network is proposed as the classifying system. The stereographic projection is introduced as a normalization procedure for the input signals. The theoretical justification is discussed and the results of experiments are presented. It turns out that the introduced normalization procedure improves the classification results.
Keywords: wind turbines operational states classification, fuzzy-ART neural network, input signals normalization, stereographic projection.
1 Introduction
In recent years wind energy has been the fastest growing branch of the power generation industry, not only in the world [7] but also in the European Union [1], including Poland [13]. The largest cost of a wind turbine is its maintenance.
The paper was supported by the Polish Ministry of Science and Higher Education under Grant No. N504 147838.
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 225–234, 2011. © Springer-Verlag Berlin Heidelberg 2011
A common technique to decrease this cost is remote monitoring [11]. The growing number of monitored turbines requires an automated way of supporting diagnostic experts. Classification of wind turbine operational states is the basic stage of analyzing data obtained from monitoring. In classification algorithms, the data describing the features taken into consideration by the classifier must very often be preprocessed. Input signal normalization is an example of such preprocessing. Classification is nontrivial when the number of classes is not known a priori. In such cases, certain types of artificial neural networks (ANNs) can be applied to solve the classification problem. In ART-type neural networks, introduced by Carpenter and Grossberg [4,5], the learning process is not separated from operation. Furthermore, this sort of ANN is capable of adding new states when necessary. Therefore, they were tested as a tool for classification of operational states in continuous monitoring of wind turbines [3]. This paper is a continuation of the studies presented there, where the obtained results were promising but far from satisfactory. It turns out that input signal normalization based on the stereographic projection proposed in [2] improves the results obtained in [3]. In this paper we consider the application of the fuzzy-ART artificial neural network to finding clusters in data describing wind turbine operational states. This type of ANN was proposed by Carpenter, Grossberg and Rosen [6] and was then intensively studied [10,12], including its clustering capabilities [8]. We use the stereographic projection to preprocess the input signals. The paper is organized in the following way. In Section 2 the classification problem of wind turbine states is discussed. The architecture of the fuzzy ART neural network is recalled in Section 3, whereas in Section 4 the stereographic projection is presented. Results of experiments are presented in Section 5.
2 Classification Problem of Wind Turbine States
In recent years monitoring and diagnostic technologies for wind turbines have developed considerably [11]. The growing number of installed systems has created the need to analyze the gigabytes of data they produce every day. Apart from the development of several advanced diagnostic methods for this type of machinery, there is a need for a group of methods that act as an "early warning". The idea of this approach could be based on a data-driven algorithm that decides on the similarity of current data to data that are already known. In other words, the data from the turbine should be assigned to one of the known states. If this is a state describing a failure, the human expert should be alerted. If this is an unknown state, the expert should be informed about the situation and asked to define the new state. It seems that there are no works, apart from our own [3], that consider the application of ANNs to the classification of wind turbine states. As ART neural networks are capable of performing efficient classification and of recognizing new states when necessary [4,5], we carried out research on an initial classification task. The goal of the experiment was to verify the ART classification capabilities in comparison with a human expert. This type of data is acquired in the majority of cases, and a successful classifier should create a reasonable number of classes, similar to those created by a human expert. According to the results presented here (see Section 5), transforming the input signal using the stereographic projection improves the classification results. With an accurate classification, it is possible to filter out states that are known to be correct. The expert can then focus only on the "suspicious" states returned by the algorithm.
3 Fuzzy ART Network
The organization of a fuzzy ART network, introduced by Carpenter et al. [6], is presented in Fig. 1. In comparison with a classical ART-2 network, the fuzzy ART network has an additional layer F0, which transforms input vectors using so-called complement coding. The fuzzy ART network has a single weight matrix Z, which processes signals sent both from the F1 to the F2 layer and vice versa. The operations performed by the network are based on fuzzy logic. The fuzzy AND operator is used to compare two vectors. Signal processing in a fuzzy ART network is similar to processing in an ART-2 network. The input signal vector, say X, is transformed by the F0 and F1 layers, producing signals Tj, which are fed to the inputs of the F2 layer. For the most strongly excited neuron, a vigilance parameter ρ is used to check the similarity between the output pattern from the F2 layer and the input pattern. The vigilance parameter has considerable influence on the system: higher vigilance produces highly detailed memories (many fine-grained categories), while lower vigilance results in more general memories (fewer, more general categories). If the degree of similarity p between the current input pattern and the best fitting prototype is at least as high as the vigilance, this prototype is chosen to represent the cluster containing the input. This means that if p < ρ, the winning neuron is inhibited and another neuron in the F2 layer is searched for. Otherwise, the weight matrix Z is modified in order to store the features of the recognized vector. The learning process continues until the values of the matrix Z stabilize.
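The processing cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments: the choice function $T_j = |I \wedge w_j| / (\alpha + |w_j|)$ and the update $w_j \leftarrow \beta (I \wedge w_j) + (1 - \beta) w_j$ follow the standard fuzzy ART formulation of [6], while the class name, the method names and the default values of `alpha` and `beta` are assumptions made for the example.

```python
class FuzzyART:
    """Minimal fuzzy ART sketch; inputs are assumed to lie in [0, 1]^n."""

    def __init__(self, rho, alpha=0.001, beta=1.0):
        self.rho = rho        # vigilance parameter
        self.alpha = alpha    # choice parameter (assumed default)
        self.beta = beta      # learning rate; 1.0 corresponds to fast learning
        self.weights = []     # rows of Z, one per committed F2 category

    @staticmethod
    def complement_code(x):
        # F0 layer: x -> (x, 1 - x)
        return list(x) + [1.0 - v for v in x]

    @staticmethod
    def fuzzy_and(a, b):
        # Component-wise minimum
        return [min(u, v) for u, v in zip(a, b)]

    def _choice(self, i, w):
        # Choice signal T_j sent to the F2 layer
        return sum(self.fuzzy_and(i, w)) / (self.alpha + sum(w))

    def train_one(self, x):
        """Present one input and return the index of the category chosen for it."""
        i = self.complement_code(x)
        # Visit categories from the most to the least strongly excited one.
        order = sorted(range(len(self.weights)),
                       key=lambda j: -self._choice(i, self.weights[j]))
        for j in order:
            overlap = self.fuzzy_and(i, self.weights[j])
            p = sum(overlap) / sum(i)   # degree of similarity
            if p >= self.rho:           # vigilance test passed: learn
                self.weights[j] = [self.beta * o + (1.0 - self.beta) * w
                                   for o, w in zip(overlap, self.weights[j])]
                return j
            # otherwise the winner is inhibited and the search continues
        self.weights.append(i)          # no category fits: commit a new one
        return len(self.weights) - 1


data = [[0.1, 0.1], [0.12, 0.1], [0.9, 0.9]]
strict = FuzzyART(rho=0.8)
labels_strict = [strict.train_one(x) for x in data]
loose = FuzzyART(rho=0.1)
labels_loose = [loose.train_one(x) for x in data]
```

With the high vigilance the third point is placed in a category of its own, whereas with the low vigilance all three points share one category, which illustrates the influence of ρ on the granularity of the categories.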
4 Input Signals Normalization
A normalization procedure corresponds to finding a mapping $F : \mathbb{R}^n \supset A \ni x \mapsto \hat{x} \in \mathbb{R}^k$, where $\|\hat{x}\| = 1$. The most commonly used normalization is done according to the formula $\hat{x} = \frac{x}{\|x\|}$. This formula defines the projection $\Pi : \mathbb{R}^n \setminus \{0\} \to S^{n-1} \subset \mathbb{R}^n$,
Fig. 1. Fuzzy ART architecture
Fig. 2. Simple projection of R2
call it a simple projection, of $\mathbb{R}^n \setminus \{0\}$ onto the $(n-1)$-dimensional sphere $S^{n-1}$; see Fig. 2 for $n = 2$. A simple projection has crucial drawbacks. First of all, the dimension is reduced. Secondly, the projection is not defined on the whole space: the mapping is undefined at 0. Furthermore, the space $\mathbb{R}^n$, which has infinite measure, is projected onto a sphere of finite measure. Additionally, the projection is not an injective mapping: if two points, say $u$ and $w$, lie on the same radial line, then $\Pi(u) = \Pi(w)$ (see Fig. 2). For the considered problem this means that if two data clusters are situated along the same radial direction then, after normalization, they cannot be separated even if they were well separated before normalization. Therefore, this method should be used only if it is known a priori that the clusters in the input signal space are situated in different radial directions.
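The loss of injectivity is easy to reproduce numerically. The short sketch below (the function name and the sample points are chosen only for illustration) normalizes two points lying on the same ray from the origin and shows that the simple projection cannot be evaluated at the origin.

```python
import math


def simple_projection(x):
    """Map x to x / ||x||; undefined at the origin."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        raise ValueError("the simple projection is undefined at the origin")
    return [v / norm for v in x]


u = [1.0, 2.0]
w = [3.0, 6.0]   # same radial direction as u, three times farther away
pu = simple_projection(u)
pw = simple_projection(w)
same_image = all(math.isclose(a, b) for a, b in zip(pu, pw))
```

Here `same_image` is true up to rounding: the two points, although well separated in the plane, become indistinguishable after normalization.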
Fig. 3. Stereographic projection of R2
Therefore, a normalization that does not reduce the dimension of the input signal space is sometimes applied. The stereographic projection $S : \mathbb{R}^n \to S^n \subset \mathbb{R}^{n+1}$ is an example of such a mapping. The geometric interpretation of the stereographic projection is visualized in Fig. 3 for the two-dimensional case. The stereographic projection is given explicitly by algebraic formulae for each natural $n$; see [9], page 73. Let $P = (x_1, ..., x_n)$. Then $S(P) = \tilde{P} = (\tilde{x}_1, ..., \tilde{x}_{n+1})$ is given as
$$\tilde{x}_i = \frac{4 x_i}{4 + s} \ \text{ for } i = 1, ..., n; \qquad \tilde{x}_{n+1} = \frac{s - 4}{4 + s},$$
where $s := \sum_{i=1}^{n} x_i^2$. As already mentioned, the stereographic projection preserves the dimension of the transformed space and is defined on the whole $\mathbb{R}^n$. Furthermore, it is an injective mapping, i.e. if $u \neq v$, $u, v \in \mathbb{R}^n$, then $S(u) \neq S(v)$. However, it transforms a space of infinite measure into a space of finite measure. This implies, among other things, that points far from each other in $\mathbb{R}^n$ can be close to each other on $S^n$. Therefore, two clusters that are well separated in $\mathbb{R}^n$ may be hard to separate after normalization. However, this can only happen if the clusters are far from the origin of the coordinate system, in which case they are mapped close to the north pole of the sphere. Since, in practice, the norms of the transformed vectors are bounded, the minimal distance between clusters after signal normalization can be estimated.
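The formulae above translate directly into code. The following sketch (function names and sample points are illustrative assumptions) checks the three properties discussed in this section: the image has unit norm, points on the same radial line remain distinct, and points far from the origin are mapped close to the north pole and hence close to each other.

```python
import math


def stereographic(x):
    """Stereographic projection of x in R^n onto the unit sphere in R^(n+1)."""
    s = sum(v * v for v in x)
    return [4.0 * v / (4.0 + s) for v in x] + [(s - 4.0) / (4.0 + s)]


def distance(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


origin_image = stereographic([0.0, 0.0])     # the south pole (0, 0, -1)
u_image = stereographic([1.0, 2.0])
w_image = stereographic([3.0, 6.0])          # same radial line as u
radial_gap = distance(u_image, w_image)      # stays clearly positive

far_a = [1000.0, 0.0]
far_b = [0.0, 1000.0]
gap_before = distance(far_a, far_b)          # about 1414
gap_after = distance(stereographic(far_a), stereographic(far_b))
```

For the two distant points the separation shrinks from roughly 1414 to well below 0.01, which is the effect that makes scaling of the inputs necessary before the projection is applied.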
5 Results
The practical case study was performed on data from one wind turbine. The data, covering the period from 11.09.2009 to 30.09.2009, were recorded every 10 minutes by the online monitoring system. The recorded data were instantaneous values and were not averaged. The data set included 2869 measurements. As the main goal of the work was to test the applicability of ART-type ANNs for classification, in the first step we tried to use the network to achieve results similar to those of a human
expert. Other training was not possible, as this type of network performs only unsupervised learning. The data set contained the most fundamental values that determine the operational state of the machine. These were: wind speed (variable x1), rotational speed of the generator (variable x2) and power generated by the turbine (variable x3, the vertical axis in Figures 4-7). They are related, but only to some extent; in fact, they are all independent variables. The selection of variables is the same as a human expert would use. Typically, the operation of a wind turbine can be divided into a few distinct states: stopped, transient between 0 and 1000 rotations per minute (rpm), idle load (about 1000 rpm, no load), low power, and high power. Sometimes it is not necessary to distinguish all of them, and the first two pairs are regarded as only two states (i.e. "stopped or transient" and "low power", which also includes the idle mode). A very important advantage of the chosen set is that it has only 3 variables and can be presented graphically. Thus, the results can be easily understood and compared with those of a human expert. The results create a basis and give some intuition for more advanced research. The main idea of the research was to apply the recorded data to the fuzzy-ART network and to investigate its behavior, i.e. how many states are created and how this depends on the network parameters. In all experiments the vigilance parameter ρ belongs to the interval [0, 1]. As already mentioned, this paper is a continuation of the studies described in [3]. ART-2 and fuzzy-ART neural networks were used for classification of wind turbine operational states. Examples of clustering performed by the fuzzy-ART network for vigilance parameters ρ = 0.55 and ρ = 0.7, without transforming the input signals using the stereographic projection, are shown in Fig. 4 and Fig. 5, respectively. The clustering is roughly correct, but essential drawbacks can be observed.
For a low value of the vigilance parameter (ρ = 0.55) three classes were created (the fourth one, marked by diamonds, consists of a small number of points, so it can be neglected). The low power and high power states are classified as a single class (marked by "x" in Fig. 4). The idle load class, marked by squares in Fig. 4, also contains the transient state and, partially, the stopped state. Thus it seems that the vigilance parameter should be increased. However, increasing the vigilance creates a fake class, marked by black "snowflakes" in Fig. 5. Furthermore, the classes representing higher and higher powers (squares, reverse triangles and grey stars in Fig. 5) are not clearly separated. Moreover, the greater the vigilance, the greater the number of fake classes. The result of classification by a fuzzy-ART network with the stereographic projection used as a preprocessing procedure, but without signal scaling, is shown in Fig. 6. Although the vigilance parameter was equal to 0.99, which is extremely high, only two classes were found. The classification is clearly incorrect, because only a small part of the data representing the stopped cluster was separated as a distinct class, whereas all the other data were assigned to a common class. The reason is that the data points from $\mathbb{R}^3$ were projected close to the north pole of the sphere $S^3$, so that, after projection, the clusters could not be separated although the vigilance parameter was extremely
Fig. 4. Input data clustering by the fuzzy-ART neural network without stereographic projection, ρ = 0.55
Fig. 5. Input data clustering by the fuzzy-ART neural network without stereographic projection, ρ = 0.7
high (see the discussion at the end of Section 4). This means that the input signals should be scaled before the stereographic transformation in order to reduce their distance from the origin of the coordinate system. Thus, the input vector components were scaled using the transformation
$$[x_1, x_2, x_3] \to [\tilde{x}_1, \tilde{x}_2, \tilde{x}_3] = \left[\frac{x_1}{x_{1MAX}}, \frac{x_2}{x_{2MAX}}, \frac{x_3}{x_{3MAX}}\right].$$
Then $\tilde{x}_i \in [0, 1]$, because $x_i \ge 0$ and $x_{iMAX} > 0$, where $i \in \{1, 2, 3\}$. The result of classification by a fuzzy-ART network with input signals transformed using the stereographic projection and with signal scaling is shown in Fig. 7. The
Fig. 6. Input data clustering by the fuzzy-ART neural network with stereographic projection, without input signal scaling
Fig. 7. Input data clustering by the fuzzy-ART neural network with stereographic projection and input signal scaling, ρ = 0.85
vigilance parameter ρ is equal to 0.85, which is a high value. It can be observed that if the input signals are scaled and stereographically projected, the vigilance parameter can be increased without the appearance of fake classes. The stopped class (grey points), idle load class (circles), low power class (black "snowflakes"), middle power class (squares) and high power class (stars) are identified correctly and clearly. An interesting case is another class, marked by triangles, describing strong gusts of wind when the turbine is stopped. It is very important because (apart from a small number of cases in which the turbine controller was not yet able to respond to the wind gust) it signals a lost opportunity for energy production. This information is very important to the turbine operator. The only drawback of the proposed classification method is that the transition between the stopped state and the idle load state was not detected as a separate cluster. However, it should be stressed that this is an extremely difficult task, because the density of data points in this class is far lower than in the other ones. Moreover, these states are very short, do not contribute to the diagnostics and are typically discarded by monitoring systems. It is worth mentioning that this behavior is very different from the monitoring of large conventional power generators, where this operational state is the source of very important diagnostic information.
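The effect of scaling the inputs before the stereographic projection can be illustrated with a small numerical sketch. All numbers below are hypothetical: the two sample vectors (wind speed, generator speed, power) and the component maxima are invented for the example and are not taken from the recorded data set; only the scaling rule and the projection formula come from the text.

```python
import math


def stereographic(x):
    """Stereographic projection of x in R^n onto the unit sphere in R^(n+1)."""
    s = sum(v * v for v in x)
    return [4.0 * v / (4.0 + s) for v in x] + [(s - 4.0) / (4.0 + s)]


def scale(x, maxima):
    """Divide each component by its maximum, giving values in [0, 1]."""
    return [v / m for v, m in zip(x, maxima)]


def distance(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


# Hypothetical measurements and maxima, used only to show the mechanism.
low_power = [6.0, 1200.0, 300.0]
high_power = [11.0, 1500.0, 1400.0]
maxima = [12.0, 2000.0, 1500.0]

# Projection of the raw vectors: both images end up next to the north pole.
d_raw = distance(stereographic(low_power), stereographic(high_power))

# Projection after scaling: the images stay well apart on the sphere.
d_scaled = distance(stereographic(scale(low_power, maxima)),
                    stereographic(scale(high_power, maxima)))
```

For these invented values the distance between the two projected points grows by more than two orders of magnitude once the inputs are scaled, which is consistent with the difference between Fig. 6 and Fig. 7.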
6 Concluding Remarks
The presented results belong to a broader research activity aimed at automatic monitoring of rotating machinery. We are interested in investigating several approaches that can be applied in engineering practice. Thus, one has to assume that learning sets are not available or cover only a part of the machine states. The problem then becomes the classification of the current state as one of the previously known states or the detection of a new state. Ideally, such a new state should be included in further classifications. It was shown that ART-type networks, both the classical ART-2 and the fuzzy one, were capable of classifying typical states of a wind turbine roughly correctly [3]. The results presented in this paper show that scaling the input signals and transforming them using the stereographic projection significantly improves data clustering. The fuzzy-ART neural network with this input signal preprocessing properly identified the classes corresponding to the stopped state, the idle load state, and the low, middle and high generated power states. Moreover, the state corresponding to strong wind gusts when the turbine is stopped was distinguished as well. Such a classification is practically identical to the one made by a human expert.
References
1. Banakar, H., Ooi, B.T.: Clustering of wind farms and its sizing impact. IEEE Transactions on Energy Conversion 24, 935–942 (2009)
2. Bielecki, A., Bielecka, M., Chmielowiec, A.: Input signals normalization in Kohonen neural networks. In: Rutkowski, L., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2008. LNCS (LNAI), vol. 5097, pp. 3–10. Springer, Heidelberg (2008)
3. Barszcz, T., Bielecki, A., Wójcik, M.: ART-type artificial neural networks applications for classification of operational states in wind turbines. In: Rutkowski, L., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2010. LNCS (LNAI), vol. 6114, pp. 11–18. Springer, Heidelberg (2010)
4. Carpenter, G.A., Grossberg, S.: A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing 37, 54–115 (1987)
5. Carpenter, G.A., Grossberg, S.: ART2: self-organization of stable category recognition codes for analog input pattern. Applied Optics 26, 4919–4930 (1987)
6. Carpenter, G.A., Grossberg, S., Rosen, D.B.: Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks 4, 759–771 (1991)
7. Ezio, S., Claudio, C.: Exploitation of wind as an energy source to meet the worlds electricity demand. Wind Engineering 74-76, 375–387 (1998)
8. Frank, T., Kraiss, K.F., Kuhlen, T.: Comparative analysis of fuzzy ART and ART2A network clustering performance. IEEE Transactions on Neural Networks 9, 544–559 (1998)
9. Gancarzewicz, J.: Differential Geometry, PWN, Warszawa (1987) (in Polish)
10. Georgiopoulos, M., Fernlund, H., Bebis, G., Heileman, G.L.: Order of search in fuzzy ART and fuzzy ARTMAP: effect of the choice parameter. Neural Networks 9, 1541–1559 (1996)
11. Hameeda, Z., Honga, Y.S., Choa, T.M., Ahnb, S.H., Son, C.K.: Condition monitoring and fault detection of wind turbines and related algorithms: A review. Renewable and Sustainable Energy Reviews 13, 1–39 (2009)
12. Huang, J., Georgiopoulos, M., Heileman, G.L.: Fuzzy ART properties. Neural Networks 8, 203–213 (1995)
13. Paska, J., Salek, M., Surma, T.: Current status and perspectives of renewable energy sources in Poland. Renewable and Sustainable Energy Reviews 13, 142–154 (2009)
Binding and Cross-Modal Learning in Markov Logic Networks
Alen Vrečko, Danijel Skočaj, and Aleš Leonardis
Faculty of Computer and Information Science, University of Ljubljana, Slovenia
{alen.vrecko,danijel.skocaj,ales.leonardis}@fri.uni-lj.si
Abstract. Binding, the ability to combine two or more modal representations of the same entity into a single shared representation, is vital for every cognitive system operating in a complex environment. In order to successfully adapt to changes in a dynamic environment, the binding mechanism has to be supplemented with cross-modal learning. In this paper we define the problems of high-level binding and cross-modal learning. Based on these definitions we model a binding mechanism and a cross-modal learner in a Markov logic network and test the system on a synthetic object database.
Keywords: Binding, Cross-modal learning, Graphical models, Markov logic networks, Cognitive systems.
1 Introduction
One of the most important abilities of any cognitive system operating in a real world environment is to be able to relate and merge information from different modalities. For example, when hearing a sudden, unexpected sound, humans automatically try to visually locate its source in order to relate the audio perception of the sound to the visual perception of the source. The process of combining two or more modal representations (grounded in different sensorial inputs) of the same entity into a single multimodal representation is called binding. While the term binding has many different meanings across various scientific fields, a very similar definition comes from neuroscience, where it denotes the ability of the brain to converge perceptual data processed in different brain parts and segregate it into distinct elements [2] [14]. The binding process can operate on different types and levels of cues. In the above example the direction from which the human perceives the sound is an important cue, but sometimes this is not enough. If there are several potential sound sources in the direction of the percept, the human may have to relate higher level audio and visual properties. A knowledge base that associates the higher level perceptual features across different modalities is therefore critical for a successful binding process in any cognitive system. In order to function properly in a dynamic environment, a cognitive system should also be able to learn and adapt in a continuous, open-ended manner.
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 235–244, 2011. © Springer-Verlag Berlin Heidelberg 2011
The ability to update the cross-modal knowledge base online, i.e. cross-modal learning, is therefore vital for any kind of binding process in such an environment. Many of the past attempts at binding information within cognitive systems were restricted to associating linguistic information with lower level perceptual information. Roy et al. tried to ground the linguistic descriptions of objects and actions in visual and sound perceptions and to generate descriptions of previously unseen scenes based on the accumulated knowledge [12] [13]. This is essentially the symbol grounding problem first defined by Harnad [6]. Chella et al. proposed a three-layered cognitive architecture around the visual system, with the middle, conceptual layer bridging the gap between the linguistic and sub-symbolic (visual) layers [4]. Related problems were also often addressed by Steels [15]. Jacobsson et al. approached the binding problem in a more general way [8] [7], developing a cross-modal binding system that could form associations between multiple modalities and could be part of a wider cognitive architecture. The cross-modal knowledge was represented as a set of binary functions comparing binding attributes in a pair-wise fashion. A cognitive architecture using this system for linguistic reference resolution was presented in [16]. This system was capable of learning visual concepts in interaction with a human tutor. Recently, a probabilistic binding system that encodes cross-modal knowledge into a Bayesian graphical model was developed within the same group [17]. The need for a more flexible, but still probabilistic representation of cross-modal knowledge led our research efforts in the direction of Markov graphical models, as presented in this paper. In the next section we define the problems of cross-modal learning and binding. In Section 3 we first briefly describe the basics of Markov logic networks (MLNs). Then we describe our binding and cross-modal learning model, which is based on MLNs. Section 4 describes the experiments performed on the prototype system. We end the paper with the conclusion and future work.
2 The Problem Definition
The main idea of cross-modal learning is to use successful bindings of modal percepts as learning samples for the cross-modal learner. The improved cross-modal knowledge thus enhances the power of the binding process, which is then able to bind together new combinations of percepts, i.e. new learning samples for the learner. For example, if a cognitive system is currently capable of binding an utterance describing something blue and round to a perceived blue ball only by color association, this particular instance of binding could teach the system to also associate the visual shape of the ball with the linguistic concept of roundness. We see that, at least on this level, the binding process depends on the ability to associate between modal features (in this example the visual concepts of blue and shape are features of the visual modality, while the linguistic concepts of blue and ball belong to the linguistic modality). We assume an open world in terms of modal features (new features can be added, old ones retracted). The cross-modal learner starts with just some basic
Binding and Cross-Modal Learning in Markov Logic Networks
prior knowledge of how to associate between a few basic features, which is then gradually expanded to other features and to the new ones that are created. The high-level cross-modal learning problem is closely related to the association rule learning problem in data mining, which was first defined by Agrawal et al. [1]. Therefore, we base our learning problem definition on Agrawal's definition and expand it with the notion of modality. We have a set of n binary attributes called features, $F = \{f_1, f_2, \dots, f_n\}$, and a set of rules called a knowledge database, $K = \{t_1, t_2, \dots, t_m\}$. A rule is defined as an implication over two subsets of features:

$$t_i : X \Rightarrow Y \qquad (1)$$

where $X, Y \subseteq F$ and $X \cap Y = \emptyset$. The features represent various higher level modal properties based on the sensorial input. We introduce the notion of modality to our definition: each feature is restricted to one modality only:

$$\begin{aligned} M_1 &= \{f_{11}, f_{12}, \dots, f_{1n_1}\} \\ M_2 &= \{f_{21}, f_{22}, \dots, f_{2n_2}\} \\ &\;\;\vdots \\ M_k &= \{f_{k1}, f_{k2}, \dots, f_{kn_k}\} \end{aligned} \qquad (2)$$

with $F = M_1 \cup M_2 \cup \dots \cup M_k$. We modify the rule-making restrictions of (1) accordingly:

$$\begin{aligned} &1.\ N = M_{m_1} \cup M_{m_2} \cup \dots \cup M_{m_r}, \quad m_1, \dots, m_r \in \{1, 2, \dots, k\}, \quad r < k \\ &2.\ Y \subseteq N \\ &3.\ X \subseteq F \setminus N \end{aligned} \qquad (3)$$
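The restrictions in (3) can be checked mechanically. The following is a minimal Python sketch, assuming the modalities are given as disjoint sets of feature names. The function name `is_valid_rule`, the example features, and the choice of the smallest possible N (the union of the modalities that Y touches) are illustrative assumptions, not part of the original formulation.

```python
def is_valid_rule(X, Y, modalities):
    """Check restrictions (3) for a rule X => Y (illustrative helper).

    modalities: list of disjoint feature sets M_1..M_k.
    N is taken as the union of the modalities that Y touches, which is
    the smallest N satisfying Y <= N.
    """
    X, Y = set(X), set(Y)
    F = set().union(*modalities)
    if not X or not Y or not (X | Y) <= F:
        return False  # assumption: both sides are non-empty subsets of F
    touched = [M for M in modalities if M & Y]
    if len(touched) >= len(modalities):
        return False  # r < k must hold
    N = set().union(*touched)
    return X <= F - N


# Hypothetical features for three modalities.
language = {"Red", "Green"}
vision = {"Color1", "Color2"}
affordance = {"Rolls"}
mods = [language, vision, affordance]

print(is_valid_rule({"Red"}, {"Color1"}, mods))            # cross-modal rule
print(is_valid_rule({"Red"}, {"Green"}, mods))             # same modality
print(is_valid_rule({"Red", "Color2"}, {"Color1"}, mods))  # X overlaps N
```

Using the smallest N is sufficient here: if any admissible N exists, the union of the modalities touched by Y is contained in it, so it satisfies the same restrictions.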
Next, we define the binding problem. Percepts are collections of features from a single modality. A percept acts as a modal representation of a perceived entity. Let $P$ be the set of current percepts, i. e. the percept configuration:

$$P = \{P_1, P_2, \dots, P_n\}, \quad P_i \subseteq M_j. \qquad (4)$$

Percept unions are collections of percepts from different modalities. A percept union acts as a shared representation of a perceived entity, grounded through its percepts to different modalities. Given a percept configuration $P$, $U(P)$ is the set of current percept unions, i. e. the union configuration:

$$U(P) = \{U_1, U_2, \dots, U_m\}, \quad U_i \subseteq P. \qquad (5)$$

The binding process is then defined as a mapping between a percept configuration and one of the possible union configurations:

$$\beta : P \to U(P), \qquad (6)$$
A. Vrečko, D. Skočaj, and A. Leonardis
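where the following restrictions apply:

$$\begin{aligned} &1.\ U_1 \cup U_2 \cup \dots \cup U_m = P \\ &2.\ \forall U_i, U_j \in U(P),\ i \neq j : U_i \cap U_j = \emptyset \\ &3.\ \forall P_i, P_j \in U_k,\ i \neq j : P_i \subseteq M_l \land P_j \subseteq M_m \Rightarrow l \neq m. \end{aligned} \qquad (7)$$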
The first two restrictions assign each percept in the configuration to exactly one union, while the third restricts the number of percepts per modality in a union to one. To make the binding process plausible, we also introduce a measure of confidence in a union configuration based on the knowledge $K$: the binding confidence $\mathrm{bconf}_K(U)$. Given the percept configuration $P$ and the current knowledge base $K$, the task of the binding process is to find the optimal union configuration:

$$U_{opt}(P) = \operatorname*{argmax}_{U(P)} \mathrm{bconf}_K(U(P)). \qquad (8)$$
In this sense — i. e. considering bconfK (U) as a predictor based on K — we can consider high-level cross-modal learning as a regression problem. Therefore, the aim of the cross-modal learner is to maintain and improve the cross-modal knowledge base, thus providing an increasingly more reliable measure of binding confidence.
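Restrictions (7) and the search in (8) can be illustrated with a short Python sketch. It assumes that percepts are identified by integers, that `modality_of` maps each percept to its modality, and that `bconf` is some scoring function supplied by the caller. All of these names are hypothetical, and the toy confidence function below merely stands in for a measure derived from K.

```python
from itertools import combinations


def is_valid_union_configuration(unions, percepts, modality_of):
    """Check restrictions (7) for a list of unions (sets of percept ids)."""
    # 1. The unions together cover the whole percept configuration.
    covered = set().union(*unions) if unions else set()
    if covered != set(percepts):
        return False
    # 2. Unions are pairwise disjoint.
    if any(a & b for a, b in combinations(unions, 2)):
        return False
    # 3. At most one percept per modality in each union.
    for union in unions:
        mods = [modality_of[p] for p in union]
        if len(mods) != len(set(mods)):
            return False
    return True


def best_configuration(candidates, percepts, modality_of, bconf):
    """Return the valid candidate with the highest binding confidence."""
    valid = [c for c in candidates
             if is_valid_union_configuration(c, percepts, modality_of)]
    return max(valid, key=bconf)


percepts = [1, 2, 3]
modality_of = {1: "Vision", 2: "Vision", 3: "Language"}
candidates = [
    [{1, 3}, {2}],
    [{1}, {2, 3}],
    [{1}, {2}, {3}],
    [{1, 2}, {3}],      # invalid: two visual percepts in one union
]
# Toy confidence: prefer configurations that bind percepts 1 and 3.
toy_bconf = lambda config: 1.0 if {1, 3} in config else 0.0
print(best_configuration(candidates, percepts, modality_of, toy_bconf))
```

Enumerating candidates explicitly is only practical for very small percept configurations; it is shown here to make the argmax in (8) concrete.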
3
Implementation in MLN
Markov logic networks¹ [10] combine first-order logic and probabilistic graphical models in a single representation. An MLN knowledge base consists of a set of first-order logic formulae (rules), each with a weight attached:

weight  first-order logic formula

The weight is a real number that determines how strong a constraint each rule is: the higher the weight, the less likely the world is to violate that rule. Together with a finite set of constants, the MLN defines a Markov network (MN), or Markov random field. An MN is an undirected graph where each possible grounding of a predicate (all predicate variables replaced with constants) represents a node, while the rules define the edges between the nodes. Each rule grounding defines a clique in the graph. An MLN can thus be viewed as a template for constructing the MN. The probability distribution over possible worlds x defined by an MN is given by

$$P(X = x) = \frac{1}{Z} \exp\Big(\sum_i w_i n_i(x)\Big), \qquad (9)$$

¹ We used Alchemy [9] to implement the prototype of our cross-modal binding and cross-modal learning mechanisms. Alchemy is a software package providing various inference and learning algorithms based on MLN.
where $n_i(x)$ is the number of true groundings of rule $i$ in $x$, $w_i$ is the weight of rule $i$, and $Z$ is the partition function defined as $Z = \sum_x \exp\big(\sum_i w_i n_i(x)\big)$. Inference in an MN is a #P-complete problem [11]. Methods for approximating the inference include various Markov Chain Monte Carlo sampling methods [5] and belief propagation [18].
3.1
Cross-Modal Knowledge Base
We have two types of templates for the binding rules in our cross-modal knowledge base. The template for the aggregative rule is defined as

$$\mathrm{perPart}(p_1, f_1) \land \mathrm{uniPart}(u, p_1) \land \mathrm{perPart}(p_2, f_2) \Rightarrow \mathrm{uniPart}(u, p_2), \qquad (10)$$

where the predicate perPart(percept, feature) denotes that the feature feature is part of the percept percept, while uniPart(union, percept) denotes that union includes percept. In a very similar manner the template for the segregative rule is defined:

$$\mathrm{perPart}(p_1, f_1) \land \mathrm{uniPart}(u, p_1) \land \mathrm{perPart}(p_2, f_2) \Rightarrow \lnot\mathrm{uniPart}(u, p_2). \qquad (11)$$
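The aggregative rules are used to merge the percepts into common percept unions, while the segregative rules separate them into distinct unions. The template rules are equivalent to a subset of association rules (1), where each side is limited to one feature. We also define the binding domain that we will use to ground the network. An example binding domain with two modalities is:

$$\begin{aligned} \mathit{modality} &= \{\mathit{Language}, \mathit{Vision}\} \\ \mathit{feature} &= \{\mathit{Red}, \mathit{Green}, \mathit{Blue}, \mathit{Compact}, \mathit{Flat}, \mathit{Elongated}, \mathit{Color1}, \mathit{Color2}, \mathit{Color3}, \mathit{Shape1}, \mathit{Shape2}, \mathit{Shape3}\}. \end{aligned} \qquad (12)$$

Based on this example domain a small set of grounded and weighted binding rules could look like this:

$$\begin{aligned} 2.5 \quad & \mathrm{perPart}(p_1, \mathit{Red}) \land \mathrm{uniPart}(u, p_1) \land \mathrm{perPart}(p_2, \mathit{Color1}) \Rightarrow \mathrm{uniPart}(u, p_2) \\ 1.9 \quad & \mathrm{perPart}(p_1, \mathit{Red}) \land \mathrm{uniPart}(u, p_1) \land \mathrm{perPart}(p_2, \mathit{Color2}) \Rightarrow \lnot\mathrm{uniPart}(u, p_2), \end{aligned} \qquad (13)$$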
The predicates forming the binding rules are not fully grounded yet. They are grounded on the conceptual level only, with known features like Red, Color1, etc., while the instances (objects) are still represented with variables. The predicates get fully grounded each time an inference is performed, when based on the current situation (e. g. perceived objects that form the scene) an MN is constructed. This principle could be very beneficial for a cognitive system, since while decoupling the general from the specific, it allows for the application and adaptation of general concepts learned over longer periods of time to the current situation in a very flexible fashion.
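Using the example domain in (12), an example percept configuration could look like

$$\mathrm{perPart}(1, \mathit{Color1}) \land \mathrm{perPart}(1, \mathit{Shape2}) \land \mathrm{perPart}(2, \mathit{Color2}) \land \mathrm{perPart}(2, \mathit{Shape3}) \land \mathrm{perPart}(3, \mathit{Red}). \qquad (14)$$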
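To make the role of the weighted rules concrete, the sketch below scores every admissible union configuration for the percepts in (14) with the two rules in (13), following equation (9). It is a brute-force illustration in Python, not how Alchemy performs inference. The set of candidate unions {1, 2, 3} and the assignment of the Shape and Color2 features to the visual modality are assumptions made for the example.

```python
import math
from itertools import product

# Percept configuration (14): percept id -> features.
PERCEPTS = {1: {"Color1", "Shape2"}, 2: {"Color2", "Shape3"}, 3: {"Red"}}
# Assumed partition of features between modalities (modPart).
MODALITY = {"Red": "Language", "Color1": "Vision", "Color2": "Vision",
            "Shape2": "Vision", "Shape3": "Vision"}
# Rules (13): (weight, f1, f2, aggregative?).
RULES = [(2.5, "Red", "Color1", True), (1.9, "Red", "Color2", False)]
UNIONS = (1, 2, 3)  # assumed set of candidate unions


def obeys_hard_rules(assign):
    """One percept per modality in each union (restriction 3 in (7))."""
    seen = set()
    for p, u in assign.items():
        key = (u, MODALITY[next(iter(PERCEPTS[p]))])
        if key in seen:
            return False
        seen.add(key)
    return True


def weighted_count(assign):
    """Sum of w_i * n_i(x): n_i counts the true groundings of rule i."""
    total = 0.0
    for w, f1, f2, aggregative in RULES:
        for p1, p2, u in product(PERCEPTS, PERCEPTS, UNIONS):
            body = (f1 in PERCEPTS[p1] and assign[p1] == u
                    and f2 in PERCEPTS[p2])
            head = (assign[p2] == u) == aggregative
            total += w * ((not body) or head)
    return total


# Each world assigns every percept to exactly one union.
worlds = [dict(zip(PERCEPTS, us)) for us in product(UNIONS, repeat=3)]
worlds = [x for x in worlds if obeys_hard_rules(x)]
scores = [weighted_count(x) for x in worlds]
top = max(scores)
Z = sum(math.exp(s - top) for s in scores)       # shifted for stability
probs = [math.exp(s - top) / Z for s in scores]  # equation (9)
best = worlds[probs.index(max(probs))]
print(best, round(max(probs), 3))
```

The most probable worlds are those in which the linguistic percept 3 shares a union with percept 1 and percept 2 sits in a different union; union labels are interchangeable, so several equivalent worlds tie for the maximum.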
From (13) and (14) we could infer the following union configuration: $\mathrm{uniPart}(1, 1) \land \mathrm{uniPart}(1, 3) \land \mathrm{uniPart}(2, 2)$. Besides the binding rules, our database also contains feature priors in the following form:

weight  perPart(p, ColorGrounding)

A feature's prior denotes the default probability of a feature belonging to a percept (if there is no positive or negative evidence about it). In addition, we use a special predicate to determine the partition of features between modalities in the sense of (2), e. g. modPart(Language, Red), modPart(Vision, Color1).
3.2
Learning
After the rules and priors are grounded within the binding domain, we need to learn their weights. We use the generative learning method described in [10]. The learner computes the gradient of the log-likelihood with respect to the weights, based on the number of true groundings ($n_i(x)$) in the learning database and the expected number of true groundings according to the MLN ($E_w[n_i(x)]$):

$$\frac{\partial}{\partial w_i} \log P_w(x) = n_i(x) - E_w[n_i(x)], \qquad (15)$$

and optimizes the weights accordingly. Since the expectations $E_w[n_i(x)]$ are very hard to compute, the method uses the pseudo-likelihood to approximate them [3]. Continuous learning is performed by feeding the learning samples to the system in small batches (3-6 percept unions). Each mini batch represents a scene the system has resolved, described with perPart and uniPart predicates. In each learning step the learner takes the rule's old weight in the knowledge database as the mean of the Gaussian prior, which it tries to adjust based on the new training mini batch. By setting the dispersion of the weight's Gaussian prior to an adequate value, we ensure the learning rate of each mini batch is proportional to the batch size.
3.3
The Binding Process
The binding process translates to inferring over the knowledge base based on some evidence. In order for the binding inference to function properly we have
to define some hard rules (formulae with infinite weight) that apply the binding restrictions in (7):

1. $\forall p\, \exists u : \mathrm{uniPart}(u, p)$
2. $\mathrm{uniPart}(u_1, p) \land \mathrm{uniPart}(u_2, p) \Rightarrow u_1 = u_2$
3. $\mathrm{perPart}(p_1, f_1) \land \mathrm{perPart}(p_2, f_2) \land (p_1 \neq p_2) \land \mathrm{modPart}(m, f_1) \land \mathrm{modPart}(m, f_2) \land \mathrm{uniPart}(u, p_1) \Rightarrow \lnot\mathrm{uniPart}(u, p_2)$.

Usually the inference consists of querying for the predicate uniPart, where the evidence typically includes the description of the current percept configuration (using the predicate perPart), a list of known and potential unions and the description of the current partial union configuration (some percepts are already assigned to known unions). The binding result is expressed as a probability distribution for each unassigned percept over the known and potential unions.
4
Experimental Results
We experimented with our system on a database of 42 synthetic objects. Objects had percepts from three modalities: vision, language and affordance. The visual modality had 13 features in total: 6 for object color and 7 for the shape. Language had 13 features matching the visual features and 8 features for object type (e. g. book, box, apple). The affordance modality had 3 features describing the possible outcomes of pushing an object. Mini batches were designed to mimic robot interaction with a human tutor, where the tutor showed objects to the robot, describing their properties. Typically a mini batch contained 5-6 objects. The learning sequence consisted of 80 mini batches. We designed 30 test-cases for evaluating the binding process. In each test-case we had three visual percepts and one non-visual percept. The binder had to determine which visual percept, if any at all, the non-visual percept belonged to (i. e. four possible choices: one for each visual percept and one for no corresponding percept). Of the four possible choices there was always one that was more obvious than the others and deemed correct. The possibility that the system inferred as the most probable was considered to be its binding choice. The test-cases varied in their level of difficulty: the easiest featured distinct features for visual percepts and complete information for all percepts (all percepts had a value for each feature type belonging to its modality, see figure 1), while more difficult cases could have features shared by more percepts and incomplete percept information (see figure 2). The tests were performed several times during the learning process in intervals of four batches.

union = {1, 2, 3, 4}
perPart(1, VisRed), perPart(1, VisFlat), perPart(1, VisCylindrical)
perPart(2, VisBlue), perPart(2, VisCompact), perPart(2, VisSpherical)
perPart(3, VisGreen), perPart(3, VisElongated), perPart(3, VisConical)
uniPart(1, 1), uniPart(2, 2), uniPart(2, 2)
perPart(4, LinRed), perPart(4, LinFlat), perPart(4, LinCylindrical)
uniPart(u, 4)?

Fig. 1. An example of an easy test-case. We can see that objects represented with visual percepts (1, 2 and 3) differ in all types of visual features. The system needs to determine which union the fourth, linguistic percept belongs to.

union = {1, 2, 3, 4}
perPart(1, VisRed), perPart(1, VisCompact), perPart(3, VisConical)
perPart(2, VisGreen), perPart(2, VisCompact), perPart(2, VisSpherical)
perPart(3, VisGreen), perPart(3, VisFlat),
uniPart(1, 1), uniPart(2, 2), uniPart(2, 2)
perPart(4, LinApple)
uniPart(u, 4)?

Fig. 2. An example of a difficult test-case. We can see that the objects represented with visual percepts (1, 2 and 3) are less distinct than in the easier test-case (fig. 1) and with some incomplete information. The system has to find out which visual percept could be an apple. The visual training samples for apples consisted of compact and spherical percepts of red or green color.
Fig. 3. Experimental results: the average rate of correct binding choices relative to the number of training batches (10 randomly generated learning sequences were used). The green, yellow and red lines denote the easy, medium and hard test samples respectively, while the blue line denotes the overall success percentage.
Figure 3 shows the average success rate over 10 randomly generated learning sequences. We see that with the growing number of samples the binding rate tends to grow and converge, though with some oscillations. The oscillations tend to be more pronounced for the difficult samples. Analyzing the results example by example we saw that the test-cases with the most oscillations were the ones that depended on many-to-one feature associations (e. g. red, compact, cylindrical ⇒ cola can). This can be explained with the current structure of our binding rules that directly support one-to-one feature associations only.
5
Conclusion
In this paper we defined the problems of high-level binding and cross-modal learning. Based on these definitions we modeled a binding mechanism and a cross-modal learner in MLNs. We tested the system on a synthetic object database and showed how the binding power of the system increases with the number of learned samples. In the future we will apply our binding and cross-modal learning models to a real cognitive architecture that includes visual and communication subsystems, thus gaining a platform for experiments on real-world data. We will also extend the structure of our database to more complex rules (or perhaps include a structure learning mechanism in our system) and improve and extend our experiments to better simulate the robot-tutor interaction.
Acknowledgment This work was supported by the EC FP7 IST project CogX-215181.
References
1. Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets of items in large databases. In: Proc. of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C., pp. 207–216 (May 1993)
2. Bartels, A., Zeki, S.: The temporal order of binding visual attributes. Vision Research 46(14), 2280–2286 (2006)
3. Besag, J.: Statistical analysis of non-lattice data. Journal of the Royal Statistical Society. Series D (The Statistician) 24(3), 179–195 (1975)
4. Chella, A., Frixione, M., Gaglio, S.: A cognitive architecture for artificial vision. Artif. Intell. 89(1-2), 73–111 (1997)
5. Gilks, W.R., Spiegelhalter, D.J.: Markov chain Monte Carlo in practice. Chapman & Hall/CRC (1996)
6. Harnad, S.: The symbol grounding problem. Physica D: Nonlinear Phenomena 42, 335–346 (1990)
7. Jacobsson, H., Hawes, N., Kruijff, G.-J., Wyatt, J.: Crossmodal content binding in information-processing architectures. In: Proc. of the 3rd ACM/IEEE International Conference on Human-Robot Interaction, Amsterdam (March 2008)
8. Jacobsson, H., Hawes, N., Skočaj, D., Kruijff, G.-J.: Interactive learning and cross-modal binding - a combined approach. In: Symposium on Language and Robots, Aveiro, Portugal (2007)
9. Kok, S., Sumner, M., Richardson, M., Singla, P., Poon, H., Lowd, D., Wang, J., Domingos, P.: The Alchemy system for statistical relational AI. Technical report, Department of Computer Science and Engineering, University of Washington, Seattle, WA (2009)
10. Richardson, M., Domingos, P.: Markov logic networks. Mach. Learn. 62(1-2), 107–136 (2006)
11. Roth, D.: On the hardness of approximate reasoning. Artif. Intell. 82(1-2), 273–302 (1996)
12. Roy, D.: Learning visually-grounded words and syntax for a scene description task. Computer Speech and Language 16(3-4), 353–385 (2002)
13. Roy, D.: Grounding words in perception and action: computational insights. TRENDS in Cognitive Sciences 9(8), 389–396 (2005)
14. Singer, W.: Consciousness and the binding problem. Annals of the New York Academy of Sciences 929, 123–146 (2001)
15. Steels, L.: The Talking Heads Experiment. Words and Meanings, vol. 1. Laboratorium, Antwerpen (1999)
16. Vrečko, A., Skočaj, D., Hawes, N., Leonardis, A.: A computer vision integration model for a multi-modal cognitive system. In: Proc. of the 2009 IEEE/RSJ Int. Conf. on Intelligent RObots and Systems, St. Louis, pp. 3140–3147 (October 2009)
17. Wyatt, J., Aydemir, A., Brenner, M., Hanheide, M., Hawes, N., Jensfelt, P., Kristan, M., Kruijff, G.-J., Lison, P., Pronobis, A., Sjöö, K., Skočaj, D., Vrečko, A., Zender, H., Zillich, M.: Self-understanding & self-extension: A systems and representational approach (2010) (accepted for publication)
18. Yedidia, J.S., Freeman, W.T., Weiss, Y.: Understanding belief propagation and its generalizations. Morgan Kaufmann Publishers Inc., San Francisco (2003)
Chaotic Exploration Generator for Evolutionary Reinforcement Learning Agents in Nondeterministic Environments
Akram Beigi, Nasser Mozayani, and Hamid Parvin
School of Computer Engineering, Iran University of Science and Technology (IUST), Tehran, Iran
{Beigi,Mozayani,Parvin}@iust.ac.ir
Abstract. In the exploration phase of reinforcement learning, it is necessary to introduce a process of trial and error to discover better rewards obtained from the environment. To this end, one usually uses a uniform pseudorandom number generator in the exploration phase. However, it is known that a chaotic source also provides a random-like sequence similar to a stochastic source. In this paper we employ a chaotic generator in the exploration phase of reinforcement learning in a nondeterministic maze problem, and we obtain promising results.
Keywords: Reinforcement Learning, Evolutionary Q-Learning, Chaotic Exploration.
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 245–253, 2011. © Springer-Verlag Berlin Heidelberg 2011
1 Introduction
In reinforcement learning, agents learn their behaviors by interacting with an environment [1]. An agent senses and acts in its environment in order to learn to choose optimal actions for achieving its goal. It has to discover by trial-and-error search how to act in a given environment. For each action the agent receives feedback (also referred to as a reward or reinforcement) to distinguish what is good and what is bad. The agent's task is to learn a policy or control strategy for choosing the set of actions that achieves its goal in the long run. For this purpose the agent stores a cumulative reward for each state or state-action pair. The ultimate objective of a learning agent is to maximize the cumulative reward it receives in the long run, from the current state and all subsequent states up to the goal state. Reinforcement learning systems have four main elements [2]: a policy, a reward function, a value function and a model of the environment. A policy defines the behavior of the learning agent. It consists of a mapping from states to actions. A reward function specifies how good the chosen actions are. It maps each perceived state-action pair to a single numerical reward. In the value function, the value of a given state is the total reward accumulated in the future, starting from that state. The model of the environment simulates the environment's behavior and may predict the next environment state from the current state-action pair; it is usually represented as a Markov Decision Process (MDP) [1, 3, 4]. In the MDP model, the agent senses the state of the world and then takes an action which leads it to a new state. The choice of the new state depends on the agent's current state and its action. An MDP is defined as a 4-tuple <S, A, T, R> characterized as follows: S is the set of states in the environment, A is the set of actions available in the environment, T is the state transition function for state s and action a, and R is the reward function. The optimal solution for an MDP is to take the best action available in each state, i.e. the action that collects as much reward as possible over time. In reinforcement learning, it is necessary to introduce a process of trial and error to maximize the rewards obtained from the environment. This trial-and-error process is called environment exploration. Because there is a trade-off between exploration and exploitation, balancing them is very important. This is known as the exploration-exploitation dilemma. The scheme of the exploration is called a policy. There are many kinds of policies, such as ε-greedy, softmax, weighted roulette and so on. In these existing policies, exploration is decided by using stochastic numbers from a random generator. It is common to use a uniform pseudorandom number generator in the exploration phase. However, it is known that a chaotic source also provides a random-like sequence similar to a stochastic source. Employing a chaotic generator based on the logistic map in the exploration phase gives better performance than employing the stochastic random generator in a nondeterministic maze problem. Morihiro et al. [5] proposed using a chaotic pseudorandom generator instead of a stochastic random generator in an environment with changing goals or solution paths along with exploration. That algorithm is severely sensitive to ε in ε-greedy. It is important to note that they do not use a chaotic random generator in nondeterministic environments.
In that work, it can be inferred that the stochastic random generator performs better when random action selection is used instead of ε-greedy. On the other hand, because learning by reinforcement learning is slow, evolutionary computation techniques are applied to improve learning in nondeterministic environments. In this work we propose a modified reinforcement learning algorithm that applies a population-based evolutionary computation technique and uses the random-like behavior of deterministic chaos as the random generator in its exploration phase, to improve learning in multi-task agents. To sum up, our contributions are:
• Applying evolutionary strategies to the reinforcement learning algorithm to increase both the speed and the accuracy of the learning phase,
• Using a chaotic generator instead of a uniform pseudorandom number generator in the exploration phase of evolutionary reinforcement learning,
• Multi-task learning in nondeterministic environments.
2 Chaotic Exploration
Chaos theory studies the behavior of certain dynamical systems that are highly sensitive to initial conditions. Small differences in initial conditions (such as those due to rounding errors in numerical computation) result in widely diverging outcomes for chaotic systems, which makes long-term prediction impossible in general. This happens even though these systems are deterministic, meaning that their future dynamics are fully determined by their initial conditions, with no random elements involved. In other words, the deterministic nature of these systems does not make them predictable if the initial condition is unknown [6, 7]. As mentioned, there are many kinds of exploration policies in reinforcement learning, such as ε-greedy, softmax and weighted roulette. It is common to use a uniform pseudorandom number generator as the stochastic exploration generator in each of these policies. Another way to deal with the problem of exploration generators is to use a chaotic deterministic generator in place of the stochastic exploration generator [5]. In this paper, the chaotic deterministic generator is a logistic map, which generates values in the closed interval [0, 1] according to equation 1:

$$x_{t+1} = \mathit{alpha} \cdot x_t (1 - x_t). \qquad (1)$$

In equation 1, $x_0$ is a uniform pseudorandom number generated in the [0, 1] interval and alpha is a constant in the interval [0, 4]. It can be shown that the sequence $x_t$ stays within the [0, 1] interval for such alpha, and that it can behave chaotically when alpha is near to and below 4 [8, 9]. It is important to note that the sequence may diverge for alpha greater than 4. The closer alpha is to 4, the wider the range of values the sequence visits. If alpha is set to 4, the widest range (possibly all points in the [0, 1] interval) is covered across different initializations of the sequence. So here alpha is set to 4 to make the output of the sequence as similar as possible to a uniform pseudorandom number.
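The logistic map of equation 1 is straightforward to use as a drop-in source of exploration numbers. The following Python sketch assumes alpha = 4 as in the text; the helper `chaotic_epsilon_greedy` and its way of mapping a chaotic value onto an action index are hypothetical illustrations of how such a generator might replace a pseudorandom one in ε-greedy selection.

```python
import random


def logistic_map(x0, alpha=4.0):
    """Yield x_1, x_2, ... of the logistic map started at x0."""
    x = x0
    while True:
        x = alpha * x * (1.0 - x)
        yield x


def chaotic_epsilon_greedy(q_values, chaos, epsilon=0.1):
    """epsilon-greedy choice driven by a chaotic sequence (hypothetical)."""
    if next(chaos) < epsilon:
        # Explore: map the next chaotic value onto an action index.
        return min(int(next(chaos) * len(q_values)), len(q_values) - 1)
    # Exploit: pick the action with the highest estimated value.
    return max(range(len(q_values)), key=q_values.__getitem__)


chaos = logistic_map(random.random())
actions = [chaotic_epsilon_greedy([0.1, 0.5, 0.2, 0.0], chaos)
           for _ in range(10)]
print(actions)
```

One point to keep in mind when using such a generator: the values of the alpha = 4 map are not exactly uniformly distributed over [0, 1] (they concentrate near both ends of the interval), so the effective exploration rate can differ from the nominal epsilon.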
3 Population Based Evolutionary Computation
One line of research in evolutionary computation was introduced by Handa [10]. It uses a kind of memory in evolutionary computation for storing past optimal solutions. In that work, each individual in the population denotes a policy for a routine task. The best individual in the current population is inserted into an archive when the environment changes. After that, individuals in the archive are randomly selected to be moved into the population. The algorithm is called Memory-based Evolutionary Programming and is depicted in Fig 1. A large number of studies concerning dynamic or uncertain environments have used evolutionary computation algorithms [11]. These problems try to reach their goal as soon as possible. The significant issue is that the robots can get assistance from their previous experiences. In this paper a population-based chaotic evolutionary computation for multi-task reinforcement learning problems is examined.
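The archive mechanism described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `on_environment_change`, the `n_recall` parameter, and the policy of replacing the worst individuals with the recalled ones are hypothetical choices, since the text only says that archived individuals are randomly moved back into the population.

```python
import random


def on_environment_change(population, archive, fitness, n_recall, rng=random):
    """Archive the current best policy and recall stored ones (sketch)."""
    best = max(population, key=fitness)
    archive.append(best)
    # Move randomly chosen archived individuals back into the population,
    # replacing the currently worst ones (an assumed replacement policy).
    recalled = rng.sample(archive, min(n_recall, len(archive)))
    survivors = sorted(population, key=fitness, reverse=True)
    keep = max(len(population) - len(recalled), 0)
    return survivors[:keep] + recalled


# Toy individuals are plain numbers and fitness is the number itself.
archive = []
new_population = on_environment_change([1, 5, 3], archive,
                                       fitness=lambda ind: ind, n_recall=1)
print(new_population, archive)
```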
Fig. 1. Handa algorithm’s Diagram for evolutionary computation
4 Q-Learning
Among reinforcement learning algorithms, Q-learning is considered one of the most important [1]. It learns a mapping Q from state-action pairs to values, using rewards obtained from the interaction with the environment. In this case, the learned action-value function, Q, directly approximates Q*, the optimal action-value function, independent of the policy being followed. This simplifies the analysis of the algorithm and enabled early convergence proofs. The pseudocode of the Q-learning algorithm is shown in Fig 2.

Q-Learning Algorithm:
  Initialize Q(s,a) arbitrarily
  Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
      Choose a from s using policy derived from Q (e.g., ε-greedy)
      Take action a, observe r, s'
      Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]
      s ← s'
    Until s is terminal

Fig. 2. Q-Learning Algorithm
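The update in Fig. 2 can be written out as a small, runnable Python sketch. The environment below is a hypothetical four-state corridor with a reward of 1 at the right end, chosen only to keep the example short; the parameter values (α = 0.5, γ = 0.9, ε = 0.2) and the per-episode step cap are likewise assumptions.

```python
import random
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One Q-learning backup: Q(s,a) += alpha * (target - Q(s,a))."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def train_chain(n_states=4, episodes=500, epsilon=0.2, seed=0):
    """Q-learning on a hypothetical corridor; reward 1 at the right end."""
    rng = random.Random(seed)
    actions = (-1, +1)                      # move left / move right
    Q = defaultdict(float)
    for _ in range(episodes):
        s = 0
        for _ in range(100):                # step cap per episode
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s_next = min(max(s + a, 0), n_states - 1)
            r = 1.0 if s_next == n_states - 1 else 0.0
            q_update(Q, s, a, r, s_next, actions)
            s = s_next
            if s == n_states - 1:           # terminal state reached
                break
    return Q


Q = train_chain()
print(Q[(0, +1)] > Q[(0, -1)])
```

After training, moving right from the start state should carry the higher value, since it is the only direction that leads to the rewarded end of the corridor.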
5 Evolutionary Reinforcement Learning
Evolutionary Reinforcement Learning (ERL) is a method of searching for the best policy in an RL problem by applying a genetic algorithm (GA). In this case, the potential solutions are the policies, represented as chromosomes, which can be modified by genetic operators such as crossover and mutation [12].
GA can directly learn decision policies without studying the model and state space of the environment in advance. The fitness values of different potential policies are used by GA. In many cases, the fitness function can be computed as the sum of rewards, which are used to update the Q-values. We use a modified Q-learning algorithm that applies a memory-based evolutionary computation technique to improve learning in multi-task agents [13].
6 Chaotic Based Evolutionary Q-Learning
By applying genetic algorithms to reinforcement learning in nondeterministic environments, we propose a Q-learning method called Evolutionary Q-learning. The algorithm is presented in Fig 3.

Chaotic Based Evolutionary Q Learning (CEQL):
  Initialize Q(s,a) by zero
  Repeat (for each generation):
    Repeat (for each episode):
      Initialize s
      Repeat (for each step of episode):
        Initiate(Xcurrent) by Rnd[0,1]
        Repeat
          Xnext = 4 * Xcurrent * (1 - Xcurrent)
        Until (Xnext - Xcurrent < ε)
        Choose a from s using Xnext
        Take action a, observe r, s'
        s ← s'
      Until s is terminal
      Add visited path as a Chromosome to Population
    Until population is complete
    Do Crossover() by CRate
    Evaluate the created Childs
    Do tournament Selection()
    Select the best individual for updating Q-Table as follows:
      Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]
    Copy the best individual in next population
  Until satisfying convergence

Fig. 3. Chaotic based evolutionary Q-Learning Algorithm
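Two steps of Fig. 3, tournament selection and the Q-table update from the best individual, can be sketched in Python as follows. The chromosome encoding (a list of (s, a, r, s') transitions), the fitness function that favors shorter paths, and the helper names are assumptions made for illustration; the figure itself does not fix these details.

```python
import random


def tournament_select(population, fitness, k, rng=random):
    """Return the fittest of k individuals drawn at random."""
    return max(rng.sample(population, k), key=fitness)


def update_q_from_path(Q, path, actions, alpha=0.5, gamma=0.9):
    """Apply the Q-learning backup along a path of (s, a, r, s_next)."""
    for s, a, r, s_next in path:
        best_next = max(Q.get((s_next, b), 0.0) for b in actions)
        old = Q.get((s, a), 0.0)
        Q[(s, a)] = old + alpha * (r + gamma * best_next - old)


# Two hypothetical chromosomes: visited paths ending at goal state "G".
short = [("S", "R", 0.0, "A"), ("A", "R", 1.0, "G")]
long_ = [("S", "L", 0.0, "S"), ("S", "R", 0.0, "A"), ("A", "R", 1.0, "G")]
fitness = lambda path: -len(path)        # assumed: shorter paths are fitter
best = tournament_select([short, long_], fitness, k=2)
Q = {}
update_q_from_path(Q, best, actions=("L", "R"))
print(Q)
```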
7 Simulation and Results
7.1 Nondeterministic Maze Task
Assume that a number of robots are working in a mine and their task is to search for gold from an initial point. The mine has a group of corridors which the robots can pass through. In specific paths there are barriers which do not let the robots continue. Now, suppose that because of decaying corridors, there may be pits in some places. If a robot enters such a pit, there is a probability above zero that it cannot exit the pit with a given move. If it fails to exit with that move, it has to try again. The aim of the robots is to find the gold state as soon as possible. Note that the robots can use their past experiences. In such problems, the shortest path may not be the best path, because it may contain pit cells that force the agent to make many moves before finding the goal state. Therefore the optimum path has few pit cells and a short length, and applying an evolutionary version of Q-learning is more useful. To validate the proposed algorithm we turn to a modified version of Sutton's maze problem, which is depicted in Fig 4. The original Sutton's maze problem consists of 6 x 9 cells: 46 common states, 1 goal state and 7 collision cells (gray cells in Fig 4). An agent can occupy each common state. It can't pass through collision cells. For each agent, there are at most four actions to take in each state: Up, Down, Left, and Right. The original Sutton's maze problem is a deterministic problem [1]. In the nondeterministic version of the problem, one adds a number of probabilistic cells or holes (hatched cells in Fig 4). An agent can't be certain of leaving a hole by its chosen action; it may remain in its previous state. That is, it has to stay in the same cell with probability above zero; this probability is sampled from a Normal distribution with average = 0 and variance = 1.
For example, if an agent takes the Left action in a hole with the transition probability according to Fig 5 a, the next state may be the same state with probability 0.6. Agents gain a reward of +1 if they reach the goal state. All other states don't give any reward to agents. There is no punishment in the problem.
7.2 Actions
Each action of an agent could be an MDP sample such as those delineated in Fig. 5. The right part shows the certain case, in which agents move to the next state by choosing any possible action with probability equal to 1. On the other hand, the left part shows that agents may fail to move to the next state by choosing a possible action and may remain in their position for some actions. These MDPs are presented to the learning algorithm sequentially. The presentation time of each problem instance is long enough to learn. The problem is to maximize the total acquired rewards over the lifespan. Agents return to the start state after arriving at the goal state.
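The nondeterministic transitions described above can be sketched as a single step function in Python. The grid coordinates, the hole location, and the helper name `step` are hypothetical; the 0.6 stay probability mirrors the example given in the text, and the 6 x 9 grid size follows the maze description.

```python
import random

MOVES = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}


def step(state, action, stay_prob, walls, shape=(6, 9), rng=random):
    """Move on the grid; in a hole the agent stays put with stay_prob."""
    if rng.random() < stay_prob.get(state, 0.0):
        return state                      # the action failed inside a hole
    row, col = state
    d_row, d_col = MOVES[action]
    nxt = (row + d_row, col + d_col)
    inside = 0 <= nxt[0] < shape[0] and 0 <= nxt[1] < shape[1]
    return nxt if inside and nxt not in walls else state


# Hypothetical layout: one hole at (2, 3) with a 0.6 chance of staying.
holes = {(2, 3): 0.6}
rng = random.Random(0)
stays = sum(step((2, 3), "Left", holes, walls=set(), rng=rng) == (2, 3)
            for _ in range(10000))
print(stays / 10000)
```

Ordinary cells simply have no entry in `stay_prob`, so actions taken there always succeed unless blocked by a wall or the grid boundary.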
Chaotic Exploration Generator for Evolutionary Reinforcement Learning Agents
Fig. 4. Modified Sutton's maze (cells labeled S and G)
Fig. 5. Actions MDP Models
7.3 Experimental Results A single experiment is composed of 100 tasks, and each task consists of 100 generations. In this work 50 experiments were performed, so 500,000 generations were executed in total. For Evolutionary Q-Learning and Chaotic Evolutionary Q-Learning the population size is set to 100. Table 1 summarizes the experimental results. As Table 1 shows, using the chaotic generator instead of the stochastic random generator improves the average path length in both original Q-learning and evolutionary Q-learning. In original Q-learning the chaotic generator yields a 6.85% improvement, and in Evolutionary Q-Learning it yields a 5.09% improvement. This is not unexpected: the effect was explored in [5], where the superiority of the chaotic generator over the stochastic random generator in the exploration phase of reinforcement learning was shown. Moreover, as reported in [12], evolutionary reinforcement learning significantly improves the average found path compared with the original version of reinforcement learning. The results here also show that Evolutionary Q-Learning outperforms Chaotic based Q-learning. It can therefore be expected that employing the chaotic generator instead of the stochastic random generator in evolutionary reinforcement learning also yields better performance. As reported in Table 1, chaotic based evolutionary reinforcement learning improves the average found path compared with
A. Beigi, N. Mozayani, and H. Parvin

Table 1. Experimental Result

Method                     Best average of path    Worst average of path   Total average of
                           length in population    length in population    path length
OQL                        54.26                   1600                    449.1149
CQL                        60.49                   1500.33                 418.33
EQL                        28.34                   66.79                   42.4738
CEQL                       25.11                   99.99                   40.3098
Improvement CEQL to OQL    53.7%                   93.7%                   91%

OQL: Original Q-learning
CQL: Chaotic based Q-learning
EQL: Evolutionary Q-Learning
CEQL: Chaotic based Evolutionary Q-learning
the average paths found by the non-chaotic version of evolutionary reinforcement learning by 5.09%, as expected. It can therefore be concluded that using the chaotic random generator in the exploration phase of reinforcement learning leads to better exploration of the environment, in both the evolutionary and the non-evolutionary version.
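The improvement percentages quoted in this section follow from the "Total average of path length" column of Table 1 when improvement is read as the relative reduction of the average path length. The short check below makes that arithmetic explicit; the function name `improvement` and the dictionary `TOTAL_AVG` are names introduced here for illustration only.

```python
def improvement(baseline, improved):
    """Relative reduction of the average path length, in percent."""
    return (baseline - improved) / baseline * 100.0


# Total average path lengths copied from Table 1.
TOTAL_AVG = {"OQL": 449.1149, "CQL": 418.33, "EQL": 42.4738, "CEQL": 40.3098}

cql_vs_oql = improvement(TOTAL_AVG["OQL"], TOTAL_AVG["CQL"])
ceql_vs_eql = improvement(TOTAL_AVG["EQL"], TOTAL_AVG["CEQL"])
ceql_vs_oql = improvement(TOTAL_AVG["OQL"], TOTAL_AVG["CEQL"])

print(f"CQL vs OQL:  {cql_vs_oql:.2f}%")
print(f"CEQL vs EQL: {ceql_vs_eql:.2f}%")
print(f"CEQL vs OQL: {ceql_vs_oql:.2f}%")
```

The three values agree with the 6.85%, 5.09% and roughly 91% figures reported in the text.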
8 Conclusion In reinforcement learning, the exploration phase plays a significant role in the speed of learning. Different algorithms can serve as the random number generator in the exploration phase. We showed that a deterministic chaotic generator for exploration gives better performance than a stochastic random generator. To reduce the slowness of Q-learning, we applied genetic algorithms and used population based evolutionary computation, with chaos as the random generator in the exploration phase, to improve efficiency in a nondeterministic environment. Sutton's maze is a well-known problem in the reinforcement learning field, and we implemented a modified version of it to evaluate the proposed algorithm in a nondeterministic environment. The proposed algorithm achieves about 91.02% improvement over the original Q-learning algorithm in the average length of the found path, and about 5.09% improvement over non-chaotic evolutionary Q-learning on average. It can be inferred from the experimental results that employing chaos as the random generator in the exploration phase, together with evolutionary computation, improves reinforcement learning in both the rate of learning and accuracy. It can also be concluded that using chaos in the exploration phase is effective for both the evolutionary version and the non-evolutionary one.
References
1. Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
2. Cuayáhuitl, H.: Hierarchical Reinforcement Learning for Spoken Dialogue Systems. PhD thesis, University of Edinburgh (2009)
3. Vidal, J.M.: Fundamentals of Multi Agent Systems (2009) (unpublished)
4. Shoham, Y., Leyton-Brown, K.: Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, Cambridge (2009)
5. Morihiro, K., Matsui, N., Nishimura, H.: Effects of Chaotic Exploration on Reinforcement Maze Learning. In: Negoita, M.G., Howlett, R.J., Jain, L.C. (eds.) KES 2004. LNCS (LNAI), vol. 3213, pp. 833–839. Springer, Heidelberg (2004)
6. Kellert, S.H.: In the Wake of Chaos: Unpredictable Order in Dynamical Systems. University of Chicago Press, Chicago (1993) ISBN 0226429768
7. Meng, X.P., Meng, J., Lui, L.J.: Quantum Chaotic Reinforcement Learning. In: Fourth International Conference on Natural Computation (2008)
8. Parker, T.S., Chua, L.O.: Practical Numerical Algorithms for Chaotic Systems. Springer, Heidelberg (1989)
9. Ott, E., Sauer, T., Yorke, J.A.: Coping with Chaos: Analysis of Chaotic Data and the Exploitation of Chaotic Systems. John Wiley & Sons, Inc., New York (1994)
10. Handa, H.: Evolutionary Computation on Multitask Reinforcement Learning Problems. In: IEEE International Conference on Networking, Sensing and Control, pp. 685–688 (2007)
11. Goh, K., Tan, K.: Evolutionary Multi-objective Optimization in Uncertain Environments. Springer, Heidelberg (2009)
12. Jiang, J.: A Framework for Aggregation of Multiple Reinforcement Learning Algorithms. PhD thesis, University of Waterloo (2007)
13. Beigi, A., Parvin, H., Mozayani, N., Minaei, B.: Improving Reinforcement Learning Agents Using Genetic Algorithms. In: An, A., Lingras, P., Petty, S., Huang, R. (eds.) AMT 2010. LNCS, vol. 6335, pp. 330–337. Springer, Heidelberg (2010)
Parallel Graph Transformations Supported by Replicated Complementary Graphs Leszek Kotulski and Adam Sędziwy AGH University of Science and Technology, Institute of Automatics, al. Mickiewicza 30, 30-059 Kraków, Poland {kotulski,sedziwy}@agh.edu.pl
Abstract. Graph transformations are a powerful formalism for describing the behavior of systems of various types. The parallel computation paradigm makes computations faster, provided that we can reduce the additional costs related to communication overhead and to the complexity of designing such systems. The concept of replicated complementary graphs allows graph transformation rules designed for the centralized graph case to be executed in parallel in a distributed environment. The possibility and the cost of data replication are considered in the paper in the context of the double-pushout approach. Keywords: distributed computing, graph transformations.
1 Introduction
Graph transformations are a powerful modeling framework applicable in many areas of computer systems, such as system specification, software generation or task allocation control [12,2,3,4]. Problems with computational complexity can be overcome by introducing parallel graph transformation mechanisms. On the other hand, graph transformations that support parallelism, such as graph multiset [10] or Lindenmayer's L-system [11] transformations, have no efficient implementations of their parsers. The second problem is the difficulty of designing, implementing and maintaining distributed/parallel systems. It lies not only in parallel algorithms being less intuitive than sequential ones but also in new problems such as synchronization, deadlock avoidance or optimal data and task allocation. The GRADIS framework allows a problem to be described in terms of a centralized graph and then moved to a distributed environment composed of complementary graphs. Thus the sequential centralized description may be translated automatically into a parallel one. We assume that a problem description can be formalized by means of an attributed graph $G$ and a set of its transformation rules, $TR$. Under this assumption we introduce a form of distributed, partially replicated graphs that are equivalent to $G$ in the sense of a system description. This set of subgraphs $G_i$ will be referred to as Replicated Complementary Graphs (RCGs). The replication of some nodes belonging to the subgraphs $G_i$ is necessary to preserve the consistency (equivalence) between the distributed (based on RCGs) and centralized (using $G$) descriptions of a given system. The transformation of each subgraph $G_i$ (also called a local graph) is controlled by its Local Graph Transformation Agent with the help of transformation rules inherited from $TR$. The formal background of this approach is introduced and its properties are proved in the paper. In Section 2 we introduce the concept of replicated complementary graphs. In Section 3 we define the generic procedures underlying the GRADIS framework and prove their polynomial complexity. A detailed description of the application of a graph grammar production in a distributed multiagent environment is presented in Section 4, where an example of a double pushout production applied in an RCG environment is also discussed. In Section 5 we present the final conclusions of the article.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 254–264, 2011. © Springer-Verlag Berlin Heidelberg 2011
2 Replicated Complementary Graphs
Before we define the notion of RCGs we have to introduce some basic definitions and notations.

Definition 1. A $(\Sigma^v, \Sigma^e)$-graph is a triple $(V, E, \varphi)$ where $V$ is a nonempty set, $E$ is a subset of $V \times \Sigma^e \times V$, and $\varphi : V \to \Sigma^v$; $\Sigma^v$ and $\Sigma^e$ denote the sets of node and edge labels respectively. We denote the family of $(\Sigma^v, \Sigma^e)$-graphs as $\mathcal{G}$. For any $G = (V, E, \varphi) \in \mathcal{G}$, $V$ is the set of nodes, $E$ is the set of edges and $\varphi$ is the node labeling function. For a given graph $G$ the sets $V$ and $E$ will also be denoted as $V(G)$ and $E(G)$ respectively.

One can extend this definition, e.g. by introducing attributing functions for both nodes and edges, but such extensions do not affect the rules of the centralized graph distribution and its transformations, so they will not be considered here. We will also use the following notations:
– $\mathrm{Replicas}(G_i)$ is the set of all nodes of the graph $G_i$ that are shared with other local graphs, i.e. the nodes for which replicas exist;
– $\mathrm{Private}(G_i)$ is the set of all nodes of the graph $G_i$ that belong to $G_i$ only (have no replicas). Note that $V(G_i) = \mathrm{Replicas}(G_i) \cup \mathrm{Private}(G_i)$;
– $v \sim_{G_i} w$ means that $v, w \in V(G_i)$ and $(v, w) \in E(G_i)$ or $(w, v) \in E(G_i)$;
– $v \sim^{+}_{G_i} w$ means that there exist $u_1 = v, u_2, \dots, u_{k-1}, u_k = w \in V(G_i)$ such that $u_p \neq u_q$ for $p \neq q$ and $u_p \sim_{G_i} u_{p+1}$ for $p < k$; the sequence of nodes $(u_1, u_2, \dots, u_k)$ is called a proof of an acyclic connection between $v$ and $w$ (abbreviated as PAC);
– the set of all PACs between $w$ and $v$ is denoted as $\mathrm{Path}(G_i, v, w)$;
– the set of nodes belonging to all $s \in \mathrm{Path}(G_i, v, w)$ is denoted as $\mathrm{PNodes}(G_i, v, w)$.

Example. For the graph $G$ presented in Figure 1a we have $\mathrm{PNodes}(G, 4, 7) = \{4, 5, 6, 7, 10, 11\}$.

Definition 2. A set of partial graphs $G_i = (V_i, E_i, \varphi_i)$, $i = 1, 2, \dots, k$, is a replicated and complementary form of the graph $G$ iff there exist sets of nodes $\mathrm{Replicas}(G_i)$, $\mathrm{Private}(G_i)$, $\mathrm{Border}(G_i)$ and a set of injective homomorphisms $s_i : G_i \to G$ such that:
1. $\forall i \in \{1, \dots, k\}$: $V(G_i) = \mathrm{Replicas}(G_i) \cup \mathrm{Private}(G_i)$ and $\mathrm{Replicas}(G_i) \cap \mathrm{Private}(G_i) = \emptyset$
2. $\bigcup_{i=1,\dots,k} s_i(G_i) = G$
3. $\forall i, j \in \{1, \dots, k\}$: $s_i(V_i) \cap s_j(V_j) = s_i(\mathrm{Replicas}(G_i)) \cap s_j(\mathrm{Replicas}(G_j))$
4. $\forall i \in \{1, \dots, k\}$: $\mathrm{Border}(G_i) = \{v \in \mathrm{Replicas}(G_i) : \exists w \in \mathrm{Private}(G_i) \text{ and } v \sim_{G_i} w\}$
5. $\forall w \in \mathrm{Private}(G_i)$ $\forall v \in \mathrm{Private}(G_j)$ $\forall (u_1, u_2, \dots, u_k) \in \mathrm{Path}(G, w, v)$ $\exists b \in \mathrm{Border}(G_i)$: $s_i(b) = u_p$ for some $1 < p < k$

A partial graph $G_i$ is also referred to as a replicated and complementary graph (RCG). The above formal, axiomatic definition may be difficult to use in practice, especially for producing partial graphs. For that reason one has to find an effective algorithm for creating RCGs. In the following theorem we claim that this operation can be performed in polynomial time.

Theorem 1. Let $G = (V, E)$ be a given centralized graph and $\{V_i\}_{i=1,2,\dots,k}$ be a family of disjoint sets of nodes such that $\bigcup_i V_i = V$. One can create the set of RCGs in polynomial time.

Proof. We assume without loss of generality that for any two nodes $u, v \in V_i$ there exists an undirected path $p$ in $G$ such that all nodes of $p$ belong to $V_i$ (otherwise $V_i$ might be split into smaller subsets). For every $(v, \lambda, w) \in E(G)$ such that $v \in V_i$ and $w \in V_j$ for some indexes $i \neq j$, we replicate the node belonging to the set with the smaller index and add its replica to the set with the greater index. The graph $G_i$ is the restriction of the graph $G$ to the nodes $V_i$. The nodes being replicated are marked as replicas. The computational complexity of this algorithm is $O(|E|)$ (i.e. $O(|V|^2)$). It should be remarked that the generated set of replicated nodes is the minimal possible set for which $\{G_i\}_{i=1,2,\dots,k}$ are complementary graphs (i.e. for $1 \le i \le k$, $\mathrm{Replicas}(G_i) = \mathrm{Border}(G_i)$).
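The construction used in the proof of Theorem 1 can be sketched in a single pass over the edges. The code below is only an illustration under assumptions made here: replicas keep the identifier of the original node (instead of the (−1, j) indexing used in the paper), each cross-partition edge is stored in the partial graph with the greater index, and the names `make_rcgs`, `owner` and the returned triple layout are hypothetical.

```python
def make_rcgs(edges, parts):
    """Split a centralized graph into partial graphs with replicated nodes.

    edges -- iterable of (v, label, w) triples of the centralized graph
    parts -- list of disjoint node sets V_1..V_k covering all nodes
    Returns a list of (nodes, edges, replicas) triples, one per part.
    """
    owner = {v: i for i, part in enumerate(parts) for v in part}
    nodes = [set(part) for part in parts]
    local_edges = [set() for _ in parts]
    replicas = [set() for _ in parts]
    for v, label, w in edges:
        i, j = owner[v], owner[w]
        lo, hi = min(i, j), max(i, j)
        if i != j:
            # Replicate the endpoint owned by the smaller-index set
            # into the partial graph with the greater index.
            shared = v if i < j else w
            nodes[hi].add(shared)
            replicas[hi].add(shared)
            replicas[lo].add(shared)
        # Assumption: every edge is stored exactly once, in graph `hi`.
        local_edges[hi].add((v, label, w))
    return [(nodes[k], local_edges[k], replicas[k]) for k in range(len(parts))]
```

Each edge is inspected once, which matches the $O(|E|)$ bound stated in the proof (plus $O(|V|)$ for building the ownership map).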
Fig. 1. (a) G in centralized form. (b) {G1, G2, G3} - complementary form of G.
We also assume that:
– a node index consists of two numbers: the first one is the ordinal number of a given partial graph, or equals −1 in the case of a replicated node, and the second one is a unique index inside a particular partial graph or, for border nodes, a unique index in the set of all border nodes,
– replicated nodes are marked graphically by double circles, and all replicas of a given node share an identical index of the form (−1, j).

Example. Let us consider the graph $G$ shown in Figure 1a. To decompose it into a set of RCGs we set $V_1 = \{1, 2, \dots, 5\}$, $V_2 = \{6, 7, \dots, 9\}$, $V_3 = \{10, 11\}$. After replicating nodes according to the rule described above we get the RCGs shown in Figure 1b. Note that the replicated nodes labeled by D, E, F have unique indexes in the set of replicated nodes, and the indexes of private nodes are unique inside each partial graph.

Theorem 2. Let $\{G_i\}_{i=1..k}$ be a set of replicated complementary graphs. One can recompose a centralized graph $G$ from $\{G_i\}_{i=1..k}$ in polynomial time.

Proof. We merge two replicated graphs, $M$ and $N$, in the following way (Join(M, N) function): for every $v \in \mathrm{Border}(M)$ that has a replica $w$ in $N$ (i.e. $w \in \mathrm{Border}(N)$), we replace each edge $(x, w) \in E(N)$ by the edge $(x, v)$ and each edge $(w, x) \in E(N)$ by $(v, x)$. Finally, both graphs are glued at the nodes that were previously common replicas in $M$ and $N$. The result of this operation is a new graph. The algorithm described above has polynomial complexity. Its formalized form is presented below.

input : G1, G2, ..., Gk - the set of complementary partial graphs
output: G - centralized graph assembled from G1, G2, ..., Gk
G ← G1; S ← {G2, G3, ..., Gk};
while S ≠ ∅ do
    Select Gj ∈ S; S ← S − {Gj};
    G ← Join G and Gj;
    foreach replicated node v such that ∀Gx ∈ S: v ∉ Gx do
        Mark v as a private node (leave only one replica of v)
    Update indexes in Vj to keep them unambiguous in the context of G
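The recomposition loop above can be illustrated with a short sketch. It relies on a simplifying assumption that is not the paper's indexing scheme: all replicas of a node carry the same global identifier, so gluing two partial graphs reduces to a set union and no index update is needed. The name `join_all` and the `(nodes, edges, replicas)` triple layout are hypothetical.

```python
def join_all(graphs):
    """Recompose a centralized graph from partial graphs.

    graphs -- non-empty list of (nodes, edges, replicas) triples in which
              all replicas of one node share the same identifier
    Returns (nodes, edges, replicas) of the assembled graph; the replica
    set is empty once every partial graph has been merged.
    """
    nodes, edges, replicas = (set(part) for part in graphs[0])
    remaining = list(graphs[1:])
    while remaining:
        g_nodes, g_edges, g_replicas = remaining.pop()
        # Join: glue the graphs at their common replicas.
        nodes |= g_nodes
        edges |= g_edges
        replicas |= g_replicas
        # A replica that occurs in no remaining graph becomes private.
        still_shared = set().union(*(r for _, _, r in remaining))
        replicas &= still_shared
    return nodes, edges, replicas
```

Applied to the output of a decomposition that stores each edge once, the function returns the original node and edge sets with no node left marked as a replica.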
3 Multiagent Environment for Parallel Graph Transformations
The RCG concept is supported by the GRADIS multiagent framework, in which each agent controls one local graph $G_i$. Following the FIPA [14] specification [15] we assume a very simple functionality of the multiagent environment, reduced to message transport and a broker system. This approach is similar to those applied in popular frameworks such as JADE [16] or Retsina [17]. Low-level message passing mechanisms and agent synchronization are hidden inside the following GRADIS operations: SplitGraph, JoinGraphs, Incorporate, Expand, Neighborhood. The first two operations support creating agents and recomposing a centralized graph by gathering the information maintained by the agents.

SplitGraph($G^0_1$, $V_H$) operation. $G^0_1$ is a given graph and $V_H \subset V(G^0_1)$. The operation produces two partial graphs, $G^1_0$ and $G^1_1$. The replicated complementary graphs $G^1_0$ and $G^1_1$ are created for the sets $V_H$ and $V(G^0_1) \setminus V_H$ respectively, in compliance with the rules described in the proof of Theorem 1. A new agent is created for $G^1_0$, while the agent that executes the SplitGraph($G^0_1$, $V_H$) operation, say $A$, continues to maintain the graph $G^1_1$.

Remark. To split the initial graph $G^0_1$ according to subsets $\{V_i\}_{i=1..p}$ ($V_i \cap V_j = \emptyset$ for $i \neq j$ and $\bigcup_{i=1}^{p} V_i = V(G^0_1)$), agent $A$ should iteratively execute SplitGraph($G^k_1$, $V_{k+1}$), where for a graph $G^k_j$, $k$ denotes the number of the iteration ($k = 0, 1, \dots, p-1$) and $j = 0, 1$ indexes the resultant partial graphs in a given iteration. It is easy to check that for a given set of RCGs, $\mathbf{G}$, splitting any $G_i \in \mathbf{G}$ into some $G_i'$, $G_i''$ preserves the properties of $\mathbf{G}$: $G_i'$ and $G_i''$ are complementary themselves, and the SplitGraph procedure does not influence graphs other than $G_i$ (only the internal data inside replicated nodes, maintained to improve the efficiency of agent communication, can be modified or enriched).

JoinGraphs(DEL) operation broadcasts to all agents a request to send copies of the complementary graphs they maintain; next it assembles the gathered portions of data (in a transactional mode) and finally recovers the centralized graph $G$. If DEL = true then all agents that have provided their local graphs finish their work (i.e. are killed). The next three operations are associated with RCGs; they enhance the replication of nodes and enable the application of graph grammar transformation rules.
Expand(v, i, COND) - for each replica $u$ of a given $v \in \mathrm{Replicas}(G_i)$, all neighbors of $u$ that fulfill a given condition COND are replicated, and these new replicas are attached to $G_i$ together with the edges connecting them with $u$. The condition COND makes it possible to limit the growth of the graph $G_i$ by taking only selected nodes, according to their labeling, attributes or other properties specified in COND. This is very useful in practical applications, but in this paper we do not consider it and assume for simplicity that COND ≡ true.

Incorporate(v, i) - for a given $v \in \mathrm{Replicas}(G_i)$ this operation shifts the boundaries between partial graphs in the following way (let $u$ denote a replica of $v$): all neighbors of $u$ (existing in other complementary graphs) are replicated, and these new replicas are added to $G_i$ together with the edges connecting them with $u$. Then $u$ is removed from the other partial graphs (so that $v$ finally becomes a private node in $G_i$). Sometimes the node incorporation may split another partial graph into several subgraphs; if any of those fragments consists of a single isolated node (a replicated one) then we remove it, otherwise new agents are created to maintain the new partial graphs. Note that removing such a replicated isolated node does not affect the consistency of the RCGs because other replicas still exist in the RCGs.
Fig. 2. Complementary form of G, {G1 , G2 , G3 } after Incorporate((−1, 2), 3)
Fig. 3. Complementary form of G, {G1 , G2 , G3 } after Expand((−1, 2), 3, true)
Neighborhood(v, G, k) - for any node $v \in V(G_i)$ it returns the graph $B$ such that $V(B) = \{w : \exists p \in \mathrm{Path}(G, v, w)$ and the number of nodes in $p$ is less than or equal to $k\}$. Determining $B$ requires recursive cooperation of the agents maintaining the partial graphs, and its cost estimation will be discussed in the next section.

Example. Let us consider the complementary graphs shown in Fig. 1b. The effect of applying the operation Incorporate((−1, 2), 3) is presented in Fig. 2, and the effect of applying the operation Expand((−1, 2), 3, true) is presented in Fig. 3. It should be remarked that $G_3$ in Figures 2 and 3 consists of the same nodes in both cases, but in the incorporation case the node labeled with D is a private one. The significant difference appears in the structure of the graph $G_1$, from which the node (−1, 2) was incorporated. The lack of the node labeled by E in the graph $G_1$ in Fig. 2 is the result of the incorporation of the node (−1, 2). This incorporation split the graph $G_1$ into two subgraphs: the first is shown in Fig. 2 and the second consisted only of the single node labeled by E. Since this node was replicated, we could remove it without loss of consistency (the other replicas (−1, 1) are left in $G_2$ and $G_3$).

Theorem 3. The SplitGraph, JoinGraphs, Incorporate, Neighborhood and Expand procedures have polynomial computational complexity.

Proof. Two factors influence the computational complexity in this case: the complexity of the algorithms and the communication overhead, measured in the number of exchanged messages. The two-phase commit protocol, which assures the transactional execution of these operations, has linear complexity in the number of exchanged messages (i.e. $O(p)$, where $p$ is the number of agents). This fact, in conjunction with Theorems 1 and 2, proves the complexity of the SplitGraph and JoinGraphs operations.
The Expand(v, i, COND) and Incorporate(v, i) operations require only a message exchange (in a transactional mode) between the agent $A_i$ and the agents maintaining replicas of the node $v$; we assume that a replicated node stores inside its structure a list of the agents holding its replicas, so those agents know one another. Thus the total complexity of these procedures is linear (i.e. $O(p)$, where $p$ is the number of agents). Determining the complexity of evaluating $B$ = Neighborhood(v, G, k) is more involved. In [8] it was proved that this complexity is both $O(p^k)$ and $O(n \cdot p)$, where $n$ is the number of nodes. We consider both bounds because in practice they help to reduce the number of messages sent. The SplitGraph operation used for splitting a global graph $G$ produces a "raw" decomposition of the centralized graph. To obtain an optimal partitioning one has to tune this "raw" set of RCGs after the initial SplitGraph. In general, finding a partitioning of $G$ that is optimal according to a given criterion is an NP-hard problem (one has to check all possible decompositions). The relaxation method proposed in [9], which uses the size of $G_i$ as the criterion, gives a partitioning in which 98% of the partial graphs fulfill the size criterion.
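For intuition, the node set returned by Neighborhood(v, G, k) can be computed on a centralized graph with a bounded breadth-first search. The sketch below deliberately ignores the distributed, agent-based evaluation whose cost is discussed above; it follows the definition as stated, counting path length in nodes, so that k = 1 yields only v itself. The function name `neighborhood` and the adjacency-dictionary representation are assumptions made here.

```python
from collections import deque


def neighborhood(adjacency, v, k):
    """Nodes reachable from v by a simple path containing at most k nodes.

    adjacency -- dict: node -> iterable of neighbors (undirected view of G)
    """
    # A node lies on some simple path with at most k nodes exactly when
    # its shortest path from v has at most k nodes, so BFS is sufficient.
    nodes_on_path = {v: 1}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if nodes_on_path[u] >= k:
            continue  # extending this path would exceed k nodes
        for w in adjacency.get(u, ()):
            if w not in nodes_on_path:
                nodes_on_path[w] = nodes_on_path[u] + 1
                queue.append(w)
    return set(nodes_on_path)
```

In the distributed setting each agent would answer such a query only for its own partial graph and forward it through border nodes, which is where the message bounds from [8] come into play.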
4 Local Application of the Transformation Rules
While graphs represent the current state of a system, transformation rules make it possible to formalize its dynamic behavior. We assume that we are given a set of transformation rules $TR$ (referred to as productions of a graph grammar) defined for the centralized graph $G$. There are many ways to define graph grammar productions for algebraic and algorithmic graph transformations, but the main idea of the proposed solution does not depend on the type of graph transformation; the variants differ only in technical details. Their common feature is the existence of a graph $L$ that is the left-hand side of a production $P \in TR$. The production $P$ can be applied to the graph $G$ when a homomorphism $m : L \to G$ is defined. The occurrence $m(L)$ is replaced by the graph $R$, the right-hand side of the production $P$ (i.e. $R$ is embedded in $G - m(L)$). The approaches differ in the way $R$ is embedded in $G - m(L)$ and in the expressive power of the grammar. In this section we investigate the possibility of applying $TR$ to RCGs maintained by the GRADIS multiagent framework. We consider the double pushout approach [1] in detail. For clarity of presentation we assume that the morphism $m$ is the identity mapping.

Theorem 4. Let $TR$ be the set of productions of the graph grammar and $\{G_i\}$ be a finite set of replicated complementary graphs obtained from the graph $G$. Then we are able to apply $TR' \subseteq TR$ to $\{G_i\}$ in a parallel way that is equivalent to the sequential application of $TR'$ to the graph $G$ (i.e. in the centralized case). The complexity of preparing the environment for a local application of $P \in TR$ is polynomial.

Proof. To apply the production $P \in TR$ the agent $A_0$ has to follow three steps:
– determine $B_0 = V(L) \cap V(G_0)$ and, if needed, find an occurrence of $V(L) - B_0$ in the distributed environment,
– incorporate $V(L) - B_0$,
– apply the production $P$ locally.

The first step is finding an occurrence of $L$ in the partial graph managed by the agent $A_0$ and/or in the neighboring ones. $A_0$ does that starting from some seed node $v \in V(L) \cap V(G_0)$ and looking up the neighborhood of $v$. Clearly $L \subseteq B$ = Neighborhood(v, G, k), where for a given $L$, $k \le |V(L)|$ is the diameter of the undirected graph underlying $L$. As nodes of $B$ can be placed in other partial graphs, $B$ is computed in cooperation with other agents: $A_0$ sends a request to those agents (denoted as $A_j$, $j = 1, 2, \dots, n$) and receives the required data ($B_j$) in response. It has to be emphasized that this data is volatile, i.e. some nodes may be removed from a given partial graph after the requested data has been delivered to $A_0$. $B$ is recovered from all $B_j$ ($j = 0, \dots, n$): $B = \bigcup_{j=0,\dots,n} B_j$. When an occurrence of $L$ in $B$ is found, $A_0$ may proceed to the second step¹. The part of $B$ that is essential for the production $P$ is $B' = L \cap B$. At the level of a particular partial graph it is $B_i' = L \cap B_i$.

In the second step $A_0$ has to ensure that $L$ will be accessible for applying the production $P$. According to the semantics of the two-phase commit protocol, in the first phase the agent $A_0$ sends a request to the other agents, $A_i$, to block all nodes belonging to the corresponding sets $B_i'$. $A_i$ blocks the nodes (agreement=yes) or rejects the request in two possible circumstances:
1. some nodes of $B_i'$ are already blocked (agreement=noaccess), or
2. some nodes of $B_i'$ do not exist in the partial graph managed by $A_i$ (agreement=nonexist), e.g. they were incorporated in the meantime by some other agents.

The second phase depends on the feedback data received by $A_0$. It can follow one of three possible scenarios:
1. If all responses are agreement=yes then $A_0$ sends a commit request to all agents $A_i$, which supply the corresponding $B_i'$ to $A_0$ in response.
2. If at least one response is agreement=nonexist then $A_0$ sends an abort request to all agents $A_i$ and tries to find $B$ ($B'$) again.
3. If at least one response is agreement=noaccess then $A_0$ sends an abort request to all agents $A_i$ and repeats the first phase after a random delay.

In the third step, when all responses are agreement=yes, $A_0$ shifts the boundaries of the local graphs (by incorporating the corresponding subgraphs $B_i'$) to obtain exclusive access to the nodes required to apply the production $P$ locally. Note that after $A_0$ has incorporated $B_i'$ ($i = 1, 2, \dots, k$), all nodes belonging to those subgraphs become private nodes of $G_0$, and all nodes neighboring the subgraphs $B_i'$ become or remain border nodes of $G_0$.

¹ If $B_i \cap L \subset B_j \cap L$ for some $i$ and $j$ then only $B_j$ is taken to the second step.
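The decision that the initiating agent takes in the second phase can be summarized in a few lines. This is a hedged sketch rather than GRADIS code: the function name `second_phase`, the returned action labels, the `max_delay` parameter and the rule that a `nonexist` answer takes precedence over `noaccess` when both occur are assumptions introduced here; the paper lists the scenarios without ordering them.

```python
import random


def second_phase(responses, rng=random, max_delay=1.0):
    """Choose the initiating agent's next action from phase-one answers.

    responses -- dict: agent name -> "yes" | "noaccess" | "nonexist"
    Returns (action, delay_in_seconds).
    """
    answers = set(responses.values())
    if answers <= {"yes"}:
        # Scenario 1: every agent blocked its nodes, so commit.
        return "commit", 0.0
    if "nonexist" in answers:
        # Scenario 2: the neighborhood data is stale; abort and search again.
        return "abort_and_search_again", 0.0
    # Scenario 3: nodes are blocked by someone else; abort and retry the
    # first phase after a random delay.
    return "abort_and_retry", rng.uniform(0.0, max_delay)
```

The random delay in the third scenario keeps two agents that compete for the same nodes from retrying in lockstep.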
Fig. 4. (a) The sample of DPO production P = (L, K, R). (b) G3 ready for the local production P (left) and G3 obtained by applying P (right).

Table 1. The $B_i$ and $B_i'$ sets

i                         $B_i$                        $B_i'$
i = 1 (request to A1)     (−1, 1), (−1, 2), (1, 3)     (−1, 1), (−1, 2)
i = 2 (request to A2)     (−1, 1), (−1, 3), (2, 1)     (−1, 1), (−1, 3)
Example. Let us consider the DPO graph grammar production of the form $P : L \leftarrow K \rightarrow R$ (Fig. 4a) defined for the centralized graph $G$ shown in Fig. 1a. We want to apply $P$ in the distributed environment consisting of the three complementary graphs shown in Fig. 1b. The agent maintaining the complementary graph $G_3$, say $A_3$, is to apply $P$ locally. The first step of $A_3$ is to ensure exclusive access to the nodes matching $V(L)$ by incorporating them. A detailed description follows. $A_3$ (managing the partial graph $G_3$) has to discover and incorporate all the nodes matching $L$, namely (−1, 1), (−1, 2), (−1, 3). The border nodes (−1, 2) and (−1, 3) are already present in $G_3$ (they match the nodes of $L$ indexed with 1 and 3, as shown in Fig. 4a), so $A_3$ tries to discover the missing node, which matches the vertex of $L$ indexed with 2 and labeled with E. The diameter of $L$ is 1, hence $A_3$ sends a request to the partial graphs containing replicas of $u$ = (−1, 2) and $w$ = (−1, 3) to supply $B_i$ = Neighborhood($v_b$, G, 1), where $v_b = u, w$ respectively. Table 1 presents the sets of nodes of the particular $B_i$ and $B_i'$ ($i = 1, 2$). Using the sets $B_i'$ instead of $B_i$ may improve effectiveness by reducing the number of nodes to be incorporated. After identifying an occurrence of $L$ in $B = \bigcup_{i=1}^{3} B_i$, $A_3$ incorporates all nodes matching $L$, i.e. (−1, 1), (−1, 2), (−1, 3). We assume that no parallel action is taken by the agents $A_1$ and $A_2$, so $A_3$ receives agreement=yes messages and sends the commit request to them. After the subsequent incorporations of the nodes (−1, 2), (−1, 1), (−1, 3) (their new indexes are (3, 3), (3, 5), (3, 4) respectively) we obtain the new $G_3$. The left-hand side graph $L$ can be matched to the internal nodes of $G_3$ labeled by D, E and F (shaded nodes in Fig. 4b), and thus the production $P$ can be applied locally. The resultant complementary graph $G_3$ is shown in Fig. 4b (on the right).

Introducing the replicated form of complementary graphs optimizes the cooperation of agents applying graph transformation rules. Node replication extends the scope of visibility of a graph from the point of view of a single agent. Note that in the case of the replication presented in Fig. 3, the agent maintaining $G_3$ does not need to recover the neighborhood environment ($B$) in order to find a matching of the left-hand side of the production, $L$, in $G_3$. Moreover, in the previous method some incorporations are not necessary for the production $P$, but we have to incorporate $B'$. In the case of DPO graph transformations, because of the dangling condition, only the nodes being removed (i.e. those belonging to $L - K$) must be private; other vertices may be either replicated or private [7]. Thus, to apply the production $P$ in the case of the graph presented in Fig. 3, we have to perform only the Incorporate((−1, 1), 3) operation. Without replication (Fig. 1b), one of the procedures Incorporate((−1, 3), 3) or Incorporate((−1, 2), 3) has to be performed to make the node (−1, 1) visible inside $G_3$.
5 Conclusions
Graphs and their transformations, realized as productions of graph grammars, are a formalism suitable for describing and modeling various types of systems, including evolving ones. Unfortunately, in most cases its disadvantage is the exponential computational complexity of the algorithms (e.g. for parsing or for the membership problem). Even using grammars with polynomial parsing time (such as ETPL(k) or edNLC) may not be sufficient to make the approach applicable in practice. It seems that the proper solution to these problems is a graph representation that enables the parallelization of computations and their distribution over a multiagent system. The proposed solution has to fulfill two additional conditions: (i) the new, parallel representation has to be equivalent to the centralized description during the entire lifetime of the system, and (ii) the complexity of migrating from the centralized to the distributed description and back again has to be polynomial. The concept of replicated complementary graphs applied in the GRADIS platform and presented in the paper satisfies these postulates. We proved that both the migration between the centralized and distributed forms of the system description and all core procedures underlying the RCG environment have polynomial complexity. Those considerations also include the message exchange in the multiagent environment. The described model is being implemented on the JADE platform [16] in compliance with FIPA specifications.
Acknowledgment The paper is supported by the Alvis project funded from 2009-2011 resources as a research project.
References
1. Corradini, A., Montanari, U., Rossi, F., Ehrig, H., Heckel, R., Löwe, M.: Algebraic approaches to graph transformation - part I: Basic concepts and double pushout approach. In: [12], pp. 163–246 (1997)
2. Ehrig, H., Engels, G., Kreowski, H.-J., Rozenberg, G.: Handbook of Graph Grammars and Computing By Graph Transformation: Volume II, Applications, Languages, and Tools. World Scientific Publishing Co., River Edge (1999)
3. Ehrig, H., Kreowski, H.-J., Montanari, U., Rozenberg, G.: Handbook of Graph Grammars and Computing By Graph Transformation: Volume III, Concurrency, Parallelism, and Distribution. World Scientific Publishing Co., River Edge (1999)
4. Ehrig, H., Ehrig, K., Prange, U., Taentzer, G.: Fundamentals of Algebraic Graph Transformation (Monographs in Theoretical Computer Science. An EATCS Series). Springer-Verlag New York, Inc., Secaucus (2006)
5. Ehrig, H., Ermel, C., Hermann, F.: On the relationship of model transformations based on triple and plain graph grammars. In: GRaMoT 2008: Proceedings of the Third International Workshop on Graph and Model Transformations, New York, NY, USA, pp. 9–16 (2008)
6. Kotulski, L.: GRADIS – Multiagent Environment Supporting Distributed Graph Transformations. In: Bubak, M., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2008, Part III. LNCS, vol. 5103, pp. 644–653. Springer, Heidelberg (2008)
7. Kotulski, L., Sędziwy, A.: Parallel Graph Transformations With Double Pushout Grammars. In: ICAISC 2010. LNCS (LNAI), vol. 6113, pp. 280–288 (2010)
8. Kotulski, L., Sędziwy, A.: On the Complexity of Coordination of Parallel Graph Transformations. In: Fourth International Conference on Dependability of Computer Systems DepCoS - RELCOMEX 2009, pp. 279–289 (2009)
9. Kotulski, L., Sędziwy, A.: On the effective distribution of knowledge represented by complementary graphs. In: Jedrzejowicz, P., Nguyen, N.T., Howlett, R.J., Jain, L.C. (eds.) KES-AMSTA 2010. LNCS (LNAI), vol. 6070, pp. 381–390. Springer, Heidelberg (2010)
10. Kreowski, H.J., Kluske, S.: Graph Multiset Transformation as a Framework for Massive Parallel Computation. In: Ehrig, H., Heckel, R., Rozenberg, G., Taentzer, G. (eds.) ICGT 2008. LNCS, vol. 5214, pp. 351–365. Springer, Heidelberg (2008)
11. Prusinkiewicz, P., Lindenmayer, A.: The Algorithmic Beauty of Plants. Springer, New York (1990)
12. Rozenberg, G.: Handbook of Graph Grammars and Computing By Graph Transformation: Volume I, Foundations. World Scientific Publishing Co., River Edge (1997)
13. Sycara, K.P.: Multiagent system. AI Magazine, 79–92 (1998)
14. FIPA, http://www.fipa.org/
15. FIPA ACL Specifications, http://www.fipa.org/repository/aclspecs.html
16. JADE, http://jade.tilab.com/
17. Retsina, http://www.cs.cmu.edu/~softagents/retsina.htm
Diagnosis of Cardiac Arrhythmia Using Fuzzy Immune Approach
Olgierd Unold
Institute of Computer Engineering, Control and Robotics, Wroclaw University of Technology, Wyb. Wyspianskiego 27, 50-370 Wroclaw, Poland
[email protected] http://olgierd.unold.staff.iiar.pwr.wroc.pl/
Abstract. Classification is an important data mining task in biomedicine. Because they are easy to comprehend, rules are preferable to other functions in the analysis of biomedical data. The aim of this work is to use a new fuzzy immune rule-based classification system for the medical diagnosis of a cardiovascular disease. The fuzzy immune approach (FIA), which includes our improvements, is a new method and is applied here to an ECG dataset for the first time. The performance of the proposed approach, in terms of classification accuracy, ROC curves, and area under the ROC curve (AUC), was compared with traditional classifier schemes: C4.5, Naïve Bayes, KStar, Meta END, and ANN. The classification accuracies and AUC statistics of FIA for the data sets used are the highest among the classifiers reported on the UCI website and other classifiers used for related problems and tested by cross validation.
Keywords: Machine learning, Fuzzy logic, Artificial immune system, Data mining.
1 Introduction
The electrocardiogram (ECG) is a measure of the electrical activity of the heart. Since its introduction in 1893, the ECG has been used as a clinical tool for evaluating heart function. A number of cardiovascular diseases, such as arrhythmia and coronary artery disease, can be detected non-invasively using ECG monitoring devices. An extended set of features is typically extracted from the ECG time series. With so many factors to analyze, a variety of machine learning techniques can be employed to assist in cardiovascular disease classification. Fuzzy logic has been applied to the design of classification systems because of its powerful capabilities for handling uncertainty and vagueness. A fuzzy rule-based classification system (FRBCS) is a special case of fuzzy modeling in which the output of the system is crisp and discrete. Basically, the design of an FRBCS consists of finding a compact set of fuzzy IF-THEN classification rules able to predict the class of an unknown example. The most challenging problem in the design of FRBCSs is the construction of the rule base for the task to be solved. Many approaches have been proposed to construct the rule base from data (see [26]).
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 265–274, 2011. © Springer-Verlag Berlin Heidelberg 2011
Some quite novel approaches integrate Artificial Immune Systems (AISs) [6] and fuzzy systems to find fuzzy rules that are not only accurate but also linguistically interpretable and that predict the class of an example. The first AIS-based method for fuzzy rule mining was proposed in [2]. This approach, called IFRAIS (Induction of Fuzzy Rules with an Artificial Immune System), uses sequential covering and clonal selection to learn IF-THEN fuzzy rules. One of the AIS-based algorithms for mining IF-THEN rules extends the negative selection algorithm with a genetic algorithm [8]. Another one focuses mainly on clonal selection and a so-called boosting mechanism that adapts the distribution of training instances over the iterations [1]. A fuzzy AIS was also proposed in [17]; however, that work addresses the task of clustering rather than classification.
Efficient and effective classification is a core problem in biomedical data mining. The aim of this work is to use a new fuzzy immune approach, derived from the IFRAIS concept, for cardiac arrhythmia classification. We have applied the fuzzy immune approach (FIA) to distinguish between healthy and diseased persons. Quite a few reported studies have focused on the diagnosis of ECG arrhythmia, with varying classification accuracies on the dataset taken from the UCI machine learning repository¹. Güvenir et al. obtained 56.29% accuracy on the diagnosis of arrhythmia with the CFI classification algorithm and 5-fold cross validation [11]. Soman and Bobbie obtained 59.47% classification accuracy using the OneR algorithm with a 50-50% training-test split [24]. Piatetsky-Shapiro and Frawley obtained 74.73% with an 80-20% training-test split using the Naïve Bayes approach [19]. Polat et al. reached 76.2% accuracy using fuzzy weighted pre-processing and an artificial immune recognition system with 10-fold cross validation, and 80.71% accuracy with an 80-20% training-test split [20]. In [21], Polat et al. reported 100% accuracy with a support vector machine approach for all the training-to-test splits.
The performance of the system was analyzed with regard to the classification accuracy and the area under the Receiver Operating Characteristic (ROC) curve. ROC curves were generated to present the obtained results. The performance of the presented method exceeds that of other studies applied to the ECG dataset classification problem so far.
This paper is organized as follows. Section 2 describes the methods used in the research and then gives details of the proposed approach. Section 3 discusses the experimental results. Finally, Section 4 concludes the paper and outlines future work.
2 Fuzzy Immune Approach
A fuzzy rule can be expressed as a conditional statement of the form:
Rule $R_j$: IF $x_1$ is $A_{j1}$ and ... and $x_n$ is $A_{jn}$ THEN Class $C_j$,
where $R_j$ is the label of the j-th rule, $x = (x_1, ..., x_n)$ is an n-dimensional pattern vector, and $A_{ji}$ ($i = 1, ..., n$) is an antecedent fuzzy set, i.e. a linguistic variable. A linguistic variable is a fuzzy variable. The set of fuzzy subsets of the linguistic variable is called a fuzzy partition.
¹ http://archive.ics.uci.edu/ml/ (last accessed January 2011).
Fuzzy reasoning includes two distinct parts: evaluating the rule antecedent (i.e. the IF part of the rule) and applying the result to the consequent (i.e. the THEN part of the rule). The truth membership grade of the rule consequent can be estimated directly from the corresponding truth membership grade of the antecedent [5]. A fuzzy rule can have multiple antecedents. All parts of the antecedent are calculated simultaneously and resolved into a single number using fuzzy set operations (such as the intersection, min).
Generation of fuzzy IF-THEN rules from numerical data consists of two phases. The first is the fuzzy partition of each attribute into a number of linguistic variables and the determination of the membership function for each linguistic variable. The second is the determination of the fuzzy IF-THEN rules.
The use of the Artificial Immune Paradigm for discovering comprehensible IF-THEN classification rules is much less explored in the literature than traditional rule induction techniques for fuzzy rule-based systems. The presented fuzzy immune approach inherits its architecture from IFRAIS [2], but differs from it in the following ways:
– it uses a uniform population [16] instead of the randomly created initial population of the standard IFRAIS,
– it buffers rules in the clonal selection algorithm [15],
– it learns the fuzzy partition [12] instead of using three, and only three, linguistic terms for each continuous attribute as in the standard IFRAIS.
FIA, as an Artificial Immune System, evolves a population of antibodies, each representing the IF part of a fuzzy rule, whereas each antigen represents a learning example. Each rule antecedent consists of a conjunction of rule conditions. FIA, like IFRAIS, uses sequential covering as its main learning algorithm. In the first step, the set of rules is initialized as an empty set.
Next, for each class to be predicted, the algorithm initializes the training set with all training examples and iteratively calls the clonal selection procedure with two parameters: the current training set and the class to be predicted. The clonal selection procedure returns a discovered rule; the learning algorithm then adds the rule to the rule set and removes from the current training set the examples that are correctly covered by the evolved rule.
The clonal selection algorithm is used to induce the rule with the best fitness from the training set (see Algorithm 1). The basic elements of this method are antigens and antibodies, which refer directly to biological immune systems. An antigen is an example from the data set and an antibody is a fuzzy rule. Similarly to the fuzzy rule structure, which consists of fuzzy conditions and a class value, an antibody comprises genes and an informational gene. The number of genes in an antibody is equal to the number of attributes in the data set. Each gene consists of a fuzzy condition and an activation flag that indicates whether the fuzzy condition is active or inactive. In the first step, the algorithm generates a uniform antibody population in which the informational gene of each antibody equals the class value c passed as an algorithm parameter. Next, each antibody of the generated population is pruned. Rule pruning has a twofold motivation: reducing the overfitting of the rules to the data and improving the simplicity (comprehensibility) of the rules [25].

Algorithm 1. Clonal selection algorithm
Input: training set, class value (c)
Output: fuzzy rule
  CREATE uniformly antibodies population with size s and class value c
  for all antibody A in antibodies population do
    PRUNE(A)
    COMPUTE FITNESS(A, training set)
  end for
  for i = 1 to number of generations n do
    while clones population size < s − 1 do
      antibody to clone = TOURNAMENT SELECTION(antibodies population)
      clones = CREATE x CLONES(antibody to clone)
      clones population = clones population + clones
    end while
    for all clone K in clones population do
      muteRatio = MUTATION PROBABILITY(K)
      MUTATE(K, muteRatio)
      PRUNE(K)
      COMPUTE FITNESS(K, training set)
    end for
    antibodies population = SUCCESSION(antibodies population, clones population)
  end for
  result = BEST ANTIBODY(antibodies population)
  return result

The fitness of a rule is computed according to the formula
$$\mathit{fitness}(rule) = \frac{tp}{tp + fn} \cdot \frac{tn}{tn + fp} \tag{1}$$
where:
– tp is the number of examples satisfying the rule and having the same class as predicted by the rule (true positives);
– fn is the number of examples that do not satisfy the rule but have the class predicted by the rule (false negatives);
– tn is the number of examples that do not satisfy the rule and do not have the class predicted by the rule (true negatives);
– fp is the number of examples that satisfy the rule but do not have the class predicted by the rule (false positives).
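As an illustration, the fitness of Eq. (1), i.e. the product of sensitivity and specificity, can be sketched in Python as follows. This is only a minimal sketch, not the implementation used in the paper: the function name `rule_fitness`, the two Boolean input lists, and the handling of a zero denominator are assumptions made for the example.

```python
def rule_fitness(satisfies, has_class):
    """Fitness of a rule as sensitivity * specificity, following Eq. (1).

    satisfies[i] -- True if example i satisfies the rule antecedent
    has_class[i] -- True if example i has the class predicted by the rule
    (Both lists are hypothetical inputs chosen for this sketch.)
    """
    pairs = list(zip(satisfies, has_class))
    tp = sum(1 for s, c in pairs if s and c)          # covered, right class
    fn = sum(1 for s, c in pairs if not s and c)      # not covered, right class
    tn = sum(1 for s, c in pairs if not s and not c)  # not covered, other class
    fp = sum(1 for s, c in pairs if s and not c)      # covered, other class

    # Assumption: a zero denominator yields 0; the paper does not
    # say how this case is handled.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity * specificity


if __name__ == "__main__":
    covered = [True, True, False, False, True]
    positive = [True, False, True, False, False]
    print(rule_fitness(covered, positive))
```

A rule that covers exactly the examples of its class obtains fitness 1, while a rule that covers everything or nothing obtains fitness 0, which is why the product penalizes both over-general and over-specific rules.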
Since the rules are fuzzy, the computation of tp, fn, tn and fp involves measuring the degree of affinity between the example and the rule. This is computed by applying the standard fuzzy aggregation operator min:
$$\mathit{affinity}(rule, example) = \min_{i=1}^{condCount} \mu_i(att_i) \tag{2}$$
where $\mu_i(att_i)$ denotes the degree to which the corresponding attribute value $att_i$ of the example belongs to the fuzzy set associated with the i-th rule condition, and condCount is the number of rule antecedent conditions. The degree of membership is not calculated for an inactive rule condition, and if the i-th condition contains a negation operator, the membership degree equals $1 - \mu_i(att_i)$ (complement). An example satisfies a rule if affinity(rule, example) > L, where L is an activation threshold.
Next, the antibody to be cloned is selected from the antibody population by tournament selection. For each antibody to be cloned, the algorithm produces x clones, where x is proportional to the fitness of the antibody. Each clone then undergoes hypermutation, with a mutation rate inversely proportional to the clone's fitness. Once a clone has undergone hypermutation, its rule antecedent is pruned using the rule pruning procedure explained previously. Finally, the fitness of the clone is recomputed using the current training set. In the last step, the T worst-fitness antibodies in the current population are replaced by the T best-fitness clones out of all clones produced by the clonal selection procedure. The clonal selection procedure then returns the best evolved rule, which is added to the set of discovered rules by the sequential covering algorithm.
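The affinity of Eq. (2), together with the treatment of inactive and negated conditions and the activation threshold L, can be sketched as follows. The sketch rests on several assumptions that are not taken from the paper: the `Condition` class, the triangular membership function, and the convention that a rule without active conditions has affinity 1 are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Condition:
    """One gene of an antibody (hypothetical representation)."""
    membership: Callable[[float], float]  # mu_i: attribute value -> [0, 1]
    active: bool = True                   # activation flag of the gene
    negated: bool = False                 # True if the condition is negated


def triangular(a: float, b: float, c: float) -> Callable[[float], float]:
    """Triangular membership function; an illustrative choice only."""
    def mu(x: float) -> float:
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu


def affinity(conditions: Sequence[Condition], example: Sequence[float]) -> float:
    """Minimum membership degree over the active conditions, as in Eq. (2)."""
    degrees = []
    for cond, value in zip(conditions, example):
        if not cond.active:
            continue  # inactive conditions do not contribute
        mu = cond.membership(value)
        degrees.append(1.0 - mu if cond.negated else mu)  # complement if negated
    # Assumption: a rule with no active condition matches every example.
    return min(degrees) if degrees else 1.0


def satisfies(conditions, example, threshold: float) -> bool:
    """An example satisfies the rule if its affinity exceeds the threshold L."""
    return affinity(conditions, example) > threshold


if __name__ == "__main__":
    medium = triangular(0.0, 5.0, 10.0)
    rule = [Condition(medium), Condition(medium, negated=True),
            Condition(medium, active=False)]
    print(affinity(rule, [5.0, 2.0, 100.0]))
```

The value of the threshold L is deliberately left as a required argument, since the text does not state which value was used.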
3 Experimental Results
3.1 Datasets
In this study we used the Echocardiogram dataset obtained from the UCI machine learning repository². The ECG dataset is grouped into two broad classes to facilitate its use in experimentally determining the presence or absence of arrhythmia and in identifying the type of arrhythmia. In the set, class 0 refers to the ECG of a deceased patient and class 1 to the ECG of a living patient. The dataset has 13 attributes and contains 88 deceased and 44 living patients. We briefly describe the biological motivation for the data set.
² http://archive.ics.uci.edu/ml/datasets/Echocardiogram (last accessed January 2011).
Attribute information:
– survival – the number of months the patient survived (has survived, if the patient is still alive). Because all the patients had their heart attacks at different times, it is possible that some patients have survived less than one year but are still alive. Check the second variable to confirm this. Such patients cannot be used for the prediction task mentioned above; to be ignored,
– still-alive – a binary variable. 0 = dead at end of survival period, 1 = still alive; to be ignored,
– age-at-heart-attack (Age) – age in years when the heart attack occurred,
– pericardial-effusion (Pericardial) – binary. Pericardial effusion is fluid around the heart. 0 = no fluid, 1 = fluid,
– fractional-shortening (Fractional) – a measure of contractility around the heart; lower numbers are increasingly abnormal,
– epss (Epss) – E-point septal separation, another measure of contractility. Larger numbers are increasingly abnormal,
– lvdd (Lvdd) – left ventricular end-diastolic dimension. This is a measure of the size of the heart at end-diastole. Large hearts tend to be sick hearts,
– wall-motion-score (WallScore) – a measure of how the segments of the left ventricle are moving,
– wall-motion-index (WallIndex) – equals wall-motion-score divided by the number of segments seen. Usually 12-13 segments are seen in an echocardiogram. Use this variable instead of the wall motion score,
– mult (MULT) – a derived variable,
– name – the name of the patient; to be ignored,
– group – meaningless; to be ignored,
– alive-at-1 (Alive) – Boolean-valued. Derived from the first two attributes. 0 means the patient was either dead after 1 year or had been followed for less than 1 year. 1 means the patient was alive at 1 year.
Note that only the continuous attributes of the ECG dataset are fuzzified. There are missing values in the analysed dataset. We used the simplest approach to handle them, namely ignoring the examples with unknown attribute values [10].
3.2 Comparison of Different Methods
FIA was compared with several different machine learning methods, which we briefly list here.
– C4.5. C4.5 builds decision trees from a set of training data using the concept of information entropy. It is based on ID3 (Iterative Dichotomiser 3). Both algorithms were proposed by Ross Quinlan [22].
– Meta END. The main idea of meta-classification is to represent the judgment of each classifier (SVM based) for each class as a feature vector, and then to classify again in the new feature space. The final decision is made by the meta-classifiers instead of just linearly combining each classifier's judgment [13].
– Naïve Bayes. The Naïve Bayes algorithm is based on conditional probabilities. It uses Bayes' Theorem, a formula that calculates a probability by counting the frequency of values and combinations of values in the historical data [23]. It is naive because it assumes attribute independence.
– KStar. K* is an instance-based classifier using an entropic distance measure. It provides a consistent approach to the handling of symbolic attributes, real-valued attributes and missing values [4].
– ANN. A classifier that uses backpropagation to classify instances.
3.3 Criteria and Results of Comparison
In our experiments, we randomly split the UCI cardiac arrhythmia database into 10 folds and used 10-fold cross validation to determine the classification accuracy. To reduce bias in evaluating the performance, we calculated the average classification accuracy over 10 runs of 10-fold cross validation. The statistical analysis was based on the area under the Receiver Operating Characteristic (ROC) curve [9]. Roughly speaking, a ROC graph is a plot of the fraction of negative examples misclassified, the false positive rate (fpr), on the x axis against the fraction of positive examples correctly classified, the true positive rate (tpr), on the y axis:
$$fpr \approx \frac{\text{negatives incorrectly classified}}{\text{total negatives}}, \qquad tpr \approx \frac{\text{positives correctly classified}}{\text{total positives}}$$
Expressing fpr and tpr in terms of true/false positives/negatives we obtain
$$fpr = \frac{fp}{fp + tn} \tag{3}$$
$$tpr = \frac{tp}{tp + fn} \tag{4}$$
The classification accuracies for the datasets are measured using the equation
$$Acc = \frac{tp + tn}{tp + fn + fp + tn} \tag{5}$$
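Equations (3) to (5) can be computed directly from the four confusion-matrix counts. The following is a minimal sketch; the function name `rates` and the example counts are hypothetical, and non-empty positive and negative classes are assumed.

```python
def rates(tp: int, fn: int, fp: int, tn: int):
    """Return (fpr, tpr, acc) as defined in Eqs. (3), (4) and (5).

    Assumes at least one positive and one negative example,
    so that no denominator is zero.
    """
    fpr = fp / (fp + tn)                   # Eq. (3)
    tpr = tp / (tp + fn)                   # Eq. (4)
    acc = (tp + tn) / (tp + fn + fp + tn)  # Eq. (5)
    return fpr, tpr, acc


if __name__ == "__main__":
    # Hypothetical counts, not taken from the experiments in the paper.
    print(rates(tp=40, fn=10, fp=5, tn=45))
```

Each classifier threshold yields one (fpr, tpr) point; sweeping the threshold traces the ROC curve.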
The area under the ROC curve is equivalent to the Mann-Whitney U statistic [3] normalized by the number of possible pairings of positive and negative values. The area under the ROC curve (AUC) represents the probability that a randomly chosen positive example is rated (ranked) with greater suspicion than a randomly chosen negative example. To calculate the AUC for FIA and IFRAIS, AUCCalculator was used [7].

Table 1. Accuracy rate (Acc) and the area under the ROC curve (AUC) over the different classifiers

Classifier   Acc (%)   AUC
FIA          91.94     0.95
IFRAIS       80.65     0.75
C4.5         71.62     0.64
Meta         75.68     0.63
Naïve        75.67     0.68
K*           63.51     0.55
ANN          72.87     0.59
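The equivalence between the AUC and the normalized Mann-Whitney U statistic can be illustrated with a short sketch. This is not the AUCCalculator tool used in the experiments; the function name `auc_mann_whitney`, the score and label lists, and the convention of counting ties as one half are assumptions made for the example.

```python
def auc_mann_whitney(scores, labels):
    """AUC as the Mann-Whitney U statistic divided by the number of
    positive/negative pairs. labels use 1 for positive and 0 for negative.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    u = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                u += 1.0   # positive ranked above negative
            elif p == n:
                u += 0.5   # assumption: ties count as one half
    return u / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical classifier scores, not results from the paper.
    scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
    labels = [1, 1, 0, 1, 0, 0]
    print(auc_mann_whitney(scores, labels))
```

The quadratic pair loop is chosen for clarity; a rank-based computation gives the same value in O(n log n) time.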
Table 1 shows, for each classifier (algorithm), the average accuracy rate Acc and the area under the ROC curve AUC. As shown in Table 1, FIA obtained better results than any other algorithm presented in the table. In addition, ROC curves for patients dead after 1 year (the attribute Alive equals 0) are shown in Fig. 1. Table 2 contains an exemplary set of the rules inferred by FIA for the ECG dataset.

[Figure: ROC curves with fpr on the x axis and tpr on the y axis, both from 0 to 1, for FIA, IFRAIS, C4.5, Meta, Naive, KStar and ANN]
Fig. 1. ROC curves for compared classifiers

Table 2. The inferred rule set for ECG

No.  Rule
1.   IF WallIndex=0 and Mult!=1 THEN class = 0
2.   IF Age=5 and WallScore!=3 and WallIndex!=0 THEN class = 0
3.   IF WallScore!=4 and WallIndex!=0 and Mult!=3 THEN class = 1
4 Conclusion
A new fuzzy immune rule-based classification system was applied to the medical diagnosis of a cardiovascular disease. The performance of the system, in terms of classification accuracy, ROC curves, and area under the ROC curve, was compared with traditional classifier schemes. The FIA algorithm outperforms the compared approaches, tested by 10-fold cross validation, in terms of both accuracy and the AUC statistic. Nevertheless, there are several possible directions for future work. A drawback of FIA is its computational complexity, which is higher than that of the compared approaches, mainly because of the time-consuming immune-based partition learning. At present, we are working on a way to make it more efficient by replacing the immune approach with another heuristic method, such as granular computing [18]. The use of more sophisticated membership functions could yield further improvement.
References
1. Alatas, B., Akin, E.: Mining Fuzzy Classification Rules Using an Artificial Immune System with Boosting. In: Eder, J., et al. (eds.) ADBIS 2005. LNCS, vol. 3631, pp. 283–293. Springer, Heidelberg (2005)
2. Alves, R.T., et al.: An artificial immune system for fuzzy-rule induction in data mining. In: Yao, X., et al. (eds.) PPSN 2004. LNCS, vol. 3242, pp. 1011–1020. Springer, Heidelberg (2004)
3. Bamber, D.: The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology 12, 387–415 (1975)
4. Cleary, J.G., Trigg, L.E.: K*: An Instance-based Learner Using an Entropic Distance Measure. In: Proceedings of the 12th International Conference on Machine Learning, pp. 108–114 (1995)
5. Cox, E.: The Fuzzy Systems Handbook: A Practitioner's Guide to Building, Using, and Maintaining Fuzzy Systems. Academic Press, Cambridge (1994)
6. Dasgupta, D. (ed.): Artificial Immune Systems and Their Applications. Springer, Heidelberg (1999)
7. Davis, J., Goadrich, M.: The Relationship Between Precision-Recall and ROC Curves. In: 23rd International Conference on Machine Learning (ICML), Pittsburgh, PA, USA, June 26-28 (2006)
8. Gonzales, F.A., Dasgupta, D.: An Immunogenetic Technique to Detect Anomalies in Network Traffic. In: Proceedings of Genetic and Evolutionary Computation, pp. 1081–1088. Morgan Kaufmann, San Francisco (2002)
9. Green, D.M., Swets, J.A.: Signal detection theory and psychophysics. John Wiley & Sons Inc., New York (1966)
10. Grzymala-Busse, J.W., Hu, M.: A Comparison of Several Approaches to Missing Attribute Values in Data Mining. In: Ziarko, W.P., Yao, Y. (eds.) RSCTC 2000. LNCS (LNAI), vol. 2005, pp. 378–385. Springer, Heidelberg (2001)
11. Güvenir, H.A., Acar, B.: Feature selection using a genetic algorithm for the detection of abnormal ECG recordings. In: Proceedings of the World Conference on Systemics, Cybernetics and Informatics (ISAS/SCI 2001), Orlando, FL, pp. 437–442 (2001)
12. Kalina, A., Mężyk, E., Unold, O.: Accuracy Boosting Induction of Fuzzy Rules with Artificial Immune Systems. In: Proceedings of the International Multiconference on Computer Science and Information Technology, Wisla, Poland, pp. 155–159 (2008)
13. Lin, W., Jin, R., Hauptmann, A.: Meta-classification of Multimedia Classifiers. In: International Workshop on Knowledge Discovery in Multimedia and Complex Data, Taipei, Taiwan (2002)
14. Marsala, C.: Fuzzy Partitioning Methods, Granular Computing: An Emerging Paradigm, pp. 163–186. Physica-Verlag GmbH, Heidelberg (2001)
15. Mężyk, E., Unold, O.: Speed Boosting Induction of Fuzzy Rules with Artificial Immune Systems. In: Mastorakis, E.M., et al. (eds.) Proc. of the 12th WSEAS International Conference on Systems, Heraklion, Greece, July 22-24, pp. 704–706 (2008)
16. Mężyk, E., Unold, O.: Improving Mining Fuzzy Rules with Artificial Immune Systems by Uniform Population. In: Mehnen, J., et al. (eds.) Applications of Soft Computing. AISC, vol. 58, pp. 295–303. Springer, Heidelberg (2009)
17. Nasraoui, O., Gonzales, F., Dasgupta, D.: The Fuzzy Artificial Immune System: Motivations, Basic Concepts, and Application to Clustering and Web Profiling. In: Proceedings of IEEE International Conference on Fuzzy Systems, pp. 711–716 (2002)
18. Pedrycz, W.: Granular Computing. Studies in Fuzziness and Soft Computing. Physica-Verlag, Heidelberg (2001)
19. Piatetsky-Shapiro, G., Frawley, W.J.: Knowledge discovery in databases. AAAI Press, Menlo Park (1991)
20. Polat, K., Sahan, S., Güneş, S.: A new method to medical diagnosis: Artificial immune recognition system (AIRS) with fuzzy weighted preprocessing and application to ECG arrhythmia. Expert Syst. Appl. 31(2), 264–269 (2006)
21. Polat, K., Akdemir, B., Güneş, S.: Computer aided diagnosis of ECG data on the least square support vector machine. Digital Signal Processing 18, 25–32 (2008)
22. Quinlan, J.R.: C4.5: Programs For Machine Learning. Morgan Kaufmann, San Mateo (1993)
23. Rish, I.: An empirical study of the naive Bayes classifier. In: IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence (2001)
24. Soman, T., Bobbie, P.O.: Classification of arrhythmia using machine learning techniques. In: Proc. of 4th International Conference on System Science and Engineering (ICOSSE), Copacabana, Rio de Janeiro, Brazil (2005)
25. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Mateo (2005)
26. Zolghadri Jahromi, M., Taheri, M.: A proposed method for learning rule weights in fuzzy rule-based classification systems. Fuzzy Sets and Systems 159, 449–459 (2008)
Adaptive Finite Automaton: A New Algebraic Approach
Reginaldo Inojosa Silva Filho and Ricardo Luis de Azevedo da Rocha
Computing Engineering Department, Engineering School of the University of São Paulo, Av. Luciano Gualberto s/n Trav 3, n.158, São Paulo - SP, Brazil
[email protected],
[email protected]
Abstract. The purpose of this paper is to present a new and better representation of the adaptive finite automaton and to show that both formulations, the original and the newly created one, have the same computational power. The original formulation of the adaptive finite automaton was explored, and a way was sought to overcome some difficulties found in [7] regarding its representation and the proofs about its computational power. Both formulations are then shown to be equivalent in representation and in computational power, but the new one has a highly simplified algebraic notation. The new formulation allows simpler theorem proofs and generalizations, as can be verified in the last section of the paper.
Keywords: Adaptivity, Automata Theory, Algebraic Formulation.
1 Introduction
The principal idea of automata theory led to the creation of more complex types of automata, with more general automaton behaviors [13]. In classical formal language theory, only automata models with an invariable internal structure were considered. In the adaptive automata model, the highest computational power is achieved when this assumption is relaxed. The term adaptive has a well-defined sense within self-modifying computational models. In this field, adaptive formalisms may be divided into two main categories: adaptive grammars [3],[1] and adaptive machines [11]. Adaptive automata belong to the second category, and their major characteristic is the ability, without the interference of any external agent, to decide to modify their own structure in response to some external input. There are many applications of the adaptive automata model [2],[5], and its computational power is equivalent to that of the Turing machine [9]. Despite these facts, the adaptive automata model is not fully formalized, and the purpose here is to present a new and better formalization of this machine model. To this end, the paper is structured as follows. The second section presents the notations and theoretical preliminaries as well as a brief description of the classical (original) adaptive finite automaton model. The third section describes the new formulation and the principal equivalence results in relation to the classical formulation. The last section presents the conclusion and further work to be developed.
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 275–284, 2011. © Springer-Verlag Berlin Heidelberg 2011
2 Notations and Technical Preliminaries
Several notations are required for an adequate discussion of the model presented. Let N = {0, 1, 2, ...} be the set of natural numbers. For a finite arbitrary indexed set $I = \{i_0, i_1, \dots, i_m\}$, two functions, the remove function and the expand function, are defined as
$$\mathrm{rem}(I, x) = \{ I - \{x\} : x \in I \} \tag{1}$$
$$\mathrm{insert}(I, x) = \{ i_0, i_1, \dots, i_{m+1} \} \quad \text{with } x \notin I \text{ and } i_{m+1} = x \tag{2}$$
If I is a typed set, the element inserted is of the same type as I. The Axiom of Choice [4] is assumed to hold. Thus, there is a choice function for every arbitrary set $\mathcal{U}$, in which a choice function is defined as $c(U) \in U$ for $U \in \mathcal{U}$. A non-empty sequence ϕ is an ordered list $(a_k, \dots, a_{n-1}, a_n)$ of abstract objects with [k, n] ⊆ N. Sequences are always denoted by the last letters of the Greek alphabet.
Let Σ be a non-empty, fixed but arbitrary, finite set of atomic symbols called the alphabet. Any generic symbol of Σ is represented by the first letters of the Greek alphabet, for example, α ∈ Σ. A string t over Σ is a finite concatenation of alphabet symbols, in which the length of t is denoted by |t| ∈ N. The empty word, denoted by ε, has length zero. The set of all possible strings over Σ is denoted Σ*, while the set of all non-empty strings over Σ is denoted Σ+. Any subset L ⊆ Σ* is called a language over Σ.
A nondeterministic finite state automaton without outputs over Σ is defined by the quintuple $M = (Q, q_0, E, \Sigma, \partial)$, in which Q is a finite, non-empty set of states and $\partial \subseteq Q \times (\Sigma \cup \{\varepsilon\}) \times Q$ is the automaton state-transition partial function, which can be expressed by the set $\partial = \{\delta_1, \dots, \delta_n\}$, in which $\delta_i = (q', \alpha, q'')$ for 1 ≤ i ≤ n, $\{q', q''\} \subseteq Q$ and $\alpha \in \Sigma \cup \{\varepsilon\}$. The set E ⊆ Q is the set of accepting states, while $q_0$ is the initial state. A scalar hierarchical structure [12] is indicated by the notation $a_n\, a_{n-1} \dots a_1\, a_0$, in which $a_i$ is part of $a_{i+1}$ for 0 ≤ i ≤ (n − 1).
2.1 Adaptive Automata
To initiate the discussion about the structural redefinition of the adaptive automata model, it is first necessary to summarize its classical formulation. In the adaptive automaton [7], there is a set of (optional) adaptive actions that change the behavior of a nondeterministic finite state automaton by modifying the set of rules defining it. This modification is performed by an adaptive function, which is executed by removing elements from, or inserting new elements into, the automaton transition function. The adaptive mechanism of the adaptive automaton is defined by attaching a pair of adaptive functions, B (before) and A (after), to the transitions of the subjacent non-adaptive automaton: one is performed before the transition takes place, and the other is performed after the transition is executed. Such an altered automaton transition is called an adaptive transition, and its notation is summarized below:
$$(q, \alpha) : B \rightarrow q' : A \tag{3}$$
in which $q, q' \in Q$; $\alpha \in \Sigma \cup \{\varepsilon\}$; and B and A are the adaptive functions.
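To make the notation of an adaptive transition more concrete, the following sketch models a transition $(q, \alpha) : B \rightarrow q' : A$ as a record with two optional callbacks. It is an informal illustration under several assumptions, not the classical formulation of [7]: the class `AdaptiveTransition`, the function `step`, the representation of the transition set as a Python list, and the rule of firing the first matching transition are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class AdaptiveTransition:
    source: str   # q
    symbol: str   # alpha; "" stands for the empty word
    target: str   # q'
    # B and A: optional functions that may modify the transition set.
    before: Optional[Callable[[List["AdaptiveTransition"]], None]] = None
    after: Optional[Callable[[List["AdaptiveTransition"]], None]] = None


def step(transitions: List[AdaptiveTransition], state: str, symbol: str):
    """Fire the first transition matching (state, symbol).

    Returns the new state, or None if no transition applies.
    """
    for t in list(transitions):            # iterate over a snapshot
        if t.source == state and t.symbol == symbol:
            if t.before is not None:
                t.before(transitions)      # B runs before the move
            new_state = t.target
            if t.after is not None:
                t.after(transitions)       # A runs after the move
            return new_state
    return None


def add_b_transition(transitions: List[AdaptiveTransition]) -> None:
    """Example adaptive function: enable (q0, b) -> q1 if not yet present."""
    new = AdaptiveTransition("q0", "b", "q1")
    if new not in transitions:
        transitions.append(new)


if __name__ == "__main__":
    delta = [AdaptiveTransition("q0", "a", "q0", after=add_b_transition)]
    print(step(delta, "q0", "b"))  # no transition on b yet
    print(step(delta, "q0", "a"))  # consuming a adds the b transition
    print(step(delta, "q0", "b"))  # now b is accepted
```

In this toy automaton the symbol b can only be consumed after at least one a has been read, a behavior that is produced by the automaton modifying its own transition set.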
In the general case, adaptive functions A and B are declared symbolically, apart from the finite state automaton. Each comprises a header (the name and the formal parameters of the adaptive function) and a body (declaration of names and elementary adaptive actions), and has the general form:

η(Ω) {
  declaration of names (optional)
  declaration of elementary adaptive actions (optional)
}

in which $\Omega = \{\varphi_1, \ldots, \varphi_n\}$ is the set of n arguments $\varphi_i$ (for $1 \le i \le n$) passed by a call-by-value strategy to the adaptive function named η. The arguments may assume any coherent value within the function body. The body is formed by the declaration of names and a set of elementary adaptive actions, responsible for the modifications to be performed. The declaration of names is a list of elements chosen to represent objects in the scope of the function body. Each declaration of names assumes the following form:

$g_1^*, g_2^*, \ldots, g_i^*;\ v_1, v_2, \ldots, v_j$

in which the names followed by an asterisk denote generators (new names for objects that do not belong to the underlying finite automaton), and the remaining names denote variables. Elementary adaptive actions specify the actual modifications to be imposed on the automaton. Any elementary adaptive action potentially has local variables, which are filled once by elementary inspection actions (defined below), and generators, which are filled once with new values (different from all values or variables defined in the model) when the adaptive action starts. Values are assigned only once to variables by elementary inspection-type and elimination-type actions (explained below); those objects then become read-only. Generators likewise receive unique values once, at the start of function execution, and remain read-only. Elementary adaptive actions can be of three types:

1. $?[(q, \alpha) : B \rightarrow q' : A]$: Inspection-type actions (introduced by a question mark in the usual notation) search the current set of transitions for those whose shape matches the given pattern. The elements to be inspected are replaced by variable names. These variables must be unique; they are filled by the inspection mechanism with the current value of the corresponding inspected unknown elements and become read-only thereafter.
2. $-[(q, \alpha) : B \rightarrow q' : A]$: Elimination-type actions (introduced by a minus sign in the usual notation) eliminate all transitions matching the given shape from the current set of transitions of the automaton.
3. $+[(q, \alpha) : B \rightarrow q' : A]$: Insertion-type actions (introduced by a plus sign in the usual notation) add a new transition to the current set of transitions, according to the specified shape.

The adaptive mechanism turns a finite automaton into an adaptive one by allowing its set of rules to change dynamically. The adaptive function is performed following the sequence below [7]:

a) the only values available for variables are those that came from the parameters in the function header,
b) the generators are filled,
c) the set of elementary adaptive actions of inspection is performed,
d) the set of elementary adaptive actions of deletion is performed,
e) the set of elementary adaptive actions of insertion is performed.

In the classical formulation, there is no particular restriction on the use of multiple variables in inspection and elimination elementary adaptive actions [7]. Additionally, there is no restriction on the types that variables can take: variables can assume the role of states or symbols in the transition pattern of an elementary adaptive action. Thus, the use of the variable concept is not clear. Adaptive functions were defined informally, using a pseudocode notation to describe their behavior. As a consequence, the adaptive automata model, despite its computational power and innovative paradigm, has ambiguous concepts, such as the nature and number of parameters allowed for adaptive functions and the use of variables [8]. As can be seen in the next section, dealing with these problems requires formalizing those definitions.
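The execution order a) to e) can be illustrated with a minimal Python sketch. It is not part of the classical formulation: the names `ANY`, `matches` and `apply_adaptive_function` are hypothetical, transitions are reduced to plain (source, symbol, target) triples without attached adaptive functions, and generators are simply resolved to fresh state names `g0`, `g1`, and so on.

```python
from itertools import count

ANY = object()  # placeholder for a variable (unknown component) in a pattern


def matches(pattern, transition):
    """A pattern matches when each component is ANY or equals the transition's."""
    return all(p is ANY or p == t for p, t in zip(pattern, transition))


def apply_adaptive_function(transitions, generators=(), inspect=(),
                            eliminate=(), insert=()):
    """Run one adaptive function over a set of (source, symbol, target) triples."""
    current = set(transitions)

    # (b) generators receive fresh names, distinct from every existing state
    used = {state for (src, _, dst) in current for state in (src, dst)}
    fresh, binding = count(), {}
    for g in generators:
        name = f"g{next(fresh)}"
        while name in used:
            name = f"g{next(fresh)}"
        used.add(name)
        binding[g] = name

    # (c) inspection: collect the transitions matching each pattern
    found = [{t for t in current if matches(p, t)} for p in inspect]

    # (d) elimination: drop every transition matching a pattern
    for p in eliminate:
        current = {t for t in current if not matches(p, t)}

    # (e) insertion: add the new transitions, with generators resolved
    for t in insert:
        current.add(tuple(binding.get(x, x) for x in t))

    return current, found


# Hypothetical usage: split the transition q0 --a--> q1 through a new state.
delta = {("q0", "a", "q1"), ("q1", "b", "q0")}
new_delta, found = apply_adaptive_function(
    delta,
    generators=["n*"],
    inspect=[("q0", ANY, ANY)],
    eliminate=[("q0", "a", ANY)],
    insert=[("q0", "a", "n*"), ("n*", "a", "q1")],
)
```

Because inspection runs before elimination, `found` still reports the transition that the same call removes.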
3 The Adaptive Automaton New Formulation
The task of rewriting the model of the adaptive automaton goes beyond finding the solutions for the problems mentioned above. First, as the name suggests, the use of the algebraic theoretical approach can be viewed as an attempt to understand certain properties (such as types of possible modifications, function composition, etc.) of the automaton transformations [6]. In this context, the concept of adaptive automata was inserted in a more general concept of adaptive algebra. Second, in turn, an algebraic extension of the automaton concept implies a new expression for it. Hence, the traditional elements of the automata theory (automata configuration, step function, etc.) were brought into this reformulation. The following concepts start this idea and provide the basis for the embedding of the adaptive automata.

Definition 1 ($\mathcal{M}^0$-algebra). Given the set $\mathcal{M}^0$ of all non-deterministic finite state automata over an alphabet Σ, an $\mathcal{M}^0$-algebra is defined as $\mathcal{M} = (\mathcal{M}^0, F)$ in which $F = (f_1, f_2, \ldots, f_n)$ for $f_i : \mathcal{M}^0 \rightarrow \mathcal{M}^0$.

Given a non-deterministic finite state automaton $M^0 \in \mathcal{M}^0$ and its state-transition function ∂, any transition $\delta = (q', \alpha, q'')$ in which $\delta \in \partial$ is called a proper transition. When $\delta \notin \partial$, it is called a foreign transition. For any element $M^0 \in \mathcal{M}^0$, the functions defined below:

$$S(q_i, a, q_j, M^0) = \{(q_i, a, q_j) : (q_i, a, q_j) \in \partial\} \tag{4}$$
$$S(q_i, q_j, M^0) = \{(q_i, \alpha, q_j) : (q_i, \alpha, q_j) \in \partial\} \tag{5}$$
$$S(q_i, M^0) = \{(q_i, \alpha, q'') : (q_i, \alpha, q'') \in \partial\} \tag{6}$$
$$S(q_j, M^0) = \{(q', \alpha, q_j) : (q', \alpha, q_j) \in \partial\} \tag{7}$$
$$S(q_i, a, M^0) = \{(q_i, a, q'') : (q_i, a, q'') \in \partial\} \tag{8}$$
$$S(q_j, a, M^0) = \{(q', a, q_j) : (q', a, q_j) \in \partial\} \tag{9}$$
$$S(a, M^0) = \{(q', a, q'') : (q', a, q'') \in \partial\} \tag{10}$$

are called proper-search operations. For any element $M^0 \in \mathcal{M}^0$, the function defined as:

$$N(\delta, M^0) = \begin{cases} \delta & \Leftrightarrow\ \delta \notin \partial \\ \emptyset & \Leftrightarrow\ \delta \in \partial \end{cases} \tag{11}$$

is the non-pertinence identification operation.

A sequence $\delta_{pro} = (\delta_{pro_1}, \ldots, \delta_{pro_m})$ of proper transitions for an automaton $M^0$ is called a positive sequence. Any positive sequence of $M^0$ in which all elements are proper-search operations is called a positive pattern sequence and is designated by $\hat{\delta}_{pro}$. On the other hand, a sequence $\delta_{for} = (\delta_{for_1}, \ldots, \delta_{for_n})$ of foreign transitions for $M^0$ is called a negative sequence. Analogously to the positive pattern sequence, a negative sequence in which all elements are non-pertinence identification operations $N(\delta, M^0)$ is called a negative pattern sequence and is designated by $\hat{\delta}_{for}$. Given a negative and a positive pattern sequence for an element $M^0 \in \mathcal{M}^0$, the sequence $\varphi = (\hat{\delta}_{for}, \hat{\delta}_{pro})$ is called a transformation pair.

3.1 Adaptive Algebra
Utilizing the proper transition and foreign transition concepts, as well as the pattern sequences and the definitions of (1) and (2), for any element $M^0 \in \mathcal{M}^0$, the δ-remove operation and the δ-insertion operation are defined, respectively, by the operators:

$$f^-_{\delta_{pro_k}} M^0 = f^-(\delta_{pro_k}, M^0) = (Q, q_0, E, \Sigma, rem(\partial, \delta_{pro_k})) \quad \text{with } \delta_{pro_k} \in \hat{\delta}_{pro} \tag{12}$$
$$f^+_{\delta_{for_k}} M^0 = f^+(\delta_{for_k}, M^0) = (insert(insert(Q, q'), q''), q_0, E, \Sigma, insert(\partial, \delta_{for_k})) \quad \text{with } \delta_{for_k} \in \hat{\delta}_{for} \tag{13}$$

Now, it is possible to introduce the concept of adaptive algebra.

Definition 2 (Adaptive Algebra). Adaptive Algebra is the $\mathcal{M}^0$-algebra in which $F = (f^-, f^+)$.

3.2 Adaptive Transformations
Let $M^0 \in \mathcal{M}^0$ be an element and $\varphi = (\hat{\delta}_{for}, \hat{\delta}_{pro})$ its transformation pair, formed by a positive pattern sequence $\hat{\delta}_{pro}$ and a negative pattern sequence $\hat{\delta}_{for}$, in which $\hat{\delta}_{pro} = (\delta_{pro_1}, \ldots, \delta_{pro_m})$ and $\hat{\delta}_{for} = (\delta_{for_1}, \ldots, \delta_{for_n})$. The Adaptive Function is defined as:

$$F_\varphi M^0 = F(\varphi, M^0) = F^-_{\hat{\delta}_{pro}} F^+_{\hat{\delta}_{for}} M^0 \tag{14}$$

in which

$$F^-_{\hat{\delta}_{pro}} M^0 = F^-(\hat{\delta}_{pro}, M^0) = (f^-_{\delta_{pro_m}} \circ f^-_{\delta_{pro_{m-1}}} \circ \ldots \circ f^-_{\delta_{pro_2}} \circ f^-_{\delta_{pro_1}}) M^0 \tag{15}$$
$$F^+_{\hat{\delta}_{for}} M^0 = F^+(\hat{\delta}_{for}, M^0) = (f^+_{\delta_{for_n}} \circ f^+_{\delta_{for_{n-1}}} \circ \ldots \circ f^+_{\delta_{for_2}} \circ f^+_{\delta_{for_1}}) M^0 \tag{16}$$

are the remove transformation and the insertion transformation, respectively. The adaptive algebra operations provide the basis for building more complex operators. These operators will be used in the redefinition of adaptive automata. There are some assumptions to be considered:

– For a transition $\delta = (q', \alpha, q'')$, it is possible to have $q' \in Q$ and/or $q'' \in Q$ for $Q \in M^0$ while still keeping the condition $\delta = (q', \alpha, q'') \notin \partial$ for $\partial \in M^0$.
– $F_\varphi M^0 = F^+_{\hat{\delta}_{for}} M^0$ when $\hat{\delta}_{pro} \in \varphi$ is an empty sequence.
– $F_\varphi M^0 = F^-_{\hat{\delta}_{pro}} M^0$ when $\hat{\delta}_{for} \in \varphi$ is an empty sequence.
– In the particular situation in which both sequences, $\hat{\delta}_{pro}$ and $\hat{\delta}_{for}$, are empty, the first-order transformation pair is called void and is represented by $\varphi_\emptyset$. In this case, $F_{\varphi_\emptyset} M^0 = M^0$.

3.3 Adaptive Automata
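Since the definitions of this section build on the operators of Sects. 3.1 and 3.2, a small Python sketch of (12) to (16) may help fix ideas. It is only an illustration under stated assumptions: the `NFA` tuple mirrors the component order $(Q, q_0, E, \Sigma, \partial)$ of (12), pattern sequences are taken as already resolved to concrete transitions, and the names `f_minus`, `f_plus` and `adaptive_function` are hypothetical.

```python
from functools import reduce
from typing import NamedTuple


class NFA(NamedTuple):
    states: frozenset
    initial: str
    finals: frozenset
    alphabet: frozenset
    delta: frozenset  # set of (source, symbol, target) triples


def f_minus(d, m):
    """delta-remove (12): drop the proper transition d; Q is left untouched."""
    return m._replace(delta=m.delta - {d})


def f_plus(d, m):
    """delta-insertion (13): add the foreign transition d and both end states."""
    src, _, dst = d
    return m._replace(states=m.states | {src, dst}, delta=m.delta | {d})


def adaptive_function(phi, m):
    """F_phi M^0 (14): the rightmost operator acts first, so insert, then remove."""
    foreign, proper = phi  # phi = (negative sequence, positive sequence)
    m = reduce(lambda acc, d: f_plus(d, acc), foreign, m)    # (16), d_1 first
    return reduce(lambda acc, d: f_minus(d, acc), proper, m)  # (15), d_1 first


# Hypothetical device with a single transition q0 --a--> q1.
m0 = NFA(frozenset({"q0", "q1"}), "q0", frozenset({"q1"}),
         frozenset({"a", "b"}), frozenset({("q0", "a", "q1")}))
phi = ((("q1", "b", "q2"),), (("q0", "a", "q1"),))
m1 = adaptive_function(phi, m0)
```

With the void pair `((), ())` the function returns its argument unchanged, which matches the last assumption of Sect. 3.2.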
The new formulation for the adaptive automata model is more than a simple algebraic reinterpretation of the old definition. The purpose of this section is to show the most important elements of the automata theory within an adaptive model.

Definition 3 (Adaptive Automata). A first-order adaptive automaton is the quadruple $M^1 = (M^0, \Phi, \varphi_\emptyset, \partial^1)$, in which $M^0 \in \mathcal{M}^0$ is called the subjacent device. The set Φ of transformation pairs is called the adaptive behavior set. The element $\varphi_\emptyset \in \Phi$ is a void transformation pair called the null behavior. The set $\partial^1$ is the adaptive transition function. Each element of $\partial^1$ takes the form $\delta^1_{i,k} = (\delta_i, M^1\, F_{\varphi_k}(M^0))$, for $\varphi_k \in \Phi$ and $\delta_i \in \partial$, in which ∂ is the state-transition function of the subjacent device.

The configuration of a first-order adaptive automaton $M^1$ is the pair $(q', t) \in Q \times \Sigma^*$ in which Q belongs to a subjacent device of $M^1$. The initial configuration is represented by $(q_0, t)$ in which $q_0$ is the initial state of the subjacent device before the execution of any adaptive transition of $M^1$. For the configuration $(q', t) \in Q \times \Sigma^*$, the one-step function shows how the adaptive automaton changes from one configuration to another:

$$(q', t) \vdash_{[F_{\varphi_k} M^0]} (q'', w) \Leftrightarrow \exists \alpha \in \Sigma : \alpha w = t \tag{17}$$

in which $q''$ is a state of $F_{\varphi_k} M^0$ and $((q', \alpha, q''), M^1\, F_{\varphi_k}(M^0)) \in \partial^1$ for $\varphi_k \in \Phi$. For a $j \in \mathbb{N}$, the closure of the one-move function for an adaptive automaton is defined as:

$$(q', t) \vdash^*_{[F_{\varphi_{k_j}} \ldots F_{\varphi_{k_2}} F_{\varphi_{k_1}} M^0]} (q'', w) \tag{18}$$

iff $(q' = q'')$ and $(w = t)$, or

1. $t = a_0 a_1 \ldots a_j w$ with $a_i \in \Sigma$ for $0 \le i \le j$
2. $\exists (\varphi_{k_1}, \varphi_{k_2}, \ldots, \varphi_{k_{j+1}})$ with $\varphi_{k_i} \in \Phi$ for $0 \le i \le j$
3. $\exists\, p_1, p_2, \ldots, p_j \in Q$ such that:

$$(q', t) \vdash_{[F_{\varphi_{k_1}}(M^0)]} (p_1, a_1 a_2 a_3 \ldots a_j w) \vdash_{[F_{\varphi_{k_2}} F_{\varphi_{k_1}}(M^0)]} (p_2, a_2 a_3 \ldots a_j w) \vdash_{[F_{\varphi_{k_3}} F_{\varphi_{k_2}} F_{\varphi_{k_1}}(M^0)]} \ldots \vdash_{[F_{\varphi_{k_j}} \ldots F_{\varphi_{k_2}} F_{\varphi_{k_1}}(M^0)]} (p_j, a_j w) \vdash_{[F_{\varphi_{k_{j+1}}} F_{\varphi_{k_j}} \ldots F_{\varphi_{k_2}} F_{\varphi_{k_1}}(M^0)]} (q'', w)$$

4 New Formulation Equivalence
Finally, the equivalence between the two presentations (the classic and the algebraic) will be proved. Proposition 1 from [10] shows that any adaptive transition that uses adaptive actions before and after the underlying device transition can be replaced by two consecutive transitions using only before (or only after) adaptive actions. Thus, each transition has a unique adaptive function, which simplifies the notation. The Lemma below states this result.

Lemma 1. The adaptive actions may be placed in order to change an automaton (a) before the rule without using after-rule adaptive actions, or (b) after the rule (transition) without using before-rule adaptive actions.

Lemma 1, in conjunction with Lemmas 2 and 3, shows that any transition of an adaptive finite-state automaton constructed in the original (classical) way [6] can be mapped to a transition using the new representation.

Lemma 2. Any finite state automaton modification performed by the classical elementary adaptive actions (inspection, elimination or insertion) can, alternatively, be performed by proper-search operations, the δ-remove operation or the δ-insertion operation, respectively.

Proof (SKETCH). The proof of Lemma 2 is not difficult but lengthy. For space reasons, only the proof sketch will be shown here. All inspection actions can be replaced with proper-search functions that, by definition, perform exactly the same type of search in the automaton structure. The equivalence of the elimination action with the δ-remove operation is immediate from the definitions, but the equivalence between the insertion action and the δ-insertion operation is more complex: if the insertion action inserts an internal transition (without an adaptive action), then the equivalence is immediate from the definitions, too; however, if the added transition is an adaptive one, the insertion action needs two δ-insertion operations for an equivalent modification effect. Without loss of generality, consider that all the classical adaptive actions must be defined a priori; then, in the new model, they must also be available as empty transitions disconnected from the initial model. Two δ-insertion operations then connect the adaptive action to the proper added transition. Thus, the order of execution of the new automaton transition and the adaptive action is preserved by the adaptive automaton transition function.
Lemma 3. Each adaptive action $A_i$, defined using the original formulation, may be represented by the (new) adaptive function $F_{\varphi_i} M^0$.

Proof (by the use of Lemmas 1 and 2). By Lemma 1, it is possible to transform any adaptive finite automaton into another one that makes use of adaptive actions only after the rule has occurred. It is clear from the definitions that any adaptive action $A_i$ is composed of lists of elementary adaptive actions. Let $A_i$ be an adaptive action described by the three sets of elementary adaptive actions (inspection, deletion and insertion elementary actions). Using Lemma 2, all the elementary actions can be mapped to the proper-search operations, the δ-remove operation and the δ-insertion operation, respectively, of the new formulation. Thus, there exists an adaptive function $F_{\varphi_i} M^0$ that, by Lemma 2, performs the same tasks as the classical adaptive action $A_i$. Hence, the (new) adaptive function $F_{\varphi_i} M^0$ is computationally equivalent to the adaptive action $A_i$.
It remains to be proven that both representational formalisms, the original one and the newer one, are in fact equivalent and, thus, that the new formalism can replace the original (traditional) one without causing any harm to the theory.

Theorem 1. Any adaptive finite-state automaton described in the original form can be replaced by the new form without destroying its properties.

Proof (by induction on the number of adaptive transitions).

Base of the induction: By Lemma 3, for the classical adaptive transition $(q', \alpha) \rightarrow q'' : A_i$, there is an adaptive function $F_{\varphi_i}$ such that $(q', \alpha t) \vdash_{[F_{\varphi_i} M^0]} (q'', t)$, and action A is mapped directly.

Inductive hypothesis: Suppose that there is a sequence of n transitions, with 0 ≤ n ≤ :

$(q^{k}, \alpha^{k}) \rightarrow q^{k+1} : A_j$
...
$(q^{k+n}, \alpha^{k+n}) \rightarrow q^{k+(n+1)} : A_{j+n}$

in the classical formulation; then there is a sequence $F_{\varphi_{j+n}} \ldots F_{\varphi_j} M^0$ of adaptive functions of the new formulation such that each $F_{\varphi_i}$, for $1 \le i \le n$, is equivalent to $A_i$ and

$$(q, \alpha^{k} \ldots \alpha^{(k+n)} t) \vdash^*_{[F_{\varphi_{j+n}} \ldots F_{\varphi_j} M^0]} (q^{k+(n+1)}, t).$$

Inductive step: Suppose that there is a sequence of n + 1 adaptive transitions of the classical formulation:

$(q, \alpha) \rightarrow q' : A_1$
$(q', \alpha') \rightarrow q'' : A_2$
...
$(q^{(n-1)}, \alpha^{(n-1)}) \rightarrow q^{(n)} : A_n$
$(q^{(n)}, \alpha^{(n)}) \rightarrow q^{(n+1)} : A_{n+1}$

so, by the inductive hypothesis, there is a sequence of n transitions

$$(q, \alpha \alpha' \ldots \alpha^{(n-1)} t) \vdash^*_{[F_{\varphi_n} \ldots F_{\varphi_2} F_{\varphi_1} M^0]} (q^{(n)}, t)$$

which performs the original n-transition sequence. Based on this sequence, it is possible to build $(q^{(n)}, \alpha w) \vdash_{[F_{\varphi_{n+1}} M^0]} (q^{(n+1)}, w)$, which performs the same task as the original $A_{n+1}$. Thus, combining those results, it is clear that the original (n + 1)-transition sequence is properly mapped by an (n + 1)-transition sequence of the new formulation.
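The chain of configurations used in the proof, in which each move first applies the adaptive function attached to the chosen transition and then changes state, can be sketched as follows. This is an illustrative reading of (17) and (18), not the authors' construction: the device is reduced to its set of transitions, `behaviour` is a hypothetical map from a transition to its transformation pair $(\hat{\delta}_{for}, \hat{\delta}_{pro})$ with the void pair as default, and non-determinism is resolved by taking the first matching transition in sorted order.

```python
def step(device, state, symbol, behaviour):
    """One move: pick a transition, apply its transformation pair, then move."""
    for transition in sorted(device):
        src, sym, dst = transition
        if src == state and sym == symbol:
            insertions, removals = behaviour.get(transition, ((), ()))
            # F_phi = F^- F^+ : insert the foreign transitions, then remove
            device = (device | set(insertions)) - set(removals)
            return device, dst
    return None  # no transition leaves `state` on `symbol`


def run(device, state, word, behaviour):
    """Closure of `step`: the device accumulates every transformation applied."""
    for symbol in word:
        moved = step(device, state, symbol, behaviour)
        if moved is None:
            return None
        device, state = moved
    return device, state


# Hypothetical example: reading 'a' in q0 teaches the device a 'b' transition.
delta0 = {("q0", "a", "q0")}
behaviour = {("q0", "a", "q0"): ((("q0", "b", "q1"),), ())}
result = run(delta0, "q0", "ab", behaviour)
```

Here `run(delta0, "q0", "b", behaviour)` fails, because the `b` transition only exists after the first adaptive move has been executed.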
5 Conclusion
The adaptive function itself, expressed by the transformation $F_\varphi M^0$, restricts the nature and number of parameters allowed for adaptive actions to only two: the underlying device and the transformation pair. Thus, using the Adaptive Algebra to redefine the adaptive function concept allowed the resolution of the two problems mentioned in subsection 2.1 (and in [7]), which was the major objective of this work. As collateral results, the adaptive automaton notation became more compact and sharper. The inclusion of an adaptive algebra opened an entirely new branch for the study of adaptive technology. There are now many interesting questions to be analyzed, ranging from a more extensive study of the algebraic properties of the adaptive operations to the topological characteristics of the adaptive automata space, as well as the connections to other areas, such as computational learning theory.

5.1 Future Work
With respect to the essential aspects of the adaptive automata space, there are various topics to be explored. The first one, which is going to appear in a forthcoming paper, consists in verifying the possibility to define a higher order adaptive automaton. In this case, the subjacent device is itself an adaptive automaton.
Acknowledgment The work reported here received support through FAPESP grant 2010/09586-0.
References

1. Carmi, A.: Adapser: An LALR(1) adaptive parser. In: The Israeli Workshop on Programming Languages & Development Environments, Haifa, Israel (July 2002)
2. de Sousa, M., Hirakawa, A.: Robotic mapping and navigation in unknown environments using adaptive automata. In: Adaptive and Natural Computing Algorithms, Part III, pp. 345–348. Springer, Heidelberg (2005)
3. Jackson, Q.T.: Adapting to Babel: Adaptivity and Context-Sensitivity in Parsing: From $a^n b^n c^n$ to RNA. Ibis Publications (2006)
4. Jech, T.J.: About the Axiom of Choice. In: Handbook of Mathematical Logic, pp. 345–370. North-Holland, Amsterdam (1977)
5. Neto, J.J., Silva, P.S.M.: An adaptive framework for the design of software specification languages. In: Adaptive and Natural Computing Algorithms, Part III, pp. 349–352. Springer, Heidelberg (2005)
6. Mikolajczak, B. (ed.): Algebraic and Structural Automata Theory. North-Holland, Amsterdam (1991)
7. Neto, J.J., Pariente, C.A.B.: Adaptive Automata - A Revisited Proposal. In: Champarnaud, J.-M., Maurel, D. (eds.) CIAA 2002. LNCS, vol. 2608, pp. 158–168. Springer, Heidelberg (2003)
8. de Azevedo da Rocha, R.L.: An Attempt to Express the Semantics of the Adaptive Devices. In: Advances in Technological Applications of Logical and Intelligent Systems - Selected Papers from the Sixth Congress on Logic Applied to Technology. Frontiers in Artificial Intelligence and Applications, vol. 186, pp. 13–27. IOS Press, Amsterdam (2009)
9. de Azevedo da Rocha, R.L., Neto, J.J.: Adaptive automaton, limits and complexity compared to the Turing machine - in Portuguese. In: Proceedings of the I LAPTEC, pp. 33–48 (October 2000)
10. de Azevedo da Rocha, R.L., Neto, J.J.: An adaptive finite-state automata application to the problem of reducing the number of states in approximate string matching. In: XI Congreso Argentino de Ciencias de la Computación - CACIC 2005, Concordia, Entre Ríos, Argentina, pp. 17–21 (October 2005)
11. Rubinstein, R.S., Shutt, J.N.: Self-modifying finite automata. In: IFIP Congress, vol. 1, pp. 493–498 (1994)
12. Salthe, S., Matsuno, K.: Self-organization in hierarchical systems. Journal of Social and Evolutionary Systems 18(4), 327–338 (1995)
13. Sragovich, V.G.: Mathematical Theory of Adaptive Control. World Scientific, Singapore (2006)
Cryptanalytic Attack on the Self-Shrinking Sequence Generator

Maria Eugenia Pazo-Robles¹ and Amparo Fúster-Sabater²

¹ Instituto Tecnológico de Buenos Aires, Av. E. Madero 399, Buenos Aires, Argentina
² Institute of Applied Physics, C.S.I.C., Serrano 144, 28006 Madrid, Spain
Abstract. In this paper, a cryptanalysis of the Self-Shrinking Generator, a well-known sequence generator with cryptographic application, is presented. An improvement to the Guess-and-Determine cryptanalytic technique is proposed. Numerical results that improve on other cryptanalyses of this generator are given. In particular, complexities in the order of $O(2^{0.2L})$ for the amount of intercepted sequence, $O(L^2)$ for computer memory and $O(2^{0.5L})$ for execution time (L being the length of the generator register) are obtained. In addition, specific hardware for a practical cryptanalysis is proposed.

Keywords: sequence generator, cryptanalytic attack, Guess-and-Determine technique, cryptography.
1 Introduction
Symmetric cryptography is commonly classified into block ciphers and stream ciphers. The former encrypt a block of bits of the original message (plaintext) into a block of ciphered bits (ciphertext), while stream ciphers encrypt bits individually under a time-varying transformation (the cryptographic function). At present, stream ciphers are the fastest among all encryption methods. Moreover, stream ciphers require fewer resources for implementation, e.g. code size or chip area, than block ciphers, and they are attractive for use in constrained environments such as cell phones (GSM technologies with the algorithms A5/1 and A5/2, see [7]), Bluetooth (algorithm E0, see [1]) or Microsoft Word and Excel (algorithm RC4, see [13]).

This work was supported in part by CDTI (Spain) and the companies INDRA, Unión Fenosa, Tecnobit, Visual Tools, Brainstorm, SAC and Technosafe under Project Cenit-HESPERIA; by Ministry of Science and Innovation and European FEDER Fund under Project TIN2008-02236/TSI.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 285–294, 2011. © Springer-Verlag Berlin Heidelberg 2011

A stream cipher cryptosystem consists of a public algorithm or sequence generator (keystream generator) and a secret key that is known only by the two parties (sender and receiver). Stream ciphers are designed to generate, from a short seed or key, a long sequence of pseudorandom bits, the so-called keystream sequence. Such a sequence is XORed with the plaintext (in emission) in order to obtain the ciphertext, or with the ciphertext (in reception) in order to recover the plaintext. The security of a stream cipher resides in the characteristics of the keystream sequence: long period, good statistical properties and high linear complexity, see [3], [5] and [11].

Most keystream generators are based on maximal-length Linear Feedback Shift Registers (LFSRs) [6] whose output sequences, the so-called PN-sequences, are combined in a non-linear way in order to produce pseudorandom sequences of cryptographic application. Combinational generators, non-linear filters, clock-controlled generators and irregularly decimated generators are just some of the most popular keystream sequence generators [11]. Stream ciphers are used to give cryptographic security to communication systems with requirements of speed and synchronism.

One popular example of a stream cipher is the Self-Shrinking Generator (SSG) [10]. The European Union, via the Stork Project [14], invited the cryptographic community to break this keystream generator by means of any cryptanalytic technique that improves on the Time-Memory Trade-Off (TMTO) attack [8]. In this work, an effective cryptanalytic attack applied to the self-shrinking generator for different lengths L ≤ 120 of its LFSR is developed. More precisely, it consists of a method based on previous attacks on this generator [16] that obtains an improvement of several orders of magnitude with respect to other published results. This improvement lets us assert that the generator can be broken in real time. In fact, the complexity for computer memory (denoted $C_M$) is $O(L^2)$ and the complexity for intercepted sequence (denoted $C_D$) is $O(2^{0.2L})$. On the other hand, the execution time ($C_T$) is in the order of $O(2^{0.5L})$. Being able to decrease this value when compared with previous works ([12], [16]) makes our cryptanalysis suitable to be carried out in real time with dedicated hardware.
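The XOR step that turns a keystream into a cipher can be written in a few lines of Python. The snippet is only an illustration: the function name `xor_bits` and the sample bit values are arbitrary choices, not data from this work.

```python
def xor_bits(bits, keystream):
    """XOR a bit sequence with a keystream of at least the same length."""
    return [b ^ k for b, k in zip(bits, keystream)]


plaintext = [1, 0, 1, 1, 0, 0, 1, 0]
keystream = [0, 1, 1, 0, 0, 0, 0, 0]          # arbitrary sample keystream
ciphertext = xor_bits(plaintext, keystream)    # emission
recovered = xor_bits(ciphertext, keystream)    # reception
```

Applying the same keystream twice cancels out, which is why sender and receiver only need to share the key that seeds the generator.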
2 The Self-Shrinking Generator
The SSG was designed by Meier and Staffelbach [10] for potential use in stream cipher applications. The SSG is easy to implement and consists of a single LFSR of L stages with a primitive polynomial [6]. This register generates a pseudorandom sequence $\{s_n\}$ that is self-decimated, giving rise to the self-shrunken sequence $\{z_n\}$ or output sequence of the SSG. This output sequence is the keystream sequence to be used for cryptographic purposes. The decimation rule is quite simple. Let $(s_{2i}, s_{2i+1})$ $(i = 0, 1, 2, \ldots)$ be pairs of consecutive bits of the sequence $\{s_n\}$; then the decimation rule proceeds as follows:

1. If $s_{2i} = 1$, then $z_j = s_{2i+1}$.
2. If $s_{2i} = 0$, then $s_{2i+1}$ is discarded.
That is, if the first bit of the pair under consideration is 1, then the second bit is included in the output sequence. On the contrary, if the first bit of the pair under consideration is 0, then the second bit is rejected. In this way, specific bits of the sequence $\{s_n\}$ are removed while the remaining bits make up the sequence $\{z_n\}$ or self-shrunken sequence. The key of this generator is the initial state of the LFSR. Periods, linear complexities and statistical properties [10] make the self-shrunken sequences very adequate for their application in stream ciphers. In brief, the SSG is a simplified version of the Shrinking Generator, suggested by Coppersmith et al. [2], which satisfies the same decimation rule but includes in its design two maximal-length LFSRs [4], as well as a particularization of the Generalized Self-Shrinking Generator [9], which generates a family of different keystream sequences.

The sequence $\{s_n\}$ generated by the LFSR is made out of two different subsequences $\{c_n\}$ and $\{b_n\}$ corresponding to the bits of $\{s_n\}$ with even or odd sub-indices, respectively. That is:

$$c_i = s_{2i} \quad \forall i \ge 0 \tag{1}$$
$$b_i = s_{2i+1} \quad \forall i \ge 0. \tag{2}$$

At the same time, the two sequences correspond to the same PN-sequence $\{s_n\}$ but shifted a distance of $2^{L-1}$ bits. The existence of such a shift allows one to express one sequence in terms of the other [16] as follows:

$$b_i = \sum_{j=0}^{L-1} h_j\, c_{i+j}, \tag{3}$$

where $h_j$ are the binary coefficients of h(x), a polynomial of degree at most L − 1 defined by:

$$h(x) \equiv x^{2^{L-1}} \bmod P_c(x), \tag{4}$$

$P_c(x)$ being the LFSR characteristic polynomial [6].

Table 1. h(x) for different polynomials of degree L

L     $P_c(x)$                                          h(x)
36    $x^{36} + x^{25} + 1$                             6, 13, 24
40    $x^{40} + x^{38} + x^{22} + x^{20} + 1$           11, 29, 30
52    $x^{52} + x^{49} + 1$                             2, 25, 28
100   $x^{100} + x^{37} + 1$                            19, 32, 69
278   $x^{278} + x^{273} + 1$                           3, 137, 142
455   $x^{455} + x^{341} + x^{230} + x^{116} + 1$       3, 4, 62, 118, 171, 176, 228, 229, 287, 343, 401
In this way, the odd bits $b_i$ of the sequence $\{s_n\}$ can be expressed in terms of the even bits, that is, $c_{i+j}$ with $0 \le j \le L - 1$. Note that the computation of h(x) for large values of L is not a trivial task. In fact, in this work an ad-hoc programme, based on modular arithmetic properties, has been written to carry out this computation. Table 1 shows the results obtained for different L-degree characteristic polynomials. The polynomial h(x) is encoded by its exponents; for instance, the entry (6, 13, 24) corresponds to the polynomial $h(x) = x^6 + x^{13} + x^{24}$.
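The decimation rule and equations (3) and (4) can be checked on a small register with the following Python sketch. It is not the ad-hoc programme mentioned above: the function names are hypothetical, the LFSR convention (characteristic polynomial $x^5 + x^3 + 1$ read as $s_{n+5} = s_{n+3} + s_n$) is an assumption, and h(x) is obtained by L − 1 repeated squarings of x modulo $P_c(x)$, with polynomials over GF(2) stored as integer bit masks.

```python
def lfsr(taps, state, n):
    """First n bits of s, where s[k+L] is the XOR of s[k+t] for t in taps."""
    s = list(state)
    L = len(state)
    while len(s) < n:
        k = len(s) - L
        s.append(sum(s[k + t] for t in taps) % 2)
    return s[:n]


def self_shrink(s):
    """Keep s[2i+1] whenever s[2i] = 1; discard it otherwise."""
    return [s[i + 1] for i in range(0, len(s) - 1, 2) if s[i] == 1]


def h_exponents(pc_exponents):
    """Exponents of h(x) = x^(2^(L-1)) mod Pc(x) over GF(2), as in (4)."""
    L = max(pc_exponents)
    pc = sum(1 << e for e in pc_exponents)  # bit e set <=> x^e is present

    def reduce_mod(p):
        while p.bit_length() > L:           # degree of p is at least L
            p ^= pc << (p.bit_length() - 1 - L)
        return p

    def square(p):
        out, e = 0, 0
        while p:                            # over GF(2): x^e -> x^(2e)
            if p & 1:
                out |= 1 << (2 * e)
            p >>= 1
            e += 1
        return reduce_mod(out)

    h = 2                                   # the polynomial x
    for _ in range(L - 1):
        h = square(h)
    return [e for e in range(L) if (h >> e) & 1]


# Small case: Pc(x) = x^5 + x^3 + 1 with initial state (1, 0, 0, 0, 1).
s = lfsr((0, 3), [1, 0, 0, 0, 1], 40)
c, b = s[0::2], s[1::2]
z = self_shrink(s)
h = h_exponents([5, 3, 0])
```

For this register the sketch gives h(x) = x^2 + x^3, so relation (3) reduces to $b_i = c_{i+2} + c_{i+3}$; the same register is used in Example 1 of Sect. 3.1.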
3 The Proposed Attack
This cryptanalytic attack combines a statistical component with an algebraic component. As mentioned above, two different subsequences $\{c_n\}$ and $\{b_n\}$ can be found inside the sequence $\{s_n\}$. Breaking the generator means finding the initial condition (the initial bits) of $\{c_n\}$ and $\{b_n\}$, or a particular sub-index from which it is possible to continue the generation of the sequence $\{z_n\}$.

3.1 General Idea
The general idea of this cryptanalysis can be summarized as follows: from the knowledge of N bits of $\{z_n\}$ (intercepted bits) and from the supposition of l bits of the sequence $\{c_n\}$, e.g. $(c_0, c_1, \ldots, c_{l-1})$, a system of linear equations is written via equation (3) to determine:

1. The remaining bits of $\{c_n\}$, denoted $(c_l, c_{l+1}, \ldots, c_{L-1})$.
2. The corresponding bits of $\{b_n\}$, denoted $(b_0, b_1, \ldots, b_{L-1})$.

Once the first L bits of both sequences are known, a certain amount of the self-shrunken sequence $\{z_n\}$ is generated. These bits are then compared with those of the intercepted sequence. If both portions of sequence coincide, then the generator is broken. Otherwise we shift one bit over the intercepted sequence and the same process is repeated. If there is no solution for all the possible shifts over the intercepted bits, then a new supposition of l bits in $\{c_n\}$ is taken and the same process is repeated. Let us see a simple example.

Example 1: Let $P_c(x) = x^5 + x^3 + 1$ be the characteristic polynomial of a maximal-length LFSR. The sequences generated for a particular initial condition are:

$\{c_n\}$ = {1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, . . .}
$\{b_n\}$ = {0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, . . .}
$\{z_n\}$ = {0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, . . .}

We take l = 3, N = 5 intercepted bits starting at $z_0$ and the supposition $(c_0, c_1, c_2) = (1, 0, 1)$. From these data a system of linear equations is set up and, if possible, solved. The remaining bits to be determined are $(c_3, c_4)$ and $(b_0, b_1, b_2, b_3, b_4)$. As $c_0 = 1$, it follows that $b_0 = z_0 = 0$. As $c_1 = 0$, $b_1$ is an unknown. Note that only the bits $c_i = 1$ add equations to the system. In this supposition, q, the number of 0's, equals 1. Next, as $c_2 = 1$, it follows that $b_2 = z_1 = 1$. From the characteristic polynomial $P_c(x)$ of the LFSR, the polynomial $h(x) = x^2 + x^3$ and equation (3), the following system of linear equations is written:

$$\begin{aligned} b_0 &= c_2 + c_3 \\ b_1 &= c_3 + c_4 \\ b_2 &= c_0 + c_3 + c_4 \\ b_3 &= c_0 + c_1 + c_3 + c_4 \\ b_4 &= c_0 + c_1 + c_2 + c_3 + c_4. \end{aligned}$$

Since the supposition fixes the values $(b_0, b_2)$, the system can be solved for the unknowns $(c_3, c_4)$ and $(b_1, b_3, b_4)$. Once the initial conditions are known, $(c_0, c_1, c_2, c_3, c_4) = (1, 0, 1, 1, 1)$ and $(b_0, b_1, b_2, b_3, b_4) = (0, 0, 1, 1, 0)$, a portion of keystream sequence $z_{result} = \{0, 1, 1, 0, 0, \ldots\}$ is generated. Then $z_{result}$ is compared with the intercepted bits and, if both portions match, as is the case here, the generator is broken.

3.2 Cryptanalytic Algorithm
First of all, some additional notation is introduced:

$\{C_k\}_0^{l-1}$ with $(1 \le k \le Q_{real})$ is the k-th supposition for the first l bits of the sequence $\{c_n\}$.
$Q_{real}$ is the number of suppositions to be analyzed.
m is the maximum number of unknowns that remain undetermined.
$C_{an}$ indicates the optimal amount of bits to be considered for a valid comparison between intercepted bits and generated bits.

A pseudo-code of the programmed algorithm is given in Fig. 1. Although the underlying idea of this attack is simple and consequently effective, several remarks must be taken into account.

Remark 1: The greater the number of 0's in a supposition, the fewer equations are introduced into the system. Consequently, the number of undetermined bits is greater too.

Remark 2: Not all the suppositions solve the system of linear equations, so they have to be selected very carefully. It is the selection of suppositions that makes this attack more effective than those using the basic Guess-and-Determine technique.

Remark 3: Defining the block of optimal suppositions means not only selecting a block with the minimum number of suppositions but also ensuring that these suppositions are able to solve the greatest number of bits $(c_0, c_1, \ldots, c_{L-1})$. In this way, a balance between time complexity and intercepted sequence complexity is achieved.
Input: $P_c(x)$, L, l, a portion of N intercepted bits, pointer z pointing to the first intercepted bit $z_0$, h(x) and a supposition block $\{C_k\}_0^{l-1}$

  initialize l ≅ L/2, $C_{an}$ = 2.5 ∗ L
  for k ← 1 to $Q_{real}$ do
    while pointer z < N:
      Step 1: Set the system of linear equations
      Step 2: Solve the system with a bound number m of undetermined unknowns
      Step 3: Compute $(c_0, c_1, \ldots, c_{L-1})$ and $(b_0, b_1, \ldots, b_{L-1})$
              Generate $C_{an}$ bits of sequence $\{z_n\}$
      Step 4: Compare intercepted bits with generated bits
              If they are equal, then break
      Step 5: z = z + 1
    end while
    z = 0, take a new supposition
  end for k

Output: $(c_0, c_1, \ldots, c_{L-1})$ and $(b_0, b_1, \ldots, b_{L-1})$ with which the remaining shrunken sequence is generated.

Fig. 1. Pseudo-code of the cryptanalytic algorithm to break the Self-Shrinking Generator
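Steps 1 to 4 of Fig. 1 can be made concrete for a single supposition and a single alignment of the intercepted bits with the Python sketch below. It is a simplified illustration, not the programme used in this work: all function names are hypothetical, linear forms over $(c_0, \ldots, c_{L-1})$ are stored as integer bit masks, the system is solved by plain Gaussian elimination over GF(2) with every undetermined unknown enumerated (no bound m), and it assumes that $\{c_n\}$ obeys the same recurrence as $\{s_n\}$.

```python
from itertools import product


def lfsr(taps, state, n):
    s = list(state)
    L = len(state)
    while len(s) < n:
        k = len(s) - L
        s.append(sum(s[k + t] for t in taps) % 2)
    return s[:n]


def self_shrink(s):
    return [s[i + 1] for i in range(0, len(s) - 1, 2) if s[i] == 1]


def solutions(equations, n):
    """Yield every solution of a GF(2) system given as (mask, rhs) pairs."""
    pivots = {}  # pivot column -> fully reduced (mask, rhs)
    for mask, rhs in equations:
        for col, (pm, pr) in pivots.items():
            if mask >> col & 1:
                mask, rhs = mask ^ pm, rhs ^ pr
        if mask == 0:
            if rhs:
                return  # inconsistent equation 0 = 1
            continue
        col = mask.bit_length() - 1
        for other, (pm, pr) in list(pivots.items()):
            if pm >> col & 1:
                pivots[other] = (pm ^ mask, pr ^ rhs)
        pivots[col] = (mask, rhs)
    free = [col for col in range(n) if col not in pivots]
    for values in product((0, 1), repeat=len(free)):
        x = [0] * n
        for col, v in zip(free, values):
            x[col] = v
        for col, (pm, pr) in pivots.items():
            x[col] = pr ^ (sum(x[k] for k in free if pm >> k & 1) % 2)
        yield x


def parity(form, x):
    return sum(x[k] for k in range(len(x)) if form >> k & 1) % 2


def attack(taps, L, h, guess, z):
    """Try one supposition `guess` for (c_0..c_{l-1}) against keystream z."""
    # linear forms of c_0 .. c_{L-1+max(h)}; assumes c follows the s recurrence
    forms = [1 << k for k in range(L)]
    while len(forms) < L + max(h):
        k = len(forms) - L
        f = 0
        for t in taps:
            f ^= forms[k + t]
        forms.append(f)
    # equation (3): b_i = sum of c_{i+j} over the exponents j of h(x)
    b_forms = []
    for i in range(L):
        f = 0
        for j in h:
            f ^= forms[i + j]
        b_forms.append(f)
    # Step 1: guessed bits, plus one equation b_i = z_j per guessed c_i = 1
    equations = [(1 << i, g) for i, g in enumerate(guess)]
    ones = [i for i, g in enumerate(guess) if g == 1]
    equations += [(b_forms[i], z[j]) for j, i in enumerate(ones)]
    # Steps 2 to 4: solve, regenerate the keystream, compare
    for c in solutions(equations, L):
        b = [parity(f, c) for f in b_forms]
        state = [bit for pair in zip(c, b) for bit in pair][:L]
        candidate = self_shrink(lfsr(taps, state, 8 * len(z) + 2 * L))
        if candidate[:len(z)] == list(z):
            return c, b
    return None


# Example 1: Pc(x) = x^5 + x^3 + 1, h(x) = x^2 + x^3, N = 5 intercepted bits.
found = attack((0, 3), 5, [2, 3], (1, 0, 1), [0, 1, 1, 0, 0])
```

A full attack would wrap `attack` in the two loops of Fig. 1, over the suppositions of the block and over the shifts of the pointer z; a supposition such as (1, 1, 1) is rejected here because its system is inconsistent.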
3.3 Computation of the Supposition Block
A natural way of facing the problem of supposition selection is to take a large number of suppositions. This is the method found in [12] and [16]. In this work, however, the size of the block is much smaller, as we choose just the optimal suppositions, that is, the suppositions able to solve the system of linear equations. Note that a part of the bits $(c_0, c_1, \ldots, c_{L-1})$ may not be solved. In that case, the remaining bits are determined from the solved ones by exhaustive search. On the other hand, in a PN-sequence the distance among blocks of l bits with a number q of 0's depends on the degree L of the characteristic polynomial. This distance tells us how many intercepted bits are necessary to break the cryptosystem with a high probability of success. The optimal resolution of the system is sensitive to the equations involved. In fact, if the bit $c_i$ is a '0' in the supposition $\{C_k\}$, then the corresponding bit $b_i$ is unknown and its corresponding equation (via equation (3)) is not included in the system. The optimization idea is to reject suppositions that discard equations considered important for solving the greatest number of bits. The way of filtering all the possible suppositions $\{C_k\}_0^{l-1}$ with a number q of 0's is just to select those $\{C_k\}_0^{l-1}$ with 1's in the important equations.
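The filtering step can be pictured with a short Python sketch. Which equations count as important is not specified here, so the set `important` below is a purely hypothetical input, as is the function name `supposition_block`.

```python
from itertools import combinations


def supposition_block(l, q, important):
    """Yield the l-bit suppositions with exactly q zeros, none of them
    placed at an index listed in `important` (hypothetical index set)."""
    allowed = [i for i in range(l) if i not in important]
    for zeros in combinations(allowed, q):
        guess = [1] * l
        for i in zeros:
            guess[i] = 0
        yield tuple(guess)


# With l = 3, q = 1 and index 0 marked as important, two suppositions remain.
block = list(supposition_block(3, 1, {0}))
```

Without any filter, l = 20 and q = 5 give 15504 suppositions, so every index marked as important shrinks the block that has to be tried.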
4
Numerical Results
Execution time for this algorithm is computed by considering the number of suppositions and the number of bits in the intercepted sequence. In this way, we can compute the number of tries needed to break the generator.
4.1
A Comparison with Other Cryptanalysis
Numerical results are given in the following tables. In Table 2, the results of this work are shown for different parameters: $Q_{real}$ number of suppositions, N number of intercepted bits, $B_{nr}$ number of undetermined bits in solving the system, $C_T$ time complexity, $C_M$ memory complexity, $C_D$ complexity for intercepted data and $P_{succ}$ percentage of success in the cryptanalysis. For every value of L, 10 different LFSRs have been analyzed. The values shown are the averages of the obtained results.

Table 2. L vs. different orders of magnitude

L     Qreal       N      Bnr       CT                CM     CD                     Psucc
36    1736        500    2         O(2^11 ∗ 2^9)     L^2    2^9                    87%
40    2736        950    3         O(2^11 ∗ 2^10)    L^2    2^10                   100%
52    12376       5500   4         O(2^14 ∗ 2^12)    L^2    2^12                   85%
100   462411533   2^21   12 − 13   O(2^28 ∗ 2^21)    L^2    2^21                   97%
120   2^34        2^25   11        O(2^34 ∗ 2^25)    L^2    2^25 = O(2^(0.2∗L))    97%
In Table 3, the results obtained by different authors are compared for the parameters l, $Q_{real}$, N and q, with L = 40. Note that in [16] the number q of 0's is greater than in the other cases, so the number of undetermined bits increases, as does the time complexity.

Table 3. Parameter comparison from different authors for L = 40

L = 40                         Qreal                 N            q
Mihaljevic (l = 20)            2^(L−l) = 1048576     10^6         5
Pazo-Robles et al. (l = 20)    2736                  700 − 800    5
Zhang (l = 25)                 2^22                  O(2^8)       9-10
In Table 4, the complexities $C_T$, $C_M$ and $C_D$ for different authors are compared. Note that in [16] Zhang et al. try to solve as many equations as possible, while in the present work we solve just those equations that minimize the number of undetermined bits. Because of this, our maximal time complexity is $O(2^{0.6L})$, while their maximal time complexity reaches $O(2^{L})$.
Table 4. Complexities from different authors

Author                CT                            CM        CD
Mihaljevic            O(2^(0.7∗L))                  O(L)      O(2^(0.5∗L))
Pazo-Robles et al.    O(2^(0.5∗L)) − O(2^(0.6∗L))   O(L^2)    < O(2^(0.25∗L))
Zhang                 O(2^(0.7∗L)) − O(2^L)         O(L^2)    O(2^(0.2∗L))

5
Hardware Implementation
This section is just a sketch of a hardware implementation of the cryptanalysis developed in the previous sections. Breaking the SSG requires adequate programming and a physical environment in which to implement it. The proposed solution that satisfies both conditions is the so-called FPGA (Field Programmable Gate Array). This is bit-oriented logic without an operating system and with all the components working at speeds close to that of the clock. Parallel programming suits this kind of logic, and consequently it would be our choice for cryptanalytic purposes. The programmable logic components of an FPGA have the functionality of basic logic gates such as AND, OR, XOR, NOT or even more complex combinational functions. The implementation of an LFSR by means of FPGAs is very easy and can be carried out by using configurations already available in FPGA chips.

Table 5. Supposition generation time for different L and programming environments

L      Ns           Matlab     Labview     FPGA
36     1736         26 sec     0.86 sec    1.736 msec
40     2736         41 sec     1.4 sec     2.736 msec
52     12376        186 sec    6.2 sec     12.375 msec
100    462411533    80 days    2.5 days    7 − 8 min
In order to analyze computation times for very simple procedures, executions in different programming environments have been compared: MatLab, Labview and an FPGA board (NI-RIO-EVAL-101 National Semiconductors) [15]. For instance, the generation of a supposition in each environment gives the following results:
– Times in MatLab: the processing of each supposition takes 15 msec.
– Times in Labview: the processing of each supposition takes 0.5 msec.
– Times on the FPGA board: the processing of each supposition takes 0.001 msec = 1 µsec.
The transition from MatLab to Labview reduces the generation time by a factor of 30. The transition from Labview to FPGAs reduces the expected generation time of a supposition in Labview by a factor of 100 to 500. Table 5 compares different values of L and programming environments. We believe that the implementation of the global algorithm on FPGA boards will give very satisfying numerical results and that the SSG could be broken in real time for L > 100.
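The figures of Table 5 can be checked with a few lines of arithmetic. The sketch below assumes, as the text suggests, that the total generation time is simply the number of suppositions multiplied by a constant per-supposition time; the dictionary and function names are illustrative only.

```python
# Per-supposition generation times quoted in the text, in seconds.
TIME_PER_SUPPOSITION = {"MatLab": 15e-3, "Labview": 0.5e-3, "FPGA": 1e-6}

# Number of suppositions Ns for each register length L (Table 5).
SUPPOSITIONS = {36: 1736, 40: 2736, 52: 12376, 100: 462411533}


def total_time(num_suppositions, environment):
    """Total generation time in seconds, assuming a constant per-supposition cost."""
    return num_suppositions * TIME_PER_SUPPOSITION[environment]


if __name__ == "__main__":
    for L, ns in SUPPOSITIONS.items():
        row = ", ".join(
            f"{env}: {total_time(ns, env):.6g} s" for env in TIME_PER_SUPPOSITION
        )
        print(f"L = {L}: {row}")
```

Under this assumption the FPGA time for L = 36 comes out at about 1.7 msec, and for L = 100 at between 7 and 8 minutes, against roughly 80 days in MatLab.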
6
Conclusion
In this work, a cryptanalysis of the Self-Shrinking Generator with complexity orders that improve on those of other authors has been presented. In fact, our orders of magnitude are $O(2^{0.5L})$ for time complexity, $O(2^{0.25L})$ for the amount of intercepted sequence and $O(L^2)$ for computer memory. The pre-computation time is low, which is why this cryptanalysis is feasible in acceptable computational times. In addition, the target of the Stork Project, breaking the SSG with complexities lower than those of the Time-Memory Trade-Off, has been accomplished. Finally, a hardware development scheme to break the SSG in reasonable time has been sketched as future work.
References
1. Bluetooth, Specifications of the Bluetooth system, Version 1.1, http://www.bluetooth.com/
2. Coppersmith, D., Krawczyk, H., Mansour, Y.: The Shrinking Generator. In: Stinson, D.R. (ed.) CRYPTO 1993. LNCS, vol. 773, pp. 22–39. Springer, Heidelberg (1994)
3. Fúster-Sabater, A.: Run Distribution in Nonlinear Binary Generators. Applied Mathematics Letters 17(12), 1427–1432 (2004)
4. Fúster-Sabater, A., Caballero-Gil, P.: Strategic Attack on the Shrinking Generator. Theoretical Computer Science 409(3), 530–536 (2008)
5. Fúster-Sabater, A., Caballero-Gil, P., Delgado-Mohatar, O.: Deterministic Computation of Pseudorandomness in Sequences of Cryptographic Application. In: Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2009. LNCS, vol. 5544, pp. 621–630. Springer, Heidelberg (2009)
6. Golomb, S.W.: Shift Register-Sequences. Aegean Park Press, Laguna Hill (1982)
7. GSM, Global Systems for Mobile Communications, http://cryptome.org/gsm-a512.htm
8. Hellman, M.: A Cryptanalytic Time-Memory Trade-Off. IEEE Trans. Informat. Theory 26(4), 234–247 (1980)
9. Hu, Y., Xiao, G.: Generalized Self-Shrinking Generator. IEEE Trans. Inform. Theory 50, 714–719 (2004)
10. Meier, W., Staffelbach, O.: The Self-shrinking Generator. In: De Santis, A. (ed.) EUROCRYPT 1994. LNCS, vol. 950, pp. 205–214. Springer, Heidelberg (1995)
11. Menezes, A.J., et al.: Handbook of Applied Cryptography. CRC Press, New York (1997)
12. Mihaljevic, M.J.: A Faster Cryptanalysis of the Self-Shrinking Generator. In: Pieprzyk, J.P., Seberry, J. (eds.) ACISP 1996. LNCS, vol. 1172, pp. 182–189. Springer, Heidelberg (1996)
13. Rivest, R.L.: The RC4 Encryption Algorithm. RSA Data Sec., Inc. (March 1998)
14. Stork Project, http://www.stork.eu.org/documents/RUB-D6-2-1.pdf
15. Xilinx, http://www.xilinx.com
11. National Instruments, http://www.ni.com/pdf/products/us/cat-flexriofpga.pdf
16. Zhang, B., Feng, D.: New Guess-and-Determine Attack on the Self-Shrinking Generator. In: Lai, X., Chen, K. (eds.) ASIACRYPT 2006. LNCS, vol. 4284, pp. 54–68. Springer, Heidelberg (2006)
About Nonnegative Matrix Factorization: On the posrank Approximation Ana de Almeida CISUC - Center for Informatics and Systems, University of Coimbra, Portugal
[email protected]
Abstract. This work addresses the concept of nonnegative matrix factorization (NMF). Some relevant issues for its formulation as a nonlinear optimization problem will be discussed. The primary goal of NMF is that of obtaining good quality approximations, namely for video/image visualization. The importance of the rank of the factor matrices and the use of global optimization techniques is investigated. Some computational experience is reported indicating that, in general, the relation between the quality of the obtained local minima and the factor matrices dimensions has a strong impact on the quality of the solutions associated with the decomposition. Keywords: Signal processing, non negative matrix factorization, feature extraction, dimensionality reduction.
1
Introduction
This paper focuses on a particular optimization problem that has potential uses in a variety of applications. In the following, however, it will mainly be addressed within the framework of Image Signal Processing, so our motivation and examples come mainly from this area. The problem can be stated as: given a non-negative matrix $V \in \mathbb{R}^{m \times n}$, find a decomposition $V = WH$, where $W \in \mathbb{R}^{m \times r}$ and $H \in \mathbb{R}^{r \times n}$, such that W and H are also nonnegative matrices. This problem is commonly known as Nonnegative Matrix Factorization, or NMF for short. NMF has been presented (and used) as a method capable of finding the underlying parts-based structure of complex data. It is accepted that NMF can produce both object detection and recognition with characterization of a pattern (or classification of different patterns) as well as dimension reduction. The method has been intensively applied (with some degree of satisfaction) in diverse fields of science such as biomedical applications and face and object recognition, amongst many others, ranging from data mining to semantic extraction. In fact, for all of these, V is no more than a data matrix, and this type of decomposition becomes necessary mainly due to hardware limitations for data analysis, leading to the need for rank optimization (dimensionality reduction). There are, of course, other known decomposition techniques, but only NMF ensures that the nonnegativity constraints in the model are obeyed. When images are concerned, this is especially important since any pixel that composes the image has a non-negative value [1].
Within image signal processing each data matrix V is made up of several images showing a composite object in many articulations or poses. In a simplified way this means that: each column (line) in V can be an image or a sequence of images; the columns in W represent the basis elements for this data space; the columns of H denote the coefficient sequences representing n images in the basis elements. The importance of NMF for Video Signal Processing is twofold. On the one hand, it can be applied as a compression tool (prior to coding) since the factor matrices tend to be sparse. In fact, if a good enough low-rank approximation is produced, there are far fewer matrix entries to code than when simply using V. On the other hand, NMF can be used to detect scene boundaries or special features within the video, which can be used in video summarization/segmentation tools or even for devising new methods for motion detection. However, the crucial issue behind any NMF application within signal processing approximation is the assurance that the decomposition presents good quality solutions, i.e., that the information loss is not important for the final result.
This paper is organized as follows: the next section describes the most common formulations and algorithmic approaches, focusing on constrained optimization. Section 3 presents some computational experiments, followed by a section on conclusions and trends for future work.
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 295–304, 2011. © Springer-Verlag Berlin Heidelberg 2011
2
Classical Formulations and Algorithms
2.1
Summary on Previous Works
The most common formulation of NMF as an optimization problem is the minimization of the Frobenius norm (see footnote 1). Given a non-negative matrix $V \in \mathbb{R}^{m \times n}$, solve the following non-linear minimization problem

$$\min \ \tfrac{1}{2}\,\|V - WH\|_F^2 \quad \text{s.t.} \quad W_{m \times r} \ge 0,\ H_{r \times n} \ge 0. \qquad (1)$$

Clearly, the product WH is an approximate factorization of rank r. Why approximate? We shall soon see that an appropriate decision on the value of r is critical in practice. The choice of the parameter r is usually problem dependent, but it is generally chosen so that $1 < r \ll \min\{m, n\}$. Therefore, WH can be thought of as a compressed form of V. Important challenges affecting the numerical optimization of (1) include the existence of local minima due to the non-convexity of the objective function in both W and H. In the field of image processing, since our eyes cannot detect image distortion below a certain level, a good enough approximation is considered to be a good solution.
Footnote 1: There are alternative formulations, but the minimization of the Frobenius norm is the one normally chosen and the best characterized so far.
We can briefly characterize the numerical approaches used for solving (1) that can be found in the related literature by the following four general techniques: a) Alternating Constrained Least-Squares; b) Multiplicative Update Rules (Fixed Point approach); c) Projected Gradient Descent Methods; d) Unconstrained Newton-type Methods. The first type of method can be found right at the introduction of NMF in a work dating from 1994 by Paatero and Tapper [9]. The authors analysed measurements taken from environmental data, trying to find a small number of main causes that could explain a large set of measurements. Their analysis was merely empirical and the work did not present any theoretical foundations for the use of NMF or for the algorithm that was described. They used several different initializations for an alternating least squares algorithm, built specifically for their purpose, in an attempt to obtain a global solution. Independently, Lee and Seung introduced the concept of NMF in a machine learning environment [6]. They used V as a matrix of training examples and the minimization of the Frobenius norm to extract semantic features as basic patterns for faces (eyes, lips, noses, etc.). Again, several runs of the algorithm with different initial points were attempted. Although they provided a proof of convergence for their multiplicative updating-rule algorithm, this proof was recently corrected by C. Lin [5], since it only worked for interior points of the feasible region. The authors also claimed that their updating rules were fast. Nevertheless, each optimization may require extensive amounts of processing time. Since Lee and Seung's seminal papers, a great deal of published work has been devoted to the extension, application and improvement of their algorithm.
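To make technique b) concrete, the following is a minimal sketch of multiplicative update rules for problem (1), in the form commonly attributed to Lee and Seung [7]. It is an illustration only and not the code used in any of the cited works: plain Python lists are used to keep it dependency-free, and the small constant `eps`, the random initialization and the function names are assumptions of this sketch.

```python
import random


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    cols_B = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_B] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def objective(V, W, H):
    """0.5 * ||V - WH||_F^2, the objective of problem (1)."""
    WH = matmul(W, H)
    return 0.5 * sum(
        (v - p) ** 2 for v_row, p_row in zip(V, WH) for v, p in zip(v_row, p_row)
    )


def multiplicative_update(V, W, H, eps=1e-12):
    """One sweep: update H with W fixed, then W with the new H."""
    Wt = transpose(W)
    num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
    H = [[h * n / (d + eps) for h, n, d in zip(hr, nr, dr)]
         for hr, nr, dr in zip(H, num, den)]
    Ht = transpose(H)
    num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
    W = [[w * n / (d + eps) for w, n, d in zip(wr, nr, dr)]
         for wr, nr, dr in zip(W, num, den)]
    return W, H


def nmf(V, r, iterations=200, seed=0):
    """Run the updates from a random positive start; returns (W, H, history)."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.uniform(0.1, 1.0) for _ in range(r)] for _ in range(m)]
    H = [[rng.uniform(0.1, 1.0) for _ in range(n)] for _ in range(r)]
    history = [objective(V, W, H)]
    for _ in range(iterations):
        W, H = multiplicative_update(V, W, H)
        history.append(objective(V, W, H))
    return W, H, history


if __name__ == "__main__":
    V = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 1.0, 1.0]]
    W, H, history = nmf(V, r=2)
    print("objective: start", history[0], "end", history[-1])
```

Because each factor is only ever multiplied by a nonnegative ratio, nonnegativity is preserved without any projection step; the price is that the iteration can stall at a point that depends on the initialization, which is the sensitivity to starting points discussed above.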
However, more theoretical and systematic approaches are few, one of the most consistent being the work by Plemmons and his group (e.g. [11] and [12]). Using a transformation of the variables, $x_{ij} \ge 0 \iff x_{ij} = y_{ij}^2$ with $y_{ij}$ left unconstrained (for $x_{ij}$ representing either $w_{ij}$ or $h_{ij}$), they turn (1) into an equivalent unconstrained problem. Subsequently, the appropriate Karush–Kuhn–Tucker conditions are determined and Newton-type approaches are used to find stationary points. Independently of the use of constrained or unconstrained optimization, the question of good initializations remains an open issue. The authors used a Monte-Carlo strategy to generate different starting points to test their approach under various applications. A more specialized but very coherent approach with interesting applications can be found in the works by Cichocki, Zdunek and others, in the area of blind image separation. Over the last three years, this group has provided this field of research with important contributions, namely for approximating Tensor Factorizations [1].
2.2
Constrained Optimization: Related Issues
Constrained optimization appeals to us as being capable of producing better results than unconstrained optimization. However, due to the large-scale matrices usually used within image and signal processing, Lee-Seung type approaches are not efficient at all and there is a pressing need to overcome the complexity issues encountered. Furthermore, we pursue good quality solutions since we need good quality for the recovered information. After thorough research into theoretical results related to the nonnegative decomposition of nonnegative matrices, we found that the mathematical foundations specific to NMF are still scarce. The most important general results for the intended optimization can be found in works published in the eighties by Higham and Hazewinkel ([4][3]). The latter proved some properties that are numerically relevant for characterizing the existence of solutions, namely a condition for the existence of an exact factorization. The first result presented is quite obvious and states that, whenever there is one exact factorization, there is an infinity of possible pairs for the decomposition.
Theorem 1. If $(W^*, H^*)$ is a solution for (1), then so is $(aW^*, H^*/a)$, $\forall a > 0$.
In fact, a stronger statement can be derived since, if $(W^*, H^*)$ is a global minimum for (1), so are all the pairs of the form $(W^* D, D^{-1} H^*)$ for any nonnegative invertible matrix D (provided $D^{-1} H^*$ is also nonnegative). In [2] we can find the proof of the same result in terms of simplicial cones, characterizing not only the existence of solutions for (1) but also their uniqueness:
Property 1 (Donoho et al., 2004). If the elements of V are strictly positive then there exists an infinity of solutions for the optimization problem (1). Moreover, in order to have a unique solution, V must have zero-valued elements.
In [3] Hazewinkel presents the following definition, which allows a far more useful condition for the existence of an exact factorization for a given matrix V to be stated:
Definition 1 (Hazewinkel, 1984). Given $V_{m \times n} \ge 0$, the minimum r such that there exist $W_{m \times r}$, $H_{r \times n}$ for the exact factorization $V = WH$ is the positive rank of V, posrank(V).
Theorem 2 (Hazewinkel, 1984). Given $V_{m \times n} \ge 0$, $\operatorname{rank}(V) \le \operatorname{posrank}(V) \le \min\{m, n\}$.
Thus, obtaining exact factorizations depends directly on the value of r that is used. Nevertheless, how to determine the value of posrank is still unknown. But this also means that, when thinking of global optimization, the estimation of this value is especially important for the success of the approach since, if we ensure that exact factorizations exist, then we know that the minimal possible value of the objective function is zero! However, there is a major issue that needs to be tackled: what is the "best" factorization dimension, r, with which an NMF algorithm can determine as few basis elements as possible but still obtain good compression of the data? Is this value algorithm dependent? And, finally, is it possible to determine it? Next, we introduce the first steps that serve to show the importance of the posrank value for the quality of the approximations to NMF.
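The non-uniqueness stated in Theorem 1 is easy to check numerically. The sketch below, which is not part of the original work, rescales a factor pair by a positive diagonal matrix D, the simplest case in which both $W D$ and $D^{-1} H$ are guaranteed to stay nonnegative; the helper names and the example matrices are arbitrary.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    cols_B = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_B] for row in A]


def rescale(W, H, d):
    """Return (W D, D^{-1} H) for the positive diagonal D = diag(d)."""
    WD = [[w * dj for w, dj in zip(row, d)] for row in W]
    DinvH = [[h / dj for h in row] for row, dj in zip(H, d)]
    return WD, DinvH


if __name__ == "__main__":
    W = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]]
    H = [[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 4.0, 2.0]]
    W2, H2 = rescale(W, H, d=[2.0, 0.5])
    print(matmul(W, H))
    print(matmul(W2, H2))
```

Both products are the same matrix, so any objective that depends on W and H only through WH, such as (1), cannot distinguish the two factor pairs.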
3
Experimental Work
The experimental work in this section is intended to assess the influence of the posrank value on feature extraction for image signal processing. In all the tests a global optimization software, GAMS/MINOS (see footnote 2), was used to solve the nonlinear optimization problem (1). This is still the most widely used GAMS non-linear programming (NLP) solver. MINOS was developed at the Systems Optimization Laboratory at Stanford University. Linearly constrained models are solved with a very efficient and reliable reduced-gradient technique that exploits the sparsity of the model. The results obtained were compared with the results from approximation algorithms, namely Lin's algorithm [5]. This is an alternating non-negative least squares method using a projected gradients approximation algorithm. The same initialization factor matrices were used with both approaches.
3.1
MINOS Decompositions for Several Values of Parameter r
The first experiment for assessing the importance of the values of r chosen for the factorization uses a special block matrix, V, specified by

$$V = [\,V_1 : u_1\ u_2\ \ldots\ u_{n-r}\,], \quad u_i \in \mathbb{R}^{m},\ V_1 \in \mathbb{R}^{m \times r} \qquad (2)$$

where $u_i = [v_{1,r+i} \ldots v_{m,r+i}]^T$, $i = 1, \ldots, n-r$, and $v_{k,r+i} = \sum_{j=1}^{r} v_{kj}\,\alpha_{ji}$, $\forall k = 1, \ldots, m$, with $\alpha_{ji} \ge 0$, $\forall i, j$. It can easily be shown that at least one exact factorization $V = WH$ exists and is given by $W = V_1$ and

$$H = \left[\begin{array}{c|ccc} & \alpha_{11} & \cdots & \alpha_{1\,n-r} \\ I_r & \vdots & & \vdots \\ & \alpha_{r1} & \cdots & \alpha_{r\,n-r} \end{array}\right] \qquad (3)$$

Consider the image presented in Figure 1, for which the underlying pixel matrix (with values in $[0, 1]^{24 \times 48}$) obeys the special form of (2). In the first experiment the parameter r is set to 2, and the MINOS/GAMS optimization software is run to solve the non-linear optimization problem (1) with this particular matrix as V. Even using several different initial points, the known optimal solution was never found. The best examples obtained (in terms of the quality of the images obtained) are the ones presented in Fig. 2. Clearly, not even a fairly good approximation was obtained. In fact, for r = 2 all the obtained approximations are quite poor, both in terms of objective function values and in terms of the quality of the recovered image (remember that the optimum o.f. value is 0). These results are easily explained by the fact that Theorem 2 was not taken into account. The rank of the pixel matrix V for the original figure is 4. Using r = 2 for the factorization guarantees that an exact decomposition can never be found. These experiments also seem to indicate that it can be very difficult to find good enough approximations in terms of image quality, although the compression rate would be better than with higher values of r.
Footnote 2: http://www.gams.com/default.htm
Fig. 1. An image whose pixel matrix $V_{24 \times 48}$ obeys the special form of (2)
Fig. 2. Images obtained with pairs (W, H) when r = 2. Panel for Initial Point I ($w_{ik} = 0.5 = h_{kj}$): O.F. value = 4.7. Panel for Initial Point II ($w_{ik} = 0.1$, $h_{kj} = 0.9$): O.F. value = 7.01.
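The block construction (2) and the exact factor pair (3) can be reproduced on a toy example. The sketch below is an illustration under stated assumptions: the 3 × 2 block `V1` and the coefficient matrix `alpha` are arbitrary small nonnegative matrices chosen here, not the pixel data of Fig. 1, and the function names are hypothetical.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    cols_B = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_B] for row in A]


def build_block_matrix(V1, alpha):
    """Return V = [V1 : V1 * alpha], i.e. form (2) with u_i = V1 * alpha[:, i]."""
    U = matmul(V1, alpha)
    return [row_v + row_u for row_v, row_u in zip(V1, U)]


def exact_factors(V1, alpha):
    """Return W = V1 and H = [I_r : alpha], the factor pair of (3)."""
    r = len(alpha)
    identity = [[1.0 if i == j else 0.0 for j in range(r)] for i in range(r)]
    H = [id_row + a_row for id_row, a_row in zip(identity, alpha)]
    return V1, H


if __name__ == "__main__":
    V1 = [[0.5, 0.0], [0.25, 0.5], [0.0, 1.0]]    # m = 3, r = 2
    alpha = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]    # r x (n - r), with n = 5
    V = build_block_matrix(V1, alpha)
    W, H = exact_factors(V1, alpha)
    print(matmul(W, H) == V)
```

Since the last n − r columns of V are nonnegative combinations of the first r, the pair (W, H) is nonnegative by construction, so the objective of (1) can reach zero whenever the factorization dimension is at least this r.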
As a matter of fact, even if the exact value of posrank is not known, using values of r such that rank(V) = 4 ≤ r ≤ min{24, 48} we did find excellent approximations of the original image. As an example, for r = 4 and using the same initial points I and II from Figure 2, we obtained the images depicted in Figure 3. Surprisingly enough, even with initial point II, not only has the value of the objective function decreased, but the recovered image is also somewhat better defined than the images in Figure 2. Going further with the experiment, that is, using r = 12 (Fig. 4), for initial point I, as would be expected, the MINOS software terminates with a very good approximation in terms of the final image. However, for initial point II, the recovered image is still of worse quality, even though it improved slightly over the previous one. We can also see yet another reduction in the final objective function value when compared with the previous results.
Fig. 3. Images recovered with pairs (W, H) when r = 4. Panel for Initial Point I ($w_{ik} = 0.5 = h_{kj}$): O.F. value = 0.0. Panel for Initial Point II ($w_{ik} = 0.1$, $h_{kj} = 0.9$): O.F. value = 5.12.
Fig. 4. Images obtained with pairs (W, H) when r = 12. Panel for Initial Point I ($w_{ik} = 0.5 = h_{kj}$): O.F. value = 0.0. Panel for Initial Point II ($w_{ik} = 0.1$, $h_{kj} = 0.9$): O.F. value = 2.801.
3.2
MINOS versus Lin’s Approximation Algorithm
Experimenting with a known example, we used a reduced Swimmer Matrix, in which a stick figure with four limbs in articulated positions depicts a swimmer (in a kind of) swim (Fig. 5). The torso remains invariant, which is not relevant for our rank approximation analysis.
Fig. 5. Original Swimmer image with 10 different positions
The rank of the original swimmer image matrix V is 25; neither the value of the posrank nor whether an exact factorization can be obtained for some value of r is known. In the first experiment with MINOS the value of r was increased until it was equal to or larger than the rank of the original data matrix. As can be seen from the sequence of images presented in Figure 6, even for the larger values of the parameter r it was not possible to find any good enough approximation of the original image. It is noticeable that, as the value of r increases, the image definition also increases and the values of the objective function decrease.
Fig. 6. Images obtained with pairs (W, H) from MINOS when r = 25, 35, 40 and initializations $w_{ik} = 0.5 = h_{kj}$. O.F. values, in panel order (R = 25, R = 35, R = 40): 16.8841, 19.0514, 10.128.
Applying the approximation algorithm presented in [5] (an alternating non-negative least squares method using a projected gradients approximation algorithm) and comparing with the previous approximation results for the same initialization factor matrices, Lin's algorithm performs even worse than MINOS in terms of final image recovery.
Fig. 7. Image obtained with Lin's projected-gradient algorithm using the same initialization as in Fig. 6 (panel R = 40; O.F. value = 14.252871)
Despite the fact that the image recovery is clearer in the tests obtained by MINOS using r = 25 and r = 35, the o.f. value for Lin's approximation (Figure 7) is smaller than either of those. Interestingly enough, the image for r = 35 in Figure 6 is slightly more defined than the one for r = 25 in spite of their respective objective values. This seems to indicate that the minimization function is not adequate for the particular problem at hand (image recovery) in terms of visual quality.
4
Conclusion
This paper aims to draw attention to some important issues that arise when using nonnegative matrix factorization within Image Signal Processing for a pixel image matrix $V_{m \times n}$. Namely, setting the decomposition parameter r to values rank(V) ≤ r ≤ min{m, n} is important to ensure that good quality images can be achieved. When the value of r increases, although larger factor matrices are obtained, the potential loss of information decreases, which results in better approximations. Nevertheless, a good initialization (or seeding) is still an important open problem, even when the rank of V is known. In terms of image recovery quality, the experiments seem to indicate that the minimization of the Frobenius norm may not be the most adequate optimization function: the values of the final solutions do not reflect the clarity of the associated recovered image. Since good quality solutions are a basic need, global optimization seems a very interesting option. This implies finding the exact factorization, which amounts to a guarantee that we can obtain factor matrices that will reproduce the data without any introduced errors. To pursue global optimization, a good estimate of the positive rank is, therefore, much needed. On the other hand, active-set methods can find stationary points (local minima) for small values of m, n and r, but are unable to handle optimization problems when these values are large, which is the case for video and image data. The dimensions of the problems within these areas tend, in practice, to be quite large, so Projected-Gradient type algorithms can be useful for dealing with this kind of dimension. The most famous one is due to Lin ([5]), but it still seems to be quite slow. Furthermore, when compared with MINOS over the same initializations, the results obtained with Lin's algorithm were always very poor.
Sparsity in W and H is quite important, not only for algorithms to efficiently compute the decomposition pair of matrices but also for subsequent signal processing phases, so that it can decrease the amount of further loss of information. It should therefore also be explored and integrated into NMF algorithms.
References
1. Cichocki, A., Zdunek, R., Phan, A.H., Amari, S.: Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. John Wiley, Chichester (2009)
2. Donoho, D., Stodden, V.: When Does Non-negative Matrix Factorization Give a Correct Decomposition into Parts? In: Thrun, S., et al. (eds.) Proceedings of Advances in Neural Information Processing, NIPS 2003, vol. 16. MIT Press, Cambridge (2004)
3. Hazewinkel, M.: On positive vectors, positive matrices and the specialization order. CWI report PM-R8407 (1984)
4. Higham, N.J.: Matrix Nearest Problems and Applications. In: Gover, M., et al. (eds.) Applications of Matrix Theory, pp. 1–27. Oxford University Press, Oxford (1989)
5. Lin, C.: Projected Gradient Methods for Non-negative Matrix Factorization. Neural Computation 19, 2756–2779 (2007)
6. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999)
7. Lee, D.D., Seung, H.S.: Algorithms for nonnegative matrix factorization. In: Advances in Neural Information Processing, vol. 13, pp. 556–562. MIT Press, Cambridge (2001)
8. Hoyer, P.O.: Nonnegative Matrix Factorization with Sparseness Constraints. Journal of Machine Learning Research 5, 1457–1469 (2004)
9. Paatero, P., Tapper, U.: Positive Matrix Factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5, 111–126 (1994)
10. Pascual-Montano, A., Carazo, J.M., Kochi, K., Lehmann, D., Pascual-Marqui, R.: Nonsmooth Nonnegative Matrix Factorization (nsNMF). IEEE Trans. Pattern Analysis and Machine Intelligence 3, 403–415 (2006)
11. Chu, M.T., Diele, F., Plemmons, R., Ragni, S.: Optimality, computation, and interpretation of nonnegative matrix factorizations (2004) (preprint), http://www.wfu.edu/~plemmons
12. Berry, M.W., Browne, M., Langville, A., Pauca, V.P., Plemmons, R.: Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis 52(1), 155–173 (2007)
Stability of Positive Fractional Continuous-Time Linear Systems with Delays Tadeusz Kaczorek Bialystok University of Technology, Faculty of Electrical Engineering, Wiejska 45D, 15-351 Bialystok, Poland
[email protected]
Abstract. Necessary and sufficient conditions for the asymptotic stability of positive fractional continuous-time linear systems with delays are established. It is shown that: 1) the asymptotic stability of the positive fractional system is independent of their delays, 2) the checking of the asymptotic stability of the positive fractional systems with delays can be reduced to checking of the asymptotic stability of positive standard linear systems without delays. Keywords: stability, positive, fractional, linear system, delay.
1 Introduction
A dynamical system is called positive if its trajectory starting from any nonnegative initial state remains forever in the positive orthant for all nonnegative inputs. An overview of the state of the art in positive systems theory is given in the monographs [10, 16]. The problems of stability and control of systems with delays have been considered in [3, 11, 12, 13, 23]. The stability and the robust stability of positive discrete-time linear systems without delays and with delays have been investigated in [1-10, 14-24]. The stability of positive continuous-time linear systems with delays has been addressed in [17]. In this paper new necessary and sufficient conditions for the asymptotic stability of positive fractional continuous-time linear systems with delays will be presented. It will be shown that the asymptotic stability of positive fractional continuous-time linear systems is independent of their delays and that checking the asymptotic stability of the system with delays can be reduced to checking the stability of positive systems without delays. The paper is organized as follows. In section 2 the fractional continuous-time linear systems and their solutions are recalled. Necessary and sufficient conditions for the positivity of this class of fractional systems with delays are given in section 3. The main result of the paper is presented in section 4, where the necessary and sufficient conditions for the asymptotic stability of the positive fractional linear systems with delays are established. Concluding remarks are given in section 5.
The following notation will be used: $\Re$ - the set of real numbers, $Z_+$ - the set of nonnegative integers, $\Re^{n \times m}$ - the set of n × m real matrices, $\Re_+^{n \times m}$ - the set of n × m matrices with nonnegative entries and $\Re_+^{n} = \Re_+^{n \times 1}$, $I_n$ - the n × n identity matrix. The strictly positive vector x with all positive components will be denoted by x > 0.
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 305–311, 2011. © Springer-Verlag Berlin Heidelberg 2011
2 Preliminaries
In this paper the Caputo definition will be used
$$\frac{d^{\alpha}}{dt^{\alpha}} f(t) = \frac{1}{\Gamma(n-\alpha)} \int_0^t \frac{f^{(n)}(\tau)}{(t-\tau)^{\alpha+1-n}}\, d\tau, \quad n-1 < \alpha \le n \in N = \{1, 2, \ldots\} \tag{2.1}$$
where $\alpha \in \Re$ is the order of the fractional derivative, $f^{(n)}(\tau) = \frac{d^n f(\tau)}{d\tau^n}$ and $\Gamma(x) = \int_0^{\infty} e^{-t} t^{x-1}\, dt$ is the gamma function.
Consider the continuous-time fractional linear system with delays
$$\frac{d^{\alpha} x(t)}{dt^{\alpha}} = A_0 x(t) + A_1 x(t-d) + B u(t), \quad 0 < \alpha \le 1 \tag{2.2}$$
where $x(t) \in \Re^n$ is the state vector, $u(t) \in \Re^m$ is the input vector, $A_k \in \Re^{n \times n}$, $k = 0, 1$, $B \in \Re^{n \times m}$, and $d$ is a delay. The initial conditions for (2.2) have the form
$$x(t) = x_0(t) \quad \text{for } t \in [-d, 0] \tag{2.3}$$
The solution to the equation (2.2) with (2.3) can be found by the use of the step method [17]. For $0 \le t \le d$ the solution has the form [17]
$$x(t) = \Phi_0(t) x_0(0) + \int_0^t \Phi(t-\tau)\,[A_1 x_0(\tau - d) + B u(\tau)]\, d\tau \tag{2.4}$$
where
$$\Phi_0(t) = \sum_{k=0}^{\infty} \frac{A_0^k t^{k\alpha}}{\Gamma(k\alpha+1)}, \quad \Phi(t) = \sum_{k=0}^{\infty} \frac{A_0^k t^{(k+1)\alpha-1}}{\Gamma[(k+1)\alpha]} \tag{2.5}$$
Knowing the state vector x(t ) for 0 ≤ t ≤ d in a similar way we can find the state vector for d ≤ t ≤ 2d and next for 2d ≤ t ≤ 3d , … .
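The series (2.5) can be evaluated numerically by truncation. The following is only a minimal sketch for the scalar case $n = 1$ (so $A_0$ is a number `a0`); the function names `phi0` and `phi` and the truncation length `terms` are assumptions introduced here for illustration, and the truncation is adequate only for moderate values of $|a_0| t^{\alpha}$.

```python
import math

def phi0(a0, t, alpha, terms=80):
    """Truncated series for Phi_0(t) in (2.5), scalar case A0 = a0."""
    return sum((a0 ** k) * t ** (k * alpha) / math.gamma(k * alpha + 1)
               for k in range(terms))

def phi(a0, t, alpha, terms=80):
    """Truncated series for Phi(t) in (2.5), scalar case A0 = a0, t > 0."""
    return sum((a0 ** k) * t ** ((k + 1) * alpha - 1) / math.gamma((k + 1) * alpha)
               for k in range(terms))

# For alpha = 1 both series reduce to the ordinary exponential exp(a0 * t).
value_standard = phi0(-2.0, 1.0, 1.0)
# For 0 < alpha < 1 the series gives a Mittag-Leffler-type response.
value_fractional = phi0(-1.0, 1.0, 0.5)
```

For $\alpha = 1$ the sketch reproduces the standard exponential, which is a convenient sanity check before using fractional orders.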
3 Positive Fractional Continuous-Time Systems with Delays
Definition 3.1. The fractional continuous-time linear system with delays (2.2) is called positive if $x(t) \in \Re_+^n$, $t \ge 0$ for any initial conditions
$$x_0(t) \in \Re_+^n \quad \text{for } t \in [-d, 0] \tag{3.1}$$
and all input vectors $u(t) \in \Re_+^m$, $t \ge 0$.
A real matrix $A \in \Re^{n \times n}$ is called a Metzler matrix if its off-diagonal entries are nonnegative. Let $M_n$ be the set of $n \times n$ Metzler matrices.
Theorem 3.1. The fractional continuous-time linear system (2.2) for $0 < \alpha < 1$ is positive if and only if
$$A_0 \in M_n, \quad A_1 \in \Re_+^{n \times n}, \quad B \in \Re_+^{n \times m} \tag{3.2}$$
Proof. It is well-known [17] that $\Phi_0(\tau) \in \Re_+^{n \times n}$ and $\Phi(\tau) \in \Re_+^{n \times n}$ if and only if $A_0 \in M_n$. From (2.4) it follows that $x(t) \in \Re_+^n$, $t \ge 0$ if $A_1 \in \Re_+^{n \times n}$, $B \in \Re_+^{n \times m}$ and $u(t) \in \Re_+^m$ for $t \ge 0$. The necessity can be shown in a similar way as in [17]. □
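The conditions (3.2) are entrywise sign tests, so they are easy to check mechanically. The sketch below assumes matrices stored as nested Python lists; the helper names `is_metzler`, `is_nonnegative` and `is_positive_delay_system` are hypothetical and not part of the paper.

```python
def is_metzler(A):
    """True if all off-diagonal entries of the square matrix A are nonnegative."""
    n = len(A)
    return all(A[i][j] >= 0 for i in range(n) for j in range(n) if i != j)

def is_nonnegative(M):
    """True if every entry of M is nonnegative."""
    return all(entry >= 0 for row in M for entry in row)

def is_positive_delay_system(A0, A1, B):
    """Conditions (3.2): A0 Metzler, A1 and B entrywise nonnegative."""
    return is_metzler(A0) and is_nonnegative(A1) and is_nonnegative(B)

# Illustrative data (an assumption): negative diagonal entries of A0 are allowed,
# negative entries of A1 or B are not.
A0 = [[-1.0, 1.0], [0.5, -2.0]]
A1 = [[0.4, 0.2], [0.1, 0.5]]
B = [[1.0], [0.0]]
positive = is_positive_delay_system(A0, A1, B)
```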
4 Asymptotic Stability of the Positive Fractional Systems
Consider the autonomous fractional positive linear system with delay
$$\frac{d^{\alpha} x(t)}{dt^{\alpha}} = A_0 x(t) + A_1 x(t-d), \quad 0 < \alpha \le 1 \tag{4.1}$$
where $A_0 \in M_n$, $A_1 \in \Re_+^{n \times n}$ and $d > 0$.
Definition 4.1. The positive system (4.1) is called asymptotically stable if
$$\lim_{t \to \infty} x(t) = 0 \quad \text{for any initial conditions (3.1)} \tag{4.2}$$
Definition 4.2. A vector $x_e \in \Re_+^n$ is called the equilibrium point of the positive asymptotically stable system (2.2) for $Bu(t) = 1_n = [1 \ \ldots \ 1]^T \in \Re_+^n$ if the following condition is satisfied
$$0 = A_0 x_e + A_1 x_e + 1_n \tag{4.3}$$
From (4.3) we have
$$x_e = -(A_0 + A_1)^{-1} 1_n \tag{4.4}$$
since for the asymptotically stable system (2.2) the matrix $A_0 + A_1$ is invertible and the inverse matrix $-(A_0 + A_1)^{-1} \in \Re_+^{n \times n}$ [16].
Theorem 4.1. The positive fractional system with delay (4.1) is asymptotically stable if and only if there exists a strictly positive vector $\lambda \in \Re_+^n$ such that
$$(A_0 + A_1)\lambda < 0 \tag{4.5}$$
Proof. First we shall show that if the positive system (4.1) is asymptotically stable then there exists a strictly positive vector $\lambda > 0$ satisfying (4.5). If the positive
system (4.1) is asymptotically stable then the equilibrium point (4.4) is a strictly positive vector and we can choose $\lambda = x_e = -(A_0 + A_1)^{-1} 1_n$. This vector satisfies the condition (4.5) since
$$(A_0 + A_1)\lambda = -(A_0 + A_1)(A_0 + A_1)^{-1} 1_n = -1_n \tag{4.6}$$
Now we shall show that the positive system (4.1) is asymptotically stable if there exists a strictly positive vector $\lambda$ satisfying (4.5). It is well-known that the positive system (4.1) is asymptotically stable if and only if the corresponding transpose system
$$\frac{d^{\alpha} x(t)}{dt^{\alpha}} = A_0^T x(t) + A_1^T x(t-d) \quad (T \text{ denotes the transpose}) \tag{4.7}$$
is asymptotically stable. As a candidate for a Lyapunov function for the positive system (4.7) we choose the function
$$V[x(t)] = x^T(t)\lambda + \int_{t-d}^{t} x^T(\tau)\, d\tau^{\alpha}\, A_1 \lambda \tag{4.8}$$
which is positive for any nonzero $x(t) \in \Re_+^n$, $t \ge 0$. Using (4.8) and (4.7) we obtain
$$\frac{d^{\alpha} V[x(t)]}{dt^{\alpha}} = \frac{d^{\alpha} x^T(t)}{dt^{\alpha}}\lambda + \frac{d^{\alpha}}{dt^{\alpha}}\left[\int_{t-d}^{t} x^T(\tau)\, d\tau^{\alpha}\right] A_1 \lambda = [A_0^T x(t) + A_1^T x(t-d)]^T \lambda + [x^T(t) - x^T(t-d)] A_1 \lambda = x^T(t)[A_0 + A_1]\lambda \tag{4.9}$$
If (4.5) holds then from (4.9) we have $\frac{d^{\alpha} V[x(t)]}{dt^{\alpha}} < 0$ and the system (4.1) is asymptotically stable. □
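The choice $\lambda = -(A_0 + A_1)^{-1} 1_n$ used in the proof suggests a direct numerical check of Theorem 4.1: solve $(A_0 + A_1)\lambda = -1_n$ and test whether $\lambda$ is strictly positive. The sketch below is illustrative only; the function names `solve` and `stability_certificate`, the singularity threshold and the numerical data are assumptions, and the linear solver is a plain Gaussian elimination with partial pivoting.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination; return None if A is (nearly) singular."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-12:
            return None
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def stability_certificate(A0, A1):
    """Return lambda = -(A0 + A1)^{-1} 1_n if it is strictly positive, else None."""
    n = len(A0)
    A = [[A0[i][j] + A1[i][j] for j in range(n)] for i in range(n)]
    lam = solve(A, [-1.0] * n)
    if lam is None or any(v <= 0 for v in lam):
        return None
    return lam

# Illustrative matrices (assumed): A0 Metzler, A1 nonnegative.
lam = stability_certificate([[-1.0, 1.0], [0.5, -2.0]], [[0.4, 0.2], [0.1, 0.5]])
```

A returned vector satisfies $(A_0 + A_1)\lambda = -1_n < 0$ by construction, so it is a certificate in the sense of (4.5); `None` means that this particular candidate fails.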
Theorem 4.2. The positive fractional system with delay (4.1) is asymptotically stable if and only if one of the following equivalent conditions is satisfied:
i) The positive system without delay
$$\dot{x}(t) = Ax(t), \quad A = A_0 + A_1 \tag{4.10}$$
is asymptotically stable,
ii) The matrix $A$ is an asymptotically stable Metzler matrix.
Proof. In [20] it was shown that the positive system (4.10) is asymptotically stable if and only if there exists a strictly positive vector $\lambda \in \Re_+^n$ such that (4.5) holds. Hence by Theorem 4.1 the positive system (4.1) is asymptotically stable if and only if the positive system (4.10) is asymptotically stable. It is well-known [16] that the positive system (4.10) (and (4.1)) is asymptotically stable if and only if the matrix $A$ is an asymptotically stable Metzler matrix. □
To check the asymptotic stability of the Metzler matrix A the following theorem is recommended [22].
Theorem 4.3. The matrix $A \in \Re^{n \times n}$ is an asymptotically stable Metzler matrix if and only if one of the following equivalent conditions is satisfied:
i) all coefficients $a_0, \ldots, a_{n-1}$ of the characteristic polynomial
$$\det[I_n s - A] = s^n + a_{n-1}s^{n-1} + \ldots + a_1 s + a_0 \tag{4.11}$$
are positive, i.e. $a_i > 0$, $i = 0, 1, \ldots, n-1$,
ii) the diagonal entries of the matrices
$$A_{n-k}^{(k)} \quad \text{for } k = 1, \ldots, n-1 \tag{4.12}$$
are negative, where
$$A_n^{(0)} = A = \begin{bmatrix} a_{11}^{(0)} & \ldots & a_{1,n}^{(0)} \\ \vdots & \ldots & \vdots \\ a_{n,1}^{(0)} & \ldots & a_{n,n}^{(0)} \end{bmatrix} = \begin{bmatrix} A_{n-1}^{(0)} & b_{n-1}^{(0)} \\ c_{n-1}^{(0)} & a_{n,n}^{(0)} \end{bmatrix}, \quad A_{n-1}^{(0)} = \begin{bmatrix} a_{11}^{(0)} & \ldots & a_{1,n-1}^{(0)} \\ \vdots & \ldots & \vdots \\ a_{n-1,1}^{(0)} & \ldots & a_{n-1,n-1}^{(0)} \end{bmatrix},$$
$$b_{n-1}^{(0)} = \begin{bmatrix} a_{1,n}^{(0)} \\ \vdots \\ a_{n-1,n}^{(0)} \end{bmatrix}, \quad c_{n-1}^{(0)} = [a_{n,1}^{(0)} \ \ldots \ a_{n,n-1}^{(0)}],$$
$$A_{n-k}^{(k)} = A_{n-k}^{(k-1)} - \frac{b_{n-k}^{(k-1)} c_{n-k}^{(k-1)}}{a_{n-k+1,n-k+1}^{(k-1)}} = \begin{bmatrix} a_{11}^{(k)} & \ldots & a_{1,n-k}^{(k)} \\ \vdots & \ldots & \vdots \\ a_{n-k,1}^{(k)} & \ldots & a_{n-k,n-k}^{(k)} \end{bmatrix} = \begin{bmatrix} A_{n-k-1}^{(k)} & b_{n-k-1}^{(k)} \\ c_{n-k-1}^{(k)} & a_{n-k,n-k}^{(k)} \end{bmatrix}, \tag{4.13}$$
$$b_{n-k-1}^{(k)} = \begin{bmatrix} a_{1,n-k}^{(k)} \\ \vdots \\ a_{n-k-1,n-k}^{(k)} \end{bmatrix}, \quad c_{n-k-1}^{(k)} = [a_{n-k,1}^{(k)} \ \ldots \ a_{n-k,n-k-1}^{(k)}]$$
for $k = 0, 1, \ldots, n-1$.
From Theorem 4.2 we have the following important corollary.
Corollary 4.1. The asymptotic stability of the positive fractional linear system (4.1) is independent of its delay.
Theorem 4.4. The positive fractional linear system (4.1) is unstable if at least one diagonal entry of the Metzler matrix $A = A_0 + A_1$ is nonnegative.
The proof follows immediately from Theorem 4.2 and Theorem in [16].
Example 4.1. Consider the positive fractional system (4.1) with the matrices
$$A_0 = \begin{bmatrix} a & 1 \\ 0.5 & -2 \end{bmatrix}, \quad A_1 = \begin{bmatrix} 0.4 & 0.2 \\ 0.1 & 0.5 \end{bmatrix} \tag{4.14}$$
and arbitrary delay d > 0 . Find the value of the parameter a for which the system is asymptotically stable.
By Theorem 4.4 the positive fractional system (4.1) with (4.14) is unstable if the diagonal entry (1,1) of the Metzler matrix
$$A = A_0 + A_1 = \begin{bmatrix} a + 0.4 & 1.2 \\ 0.6 & -1.5 \end{bmatrix} \tag{4.15}$$
is nonnegative, i.e. $a + 0.4 \ge 0$. Using the condition i) of Theorem 4.3 we obtain
$$\det[Is - A] = \begin{vmatrix} s - a - 0.4 & -1.2 \\ -0.6 & s + 1.5 \end{vmatrix} = s^2 + (1.1 - a)s - (1.5a + 1.32) \tag{4.16}$$
and the positive fractional system is asymptotically stable if and only if the coefficients of the polynomial (4.16) are positive
$$1.1 - a > 0 \quad \text{and} \quad 1.5a + 1.32 < 0 \tag{4.17}$$
Therefore, the positive fractional system (4.1) with (4.14) is asymptotically stable for arbitrary delay $d > 0$ if $a < -\frac{1.32}{1.5} = -0.88$. The same result can be obtained by the use of the condition ii) of Theorem 4.3.
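The recursion (4.13) removes the last row and column of the current matrix by a Schur-complement step, so condition ii) of Theorem 4.3 can be checked with a short loop. The sketch below is an illustration under stated assumptions: the function names `is_stable_metzler` and `example_matrix` are hypothetical, and the loop also tests the diagonal of $A$ itself (the case $k = 0$), which is consistent with Theorem 4.4. It is applied to the matrix (4.15) for a few sample values of $a$.

```python
def is_stable_metzler(A):
    """Check negativity of all diagonals along the reduction (4.13)."""
    M = [row[:] for row in A]
    while M:
        if any(M[i][i] >= 0 for i in range(len(M))):
            return False
        m = len(M) - 1                      # index of the last row/column
        pivot = M[m][m]                     # negative here, hence nonzero
        # Schur complement: A_{m} - b c / pivot, dropping the last row and column.
        M = [[M[i][j] - M[i][m] * M[m][j] / pivot for j in range(m)]
             for i in range(m)]
    return True

def example_matrix(a):
    """Matrix A = A0 + A1 from (4.15) for a given parameter a."""
    return [[a + 0.4, 1.2], [0.6, -1.5]]

stable_case = is_stable_metzler(example_matrix(-1.0))     # a < -0.88
unstable_case = is_stable_metzler(example_matrix(-0.5))   # a > -0.88
```

For the $2 \times 2$ matrix (4.15) the single reduction step gives the scalar $a + 0.4 - \frac{1.2 \cdot 0.6}{-1.5} = a + 0.88$, which is negative exactly when $a < -0.88$, in agreement with (4.17).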
5 Concluding Remarks
Necessary and sufficient conditions for the asymptotic stability of positive fractional continuous-time linear systems with delays have been established (Theorems 4.1 and 4.2). It has been shown that: 1) the asymptotic stability of positive fractional systems is independent of their delays, 2) checking the asymptotic stability of positive fractional systems with delays can be reduced to checking the asymptotic stability of positive standard linear systems without delays. The considerations can also be extended to fractional positive 2D continuous-discrete linear systems with delays.
Acknowledgments. This work was supported by the Ministry of Science and Higher Education in Poland under work No NN514 1939 33.
References
1. Buslowicz, M.: Robust stability of positive discrete-time linear systems with multiple delays with unity rank uncertainty structure or non-negative perturbation matrices. Bull. Pol. Acad. Sci., Tech. Sci. 55(1), 347–350 (2007)
2. Buslowicz, M.: Simple stability conditions for linear positive discrete-time systems with delays. Bull. Pol. Acad. Sci., Tech. Sci. 50(4) (2008)
3. Buslowicz, M.: Robust stability of dynamical linear stationary systems with delays. Publishing Department of Technical University of Bialystok, Warszawa-Bialystok (2000) (in Polish)
4. Buslowicz, M.: Robust stability of scalar positive discrete-time linear systems with delays. In: Proc. Int. Conf. on Power Electronics and Intelligent Control, Warszawa, Paper 163, on CD-ROM (2005)
5. Buslowicz, M.: Stability of positive singular discrete-time system with unit delay with canonical forms of state matrices. In: Proc. 12th IEEE Int. Conf. on Methods and Models in Automation and Robotics, Miedzyzdroje, pp. 215–218 (2006)
6. Buslowicz, M.: Robust stability of positive discrete-time linear systems with multiple delays with linear unit rank uncertainty structure or non-negative perturbation matrices. Bull. Pol. Acad. Sci., Tech. Sci. 52(2), 99–102 (2004)
7. Buslowicz, M., Kaczorek, T.: Robust stability of positive discrete-time interval systems with time-delays. Bull. Pol. Acad. Sci., Tech. Sci. 55(1), 1–5 (2007)
8. Buslowicz, M., Kaczorek, T.: Stability and robust stability of positive discrete-time systems with pure delays. In: Proc. of the 10th IEEE Int. Conf. on Methods and Models in Automation and Robotics, Miedzyzdroje, vol. 1, pp. 105–108 (2004)
9. Buslowicz, M., Kaczorek, T.: Robust stability of positive discrete-time systems with pure delays with linear unit rank uncertainty structure. In: Proc. of the 11th IEEE Int. Conf. on Methods and Models in Automation and Robotics, Miedzyzdroje, Paper 0169, on CD-ROM (2005)
10. Farina, L., Rinaldi, S.: Positive linear systems; Theory and applications. J. Wiley, New York (2000)
11. Górecki, H.: Analysis and synthesis of control systems with delays. WNT, Warszawa (1971) (in Polish)
12. Górecki, H., Fuksa, S., Grabowski, P., Korytowski, A.: Analysis and synthesis of time delay systems. PWN-J. Wiley, Warszawa - Chichester (1989)
13. Górecki, H., Korytowski, A.: Advances in optimizations and stability analysis of dynamical systems. Publishing Department of University Mining and Metallurgy, Kraków (1993)
14. Hinrichsen, D., Ngoc, P.H.A., Son, N.K.: Stability radii of positive higher order difference systems. Systems & Control Letters 49, 377–388 (2003)
15. Hmamed, A., Benzaouia, A., Ait Rami, M., Tadeo, F.: Positive stabilization of discrete-time systems with unknown delays and bounded control. In: Proc. European Control Conference, Kos, Greece, pp. 5616–5622, paper ThD07.3 (July 2007)
16. Kaczorek, T.: Positive 1D and 2D Systems. Springer, London (2002)
17. Kaczorek, T.: Stability of positive continuous-time linear systems with delays. Bull. Pol. Acad. Sci., Tech. Sci. 57(4), 395–398 (2009)
18. Kaczorek, T.: Stability of positive discrete-time systems with time-delays. In: Proc. 8th World Multiconference on Systemics, Cybernetics and Informatics, Orlando, Florida, USA, pp. 321–324 (July 2004)
19. Kaczorek, T.: Choice of the forms of Lyapunov functions for positive 2D Roesser model. Int. J. Applied Math. and Comp. Sci. 17(4), 471–475 (2007)
20. Kaczorek, T.: Asymptotic stability of positive 1D and 2D linear systems. In: Recent Advances in Control and Automation, pp. 41–52. Academ. Publ. House EXIT (2008)
21. Kaczorek, T.: Practical stability of positive fractional discrete-time linear systems. Bull. Pol. Acad. Sci., Tech. Sci. 56(4), 313–317 (2008)
22. Narendra, K.S., Shorten, R.: Hurwitz stability of Metzler matrices. IEEE Trans. Autom. Contr. 55(6), 1484–1487 (2010)
23. Niculescu, S.-I.: Delay effects on stability. A robust control approach. Springer, London (2001)
24. Twardy, M.: On the alternative stability criteria for positive systems. Bull. Pol. Acad. Sci., Tech. Sci. 55(4), 303–385 (2007)
Output-Error Model Training for Gaussian Process Models
Juš Kocijan 1,2 and Dejan Petelin 1
1 Jozef Stefan Institute, 1000 Ljubljana, Slovenia
2 University of Nova Gorica, 5000 Nova Gorica, Slovenia
[email protected]
Abstract. The training of a regression model depends on the purpose of the model. When a black-box model of dynamic systems is trained, two purposes are particularly common: prediction and simulation. The purpose of this paper is to highlight the differences between the learning of a dynamic-system model for prediction and for simulation in the presence of noise for Gaussian process models. Gaussian process models are probabilistic, nonparametric models that recently generated interest in the machine-learning community. This method can also be used for the modelling of dynamic systems, which is the main interest of the engineering community. The paper elaborates the differences between prediction- and simulation-purposed modelling in the presence of noise, which is more difficult to handle when the model is trained for simulation. An example is given to illustrate the described differences.
Keywords: Gaussian process models, dynamic systems, regression, autoregressive models, output-error models.
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 312–321, 2011. © Springer-Verlag Berlin Heidelberg 2011
1 Introduction
Gaussian process (GP) models [9] form a new, emerging, complementary method that can be used for nonlinear, dynamic system identification. The GP model is a probabilistic, nonparametric, black-box model that has generated interest in the machine-learning community over the past 10 years. Among other features, GP models provide a measure of the confidence for their prediction, which also makes these models interesting for solving engineering problems. The modelling of dynamic systems from data, or dynamic systems identification, which will be the focus of this paper, is a common engineering tool. It frequently, though not always, provides models of the input-output behaviour that are used for a variety of purposes. Most of the machine-learning methods for regression modelling, like artificial neural networks, GP models, fuzzy models, etc., are meant for the modelling of a mapping between the input and output data. This is the mapping of a static function between the given input data set and the output or target data set. Dynamics can be introduced into these models if lagged samples of the input
and output signals are fed back and used as regressors. In general, the same holds for GP models. This method of dynamic systems identification with GP models is described in, e.g., [1], [4] and [5]. Dynamic system models can be used [8] for prediction, simulation, optimisation, analysis, control and fault detection. Prediction means that on the basis of previous samples of a process input signal $u(k-i)$ and a process output signal $y(k-i)$ the model predicts one or several steps into the future. There are two possibilities: the model is built to directly predict $l$ steps into the future, or the same model is used to predict a further step ahead by replacing the data at instant $k$ with the data at instant $k+1$ and using the prediction $\hat{y}(k)$ from the previous prediction step instead of the measured $y(k)$. This is then repeated indefinitely. The latter possibility is equivalent to simulation. Simulation, therefore, means that the model simulates future outputs only on the basis of previous samples of a process input signal $u(k-i)$. The term prediction will, in our case, mean one-step-ahead prediction. Both of these sorts of models can be used for systems optimisation, analysis, control and fault detection [8]. The purpose of this paper is to describe the differences between the learning of the dynamic-system model for prediction and for simulation in the presence of noise for Gaussian process models and to illustrate these differences with an example. These two kinds of models are named differently in the literature. The prediction model, schematically depicted in Fig. 1 (left), is named an autoregressive model with an exogenous input (ARX) or an equation-error or series-parallel model, while the simulation model, schematically depicted in Fig. 1 (right), is named an output-error (OE) model or parallel model. Since the paper describes nonlinear models in general, the prefix N for nonlinear is added and NARX and NOE models are discussed in the rest of the paper.
The paper puts an emphasis on modelling for simulation, and therefore training in a parallel configuration.
[Figure 1: block diagrams. Left: a Model block with inputs u(t-d), ..., u(t-d-m) and y(t-1), ..., y(t-n) and output ŷ(t). Right: a Model block with inputs u(t-d), ..., u(t-d-m) and the fed-back predictions ŷ(t-1), ..., ŷ(t-n), obtained from ŷ(t) through $q^{-1}$ blocks.]
Fig. 1. Series-parallel or equation-error or NARX model (left), where the predicted outputs are functions of previous measurements of the input and output signals. Parallel or output-error or NOE model (right), where the predicted outputs are functions of previous measurements of the input signal only and delayed predictions $\hat{y}$ are fed back to the input. $q^{-1}$ is the backshift operator.
The structure of the paper is as follows. The differences between the NARX and NOE models are given in Sec. 2. Sec. 3 describes the Gaussian process models and the modelling of NARX and NOE dynamic systems with GP models. An illustrative example is given in Sec. 4 and the conclusions are given at the end of the paper.
2 Differences between NARX and NOE Models
Consider the system
$$y(k) = f(x(k)) + \epsilon \tag{1}$$
where $x$ is a vector of input regressors at the time instant $k$ and $\epsilon$ is the noise at the output $y(k)$ at the time instant $k$. If there is no noise in the output measurements, then the NARX and NOE models are the same and the differences between them do not matter. Nevertheless, noise always exists in real-life measurements. In the NARX model these measurements are contained within the input regressors that are delayed samples of the output signal. This means that when the NARX model is learned it is assumed that the noise is filtered through part of or the entire dynamics of the system. There are structures with other noise models, but these are more complex. An overview can be found in, e.g., [10], [6]. If, on the other hand, the model is trained without using delayed measurements in the input regressors, i.e. the output-error model, then it is assumed that noise is entering the output signal after the system. The NARX structure suffers from unrealistic noise assumptions, which lead to biased parameter estimates in the presence of disturbances and to other influences on the model's performance [8]. For a dynamic system of order $L$, the one-step-ahead prediction is calculated with the previous process outputs as
$$\hat{y}(k) = f(y(k-1), y(k-2), \ldots, y(k-L), u(k-1), u(k-2), \ldots, u(k-L)), \tag{2}$$
while the simulation is evaluated with the previous model outputs as
$$\hat{y}(k) = f(\hat{y}(k-1), \hat{y}(k-2), \ldots, \hat{y}(k-L), u(k-1), u(k-2), \ldots, u(k-L)). \tag{3}$$
In the first case we have a feedforward prediction, while in the second case we have a recurrent one. A simulation is required whenever the process output cannot be measured during the operation, which is always when the system is simulated independently from the real system. That is frequently the case for fault-detection and control systems design. Nevertheless, the NARX model is by far the most applied model of a dynamic system; not because it is realistic, but because it is easier to train than the NOE model. In the NARX case the model is trained based on loss functions
that are dependent on the prediction error, while in the NOE case the model is trained based on loss functions that are dependent on the simulation error. The hyperplane of the loss function is much more complicated in the second case. However, the model that is used in the parallel configuration does not necessarily have to be trained in a parallel configuration as well if the noise assumptions are relaxed, which is often the case in engineering practice. The disadvantage of using NARX models for the simulation is the error accumulation, which does not happen with a prediction, but may appear with a simulation. In extreme situations, the NARX model used for a simulation can become unstable, regardless of a good one-step prediction performance. The trade-off between the advantages and disadvantages of the NARX and NOE models needs to be evaluated when the model is developed for a model simulation. The situation is the same with GP models as it is described in the following section.
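The difference between the one-step-ahead prediction (2) and the simulation (3), and the error accumulation mentioned above, can be sketched for a generic model $f$. Everything in the example is an assumption made for illustration: the function names, the first-order linear "true" system with coefficient 0.8 and the deliberately mismatched model with coefficient 0.7. It is not the GP model of the paper.

```python
def one_step_predictions(f, y_meas, u, L):
    """Eq. (2): every regressor y(k-i) is a measured output."""
    y_hat = list(y_meas[:L])               # first L samples are taken as given
    for k in range(L, len(u)):
        past_y = [y_meas[k - i] for i in range(1, L + 1)]
        past_u = [u[k - i] for i in range(1, L + 1)]
        y_hat.append(f(past_y, past_u))
    return y_hat

def simulation(f, y_init, u, L):
    """Eq. (3): every regressor is an earlier model output y_hat(k-i)."""
    y_hat = list(y_init[:L])
    for k in range(L, len(u)):
        past_y = [y_hat[k - i] for i in range(1, L + 1)]
        past_u = [u[k - i] for i in range(1, L + 1)]
        y_hat.append(f(past_y, past_u))
    return y_hat

# Hypothetical data: a first-order system driven by a unit step.
u = [1.0] * 30
y = [0.0]
for k in range(1, len(u)):
    y.append(0.8 * y[k - 1] + u[k - 1])

def model(past_y, past_u):
    """Slightly wrong model of the system above (0.7 instead of 0.8)."""
    return 0.7 * past_y[0] + past_u[0]

pred = one_step_predictions(model, y, u, L=1)
sim = simulation(model, y, u, L=1)
max_pred_err = max(abs(p - m) for p, m in zip(pred, y))
max_sim_err = max(abs(s - m) for s, m in zip(sim, y))
```

In this toy setting the prediction error stays proportional to the model mismatch at each step, whereas the simulation error grows as the model's own outputs are fed back.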
3 Modelling with Gaussian Processes
A Gaussian process model is a probabilistic non-parametric model for predictions of output variable distributions. Its use and properties for modelling are thoroughly described in [9]. Here, only a brief description, necessary for understanding this paper, is given. A Gaussian process is a collection of random variables that have a joint multivariate Gaussian distribution. Assuming a relationship between an input $X = [x_1, \ldots, x_N]$ and an output $y$, we have $y_1, \ldots, y_N \sim N(0, \Sigma)$, where $\Sigma_{ij} = \mathrm{cov}(y_i, y_j) = C(x_i, x_j)$ gives the covariance between the output points corresponding to the input points $x_i$ and $x_j$. Thus, the mean $\mu(x)$ and the covariance function $C(x_i, x_j)$ fully specify the Gaussian process. The covariance function $C(x_i, x_j)$ can be interpreted as a measure of the distance between the input points $x_i$ and $x_j$. For systems modelling it is usually composed of two main parts:
$$C(x_i, x_j) = C_f(x_i, x_j) + C_n(x_i, x_j) \tag{4}$$
where $C_f$ represents the functional part and describes the unknown system we are modelling and $C_n$ represents the noise part and describes the model of the noise. A common choice for the functional part of the covariance function is the square exponential covariance function
$$C_f(x_i, x_j) = v \exp\left[-\frac{1}{2} \sum_{d=1}^{D} w_d (x_{id} - x_{jd})^2\right] \tag{5}$$
where $v$ and $w_d$ are the 'hyperparameters' of the covariance function and $D$ is the input dimension. The hyperparameter $v$ controls the magnitude of the covariance, and the hyperparameters $w_d$ represent the relative importance of each component $x_d$ of the input vector $x$. The square exponential covariance function represents
the smooth and continuous functional part. Some other possible choices for the covariance functions [9] are the Matérn class of covariance functions, exponential, polynomial, rational quadratic, periodic or any other functions having the property of generating a positive, semi-definite, covariance matrix. A common choice for the noise part of the covariance function is the constant covariance function
$$C_n(x_i, x_j) = v_0 \delta_{ij}, \tag{6}$$
with the Kronecker delta $\delta_{ij}$, which is used when the noise is presumed to be white noise. Consider the system
$$y = f(x) + \epsilon \tag{7}$$
with the white Gaussian noise $\epsilon \sim N(0, v_0)$, with the variance $v_0$ and the vector of regressors $x$ from the operating space $R^D$. We put the GP prior with the covariance function (5) with unknown hyperparameters on the space of functions $f(.)$ and the covariance function (6) for the noise part. Within this framework we have $[y_1, \ldots, y_N]^T \sim N(0, K)$ with $K = \Sigma + v_0 I$, where $\Sigma$ is the covariance matrix for the noise-free system (1) and $I$ is the $N \times N$ identity matrix. When assuming a different kind of noise the covariance function should be changed appropriately (e.g., [2] and [7]). Based on the data $(X, y)$, and given a new input vector $x^*$, we wish to find the predictive distribution of the corresponding output $y^*$. For a new test input $x^*$, the predictive distribution of the corresponding output is $y^* | (X, y), x^*$ and is Gaussian, with a mean and variance [9]
$$\mu(x^*) = \mathbf{k}(x^*)^T K^{-1} y, \tag{8}$$
$$\sigma^2(x^*) = k(x^*) - \mathbf{k}(x^*)^T K^{-1} \mathbf{k}(x^*), \tag{9}$$
where $\mathbf{k}(x^*) = [C(x_1, x^*), \ldots, C(x_N, x^*)]^T$ is the $N \times 1$ vector of covariances between the test and training cases, and $k(x^*) = C(x^*, x^*)$ is the covariance between the test input and itself. As can be seen from the presented relations, the obtained model not only describes the dynamic characteristics of a nonlinear system, but also provides information about the confidence in these predictions by means of a prediction variance. The Gaussian process can highlight areas of the input space where the prediction quality is poor, due to the lack of data, by indicating the higher variance around the predicted mean. Unlike other models, there is no model parameter determination as such, within a fixed model structure. With this model, most of the effort consists of finding the parameters of the covariance function. To accurately reflect the correlations present in the training data the hyperparameters of this function must be optimised. This is done with a probabilistic approach to the optimisation of the model. The overall problem of learning unknown parameters from data can be seen to correspond to the predictive distribution $P(y_{N+1} | y_N, X_N, x_{N+1})$ of the new target $y_{N+1}$ given the training data $(y_N, X_N)$ and a new input $x_{N+1}$. In order to realise this posterior distribution, a prior distribution over the hyperparameters
can first be defined $P(\Theta | y_N, X_N)$, followed by the integration of the model over the hyperparameters
$$P(y_{N+1} | y_N, X_N, x_{N+1}) = \int P(y_{N+1} | \Theta, y_N, X_N, x_{N+1})\, P(\Theta | y_N, X_N)\, d\Theta. \tag{10}$$
The computation of such integrals can prove difficult due to the intractable nature of the nonlinear functions. There are two options. The first is to use numerical integration methods, such as the Monte-Carlo approach, but this involves significant computational expense. The second approach is using the Maximum Likelihood optimisation method to maximise the marginal likelihood or evidence. The loss function needs to be maximised or the negative of it minimised. For numerical scaling purposes the log of the marginal likelihood is taken as:
$$L(\Theta) = -\frac{1}{2} \log(|K_N|) - \frac{1}{2} y_N^T K_N^{-1} y_N - \frac{N}{2} \log(2\pi). \tag{11}$$
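The covariance function (5), the predictive mean and variance (8) and (9) and the log marginal likelihood (11) can be put together in a few lines. The following is a minimal pure-Python sketch, not the authors' implementation: the function names are hypothetical, the noise variance $v_0$ is added only on the diagonal of $K$ (consistent with $K = \Sigma + v_0 I$), and a Cholesky factorisation is used as one common way to apply $K^{-1}$ and to obtain $\log|K|$.

```python
import math

def sq_exp_cov(xi, xj, v, w):
    """Squared exponential covariance (5) with hyperparameters v and w_d."""
    return v * math.exp(-0.5 * sum(wd * (a - b) ** 2 for wd, a, b in zip(w, xi, xj)))

def cholesky(K):
    """Lower-triangular L with L L^T = K (K symmetric positive definite)."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def chol_solve(L, b):
    """Solve (L L^T) x = b by forward and backward substitution."""
    n = len(L)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_fit(X, y, v, w, v0):
    """Build K = Sigma + v0 I and precompute K^{-1} y."""
    n = len(X)
    K = [[sq_exp_cov(X[i], X[j], v, w) + (v0 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    return L, chol_solve(L, y)

def gp_predict(x_star, X, L, alpha, v, w, v0):
    """Predictive mean (8) and variance (9) at the test input x_star."""
    k_vec = [sq_exp_cov(xi, x_star, v, w) for xi in X]
    mean = sum(ki * ai for ki, ai in zip(k_vec, alpha))
    k_self = sq_exp_cov(x_star, x_star, v, w) + v0
    var = k_self - sum(ki * si for ki, si in zip(k_vec, chol_solve(L, k_vec)))
    return mean, var

def log_marginal_likelihood(y, L, alpha):
    """Log marginal likelihood (11); log|K| is twice the sum of log diag(L)."""
    n = len(y)
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    data_fit = sum(yi * ai for yi, ai in zip(y, alpha))
    return -0.5 * log_det - 0.5 * data_fit - 0.5 * n * math.log(2.0 * math.pi)
```

A hyperparameter search would call `gp_fit` and `log_marginal_likelihood` repeatedly for different values of `v`, `w` and `v0`; the optimiser itself is left out of this sketch.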
Gaussian processes can, like neural networks, be used to model static nonlinearities and can therefore be used for the modelling of dynamic systems [1], [4], [5] if lagged samples of the input and output signals are fed back and used as regressors. As already mentioned in Sec. 2, depending on the way the regressors are selected the model structure is selected to be series-parallel or parallel. In the same section it is mentioned that the configuration for learning is not necessarily the same as the one for validating the model of the dynamic system. The most important phase of modelling is the validation. The cross-validation of the obtained dynamic model on the validation data, i.e., samples from the input signal that were not used for the model identification, is a standard procedure for the validation of dynamic system models and the obtained model response is usually evaluated by performance measures. Besides commonly used performance measures, such as, e.g., the mean relative square error (MRSE), which compares only the mean prediction of the model to the output of the process:
$$\mathrm{MRSE} = \frac{\sum_{i=1}^{N} e(i)^2}{\sum_{i=1}^{N} y(i)^2} \tag{12}$$
where $y(i)$ and $e(i) = \hat{y}(i) - y(i)$ are the system's output and the prediction or simulation error in the $i$-th step of the model prediction or simulation, performance measures such as the log predictive density error (LPD, [9,1,4]) can be used for evaluating GP models. It takes into account not only the mean prediction but the entire predicted distribution:
$$\mathrm{LPD} = \frac{1}{2} \log(2\pi) + \frac{1}{2N} \sum_{i=1}^{N} \left( \log(\sigma^2(i)) + \frac{e(i)^2}{\sigma^2(i)} \right) \tag{13}$$
where $\sigma^2(i)$ is the prediction or simulation variance in the $i$-th step of the prediction or simulation. The performance measure LPD weights the error $e(i)$ more heavily when it is accompanied by a smaller predicted variance $\sigma^2(i)$, thus penalising overconfident predictions more than acknowledged bad predictions, indicated by a higher variance. Another possible performance measure is the negative log-likelihood of the training data (LL) that is given in Eq. (11). LL is a measure inherent to the hyperparameter optimisation process and gives the likelihood that the training data is generated by a given, i.e., trained, model. Therefore, it is applicable for a validation on identification data only. The smaller the MRSE, LPD and LL are, the better the model is. Optimisation algorithms for the series-parallel or NARX model (2) and for the parallel or NOE model (3) are as follows.
Optimisation algorithm for the NARX model:
  set input, target, covariance function, initial hyperparameters
  repeat
    calculate one-step-ahead prediction
    calculate -LL and its derivative based on prediction error
    change hyperparameters
  until -LL is minimal
Optimisation algorithm for the NOE model:
  set input, target, covariance function, initial hyperparameters
  repeat
    calculate the simulation response
    calculate -LL and its derivative based on simulation error
    change hyperparameters
  until -LL is minimal
The difference between considering the prediction and simulation error has a major impact on the complexity of the optimisation. The difference in the optimisation for the NARX and NOE model is illustrated with an example where two loss functions are calculated. The input and output data are obtained from a selected dynamic Gaussian process model of the first order. The input regressors are delayed samples of the output and input values.
Since there are two input regressors and the Gaussian process model uses squared exponential and constant covariance functions, there are four hyperparameters in this process. We fix the two hyperparameters that control the variances $v$ and $v_0$ and calculate the surfaces of the loss functions for the NARX and NOE models when the two remaining hyperparameters controlling the two input regressors are varied. The loss function is the negative logarithm of the marginal likelihood in both cases, but in the case of the NARX model the one-step prediction error is used for the calculation of the loss function and in the case of the NOE model the simulation error is used for the calculation of the loss function. The results can be seen in Fig. 2. Fig. 2 shows that not only the loss-function surfaces, but also the optimal parameters for the NARX and NOE models, are significantly different. It is apparent that the optimisation of the model parameters for the prediction is a much easier task than the optimisation of the model parameters for the simulation.
Fig. 2. NARX (left) and NOE (right) model loss functions (top) and their contour plots (bottom) with black crosses at the positions of minima
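The performance measures (12) and (13) can be computed directly from the measured outputs, the predicted means and the predicted variances. The sketch below follows the two formulas as written above; the function names `mrse` and `lpd` and the sample numbers are assumptions used only for illustration.

```python
import math

def mrse(y, y_hat):
    """Mean relative square error (12): sum of squared errors over sum of squared outputs."""
    num = sum((yh - yi) ** 2 for yi, yh in zip(y, y_hat))
    den = sum(yi ** 2 for yi in y)
    return num / den

def lpd(y, y_hat, var):
    """Log predictive density error (13) from predicted means and variances."""
    n = len(y)
    total = sum(math.log(v) + (yh - yi) ** 2 / v for yi, yh, v in zip(y, y_hat, var))
    return 0.5 * math.log(2.0 * math.pi) + total / (2.0 * n)

# The same error of 1.0 is penalised far more when the predicted variance is small.
overconfident = lpd([0.0], [1.0], [0.01])
cautious = lpd([0.0], [1.0], [1.0])
```

The last two lines illustrate the property described in the text: LPD penalises an overconfident wrong prediction more than a wrong prediction that comes with a larger variance.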
4 Illustrative Case Studies
The modelling of series-parallel or NARX and parallel or NOE models of a dynamic system is illustrated in this section with three cases that differ in terms of disturbance noise. The system (described in detail in [3]) consists of an acid stream ($Q_1$), a buffer stream ($Q_2$) and a base stream ($Q_3$) that are mixed in a tank ($T_1$). Prior to mixing, the acid stream enters another tank ($T_2$). The acid and buffer flow rates are assumed to be constant. The effluent pH is the measured variable, which is controlled by manipulating the base flow rate. In [3], a dynamic model of the system is derived, which has the state, input and output variables:
$$x = [W_{a4} \ W_{b4} \ h_1]^T, \quad u = Q_3, \quad y = \mathrm{pH}, \tag{14}$$
where $W_{a4}$ and $W_{b4}$ are the effluent reaction invariants, and $h_1$ is the liquid level in tank $T_1$. The obtained state-space model has the form:
$$\dot{x} = f(x) + g(x)u, \quad c(x, y) = 0 \tag{15}$$
The identification and the validation of the GP model of the pH-maintaining system are based on simulation data, generated with the model (15), where the liquid level $h_1$ in the tank $T_1$ is assumed to be constant. Thus, it is presumed that a controller has already been designed to keep the level $h_1$ at the nominal value $h_1^* = 14$ cm by manipulating the exit flow rate $Q_4$. Three different cases, in which the output of the dynamic system is disturbed with white, normally distributed noise with standard deviations of 0.07, 0.35 and 0.7, are shown. Based on the noisy responses obtained with the model (15), a sampling time of 25 s was selected. The identification signal was generated from a uniform random distribution with a rate of change of the signal of 500 s. The validation signal was obtained using a generator of random noise with a uniform distribution and a rate of change of the signal of 500 s.

The conjugate gradients deterministic optimisation method is used to optimise the hyperparameters of the 4th-order NARX model. The simulation response obtained with the Markov chain Monte-Carlo simulation on the validation input signal for the NARX model, which was optimised for the prediction, is depicted in Fig. 3. Validation measures for the simulation response of the NARX model are given in Table 1.

Due to the peculiarities of the loss function surface, the optimisation of the NOE model is first carried out with the stochastic 'differential evolution' method, the result of which provides the initial values for the deterministic conjugate gradients method. Any other selection of optimisation methods should also fulfil the optimisation task. The simulation response obtained with the Markov chain Monte-Carlo simulation on the validation input signal for the NOE model, which was optimised for the simulation, is depicted in Fig. 3. The validation measures for the simulation response of the NOE model are given in Table 1. It is clear from Table 1 that NOE model learning is the winning method when the noise has larger magnitudes, which is to be expected.

Fig. 3. Simulation on validation input signal (top) with error and confidence band (bottom) for the NARX model (left) and for the NOE model (right) disturbed with white noise with a standard deviation of 0.07
[Panels: "GP NARX model simulation" (left) and "GP NOE model simulation" (right). Top: y against t, showing μ ± 2σ, μ and the system response. Bottom: e against t, showing 2σ and |e|.]

Table 1. Validation measures for the simulation response of the NARX (left) and NOE (right) models for the validation input signal for three different cases of noise disturbance. The numbers in bold are the winning models.

                  NARX                     NOE
Noise std. dev.   0.07    0.35    0.7      0.07    0.35    0.7
MRSE              0.028   0.077   0.149    0.058   0.069   0.143
LPD               -0.76   0.44    1.13     -0.58   0.37    1.09
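To make the behaviour of the LPD measure reported in Table 1 concrete, the following Python sketch evaluates it for one and the same error with two different predicted variances. It assumes the commonly used form LPD = ½ log(2π) + 1/(2N) Σ (log σ²(i) + e(i)²/σ²(i)); the defining equation lies outside this section, so both this form and the function name `log_predictive_density_error` are assumptions rather than the authors' exact definition.

```python
import math


def log_predictive_density_error(errors, variances):
    """LPD in its commonly used form (an assumption, see the text above)."""
    n = len(errors)
    total = sum(math.log(v) + e ** 2 / v for e, v in zip(errors, variances))
    return 0.5 * math.log(2.0 * math.pi) + total / (2.0 * n)


# The same error of 1.0, once with an overconfident (small) variance
# and once with an honest (large) variance. Values are made up.
lpd_overconfident = log_predictive_density_error([1.0], [0.01])
lpd_honest = log_predictive_density_error([1.0], [4.0])
```

Under this assumed form, `lpd_overconfident` is much larger than `lpd_honest`, in line with the statement that overconfident predictions are penalised more than acknowledged bad ones.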
5
Conclusion
This paper describes the differences between two possible model structures for modelling dynamic systems from noisy data. Two approaches are possible: training for prediction and training for simulation. The obtained dynamic system model is, according to its use, often employed for simulation purposes. Consequently, it would make sense to optimise the model for simulation. However, the optimisation for simulation proved to be a much more difficult task than the optimisation for prediction, so the user needs to decide on a trade-off between accuracy and optimisation effort. In this paper the problem is illustrated for the case of the Gaussian process modelling of dynamic systems. Gaussian process models are nonparametric, probabilistic models. The method has attracted interest in the machine-learning community and, recently, also in the engineering community, for the modelling of dynamic systems or dynamic systems identification. The differences between the optimisation of the series-parallel or NARX GP model and the parallel or NOE GP model, which occur when noise is present in the measurements, have been described and illustrated in the paper. The NOE GP model optimisation again proved to be the more difficult task, but one needs to be aware that this depends on the selected number of model regressors.
References
1. Ažman, K., Kocijan, J.: Application of Gaussian processes for black-box modelling of biosystems. ISA Transactions 46, 443–457 (2007)
2. Gibbs, M.N.: Bayesian Gaussian Processes for Regression and Classification. Ph.D. thesis, Cambridge University (1997)
3. Henson, M.A., Seborg, D.E.: Adaptive nonlinear control of a pH neutralization process. IEEE Transactions on Control Systems Technology 2, 169–183 (1994)
4. Kocijan, J., Girard, A., Banko, B., Murray-Smith, R.: Dynamic systems identification with Gaussian processes. Mathematical and Computer Modelling of Dynamic Systems 11(4), 411–424 (2005)
5. Kocijan, J., Likar, B.: Gas-liquid separator modelling and simulation with Gaussian-process models. Simulation Modelling Practice and Theory 16(8), 910–922 (2008)
6. Ljung, L.: System identification – theory for the user, 2nd edn. Prentice Hall, New Jersey (1999)
7. Murray-Smith, R., Girard, A.: Gaussian Process priors with ARMA noise models. In: Proceedings of Irish Signals and Systems Conference, Maynooth, pp. 147–152 (2001)
8. Nelles, O.: Nonlinear System Identification. Springer, Heidelberg (2001)
9. Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. MIT Press, Cambridge (2006)
10. Sjoeberg, J., Zhang, Q., Ljung, L., Benveniste, A., Delyon, B., Glorennec, P.-Y., Hjalmarsson, H., Juditsky, A.: Nonlinear black-box modelling in system identification: A unified overview. Automatica 31(12), 1691–1724 (1995)
Learning Readers’ News Preferences with Support Vector Machines
Elena Hensinger, Ilias Flaounas, and Nello Cristianini
Intelligent Systems Laboratory, University of Bristol, UK
{elena.hensinger,ilias.flaounas,nello.cristianini}@bristol.ac.uk
Abstract. We explore the problem of learning and predicting popularity of articles from online news media. The only available information we exploit is the textual content of the articles and the information whether they became popular – by users clicking on them – or not. First we show that this problem cannot be solved satisfactorily in a naive way by modelling it as a binary classification problem. Next, we cast this problem as a ranking task of pairs of popular and non-popular articles and show that this approach can reach accuracy of up to 76%. Finally we show that prediction performance can improve if more content-based features are used. For all experiments, Support Vector Machines approaches are used.
Keywords: Pattern recognition, Data mining, Applications, Machine learning.
1
Introduction
With the rise of the internet, an ever-growing amount of online news is available to web users. News articles therefore have to compete for the limited resources of readers: their time and attention. Users’ decisions about what to read are based on multiple types of article information, including its position on the web page, timing, images or additional media, but mainly its text, especially the title and short introductory description of the content. Popular news items are therefore those that succeed in getting users to click on them. Knowing in advance which articles have the potential to become popular could give news outlets valuable information and essentially a competitive advantage.

In this study we address the question: Can we model and predict news popularity? We create a realistic data setting by using only the title and short description of an article, which is exactly the text snippet that is presented to readers before they decide whether or not to click on a given link. We collect this data via Really Simple Syndication (RSS) feeds from nine English-speaking news outlets with an online presence, including broadcast media, news aggregators and newspapers. For each outlet, we analyse data from two specific feeds: the “Top stories” articles, as chosen by the editors, containing the main stories of the day, and the “Most popular”, formed by the click choices of the readers.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 322–331, 2011. © Springer-Verlag Berlin Heidelberg 2011

One naive approach to the data analysis problem at hand would be to model it as a binary classification task, to separate popular articles from
non-popular ones. We apply binary Support Vector Machines (SVMs) in such a setting and show that this approach does not lead to satisfactory predictions. The reason for this is that the notion of “popularity” is not something related to the content of the article per se. It is relative to what other options were available to the reader. Thus, we cast the problem as a learning-to-rank task on pairwise preference data, matched to be from the same outlet and the same day. Both articles appeared on the front page of an outlet, but only one of them became popular. Using just the text of the description and title, we achieve a pairwise popularity prediction accuracy of up to 76%. Finally, we investigate whether annotation of articles based on their general topic can increase prediction accuracy. We achieve some improvement, which opens a new avenue for investigating the effect of including semantic information, such as the presence of celebrities in the news, the sentiment of the content, etc.

Our work is situated in the domain of preference learning, which has developed in two main directions: learning utility functions and learning preference relations [1]. Approaches of the first category try to learn, from seen training data, a function that assigns a utility value to each item, reflecting the items' overall relationships. This domain includes the SVM ranking method used in our study, introduced in [2]. Ranking SVMs have been applied to a range of problems, from search engines that adapt to specific user queries [3] to extraction of descriptive keywords for documents [4].

The question of “what makes news” has typically been addressed in media studies, for example for the topic “Technology” [5], and for news in general [6]. With the availability of online news data, this question is gaining increasing attention from computer scientists [7][8][9].
An understanding of which news items are interesting for which users can be applied in news recommendation, such as in [10]. The clicks on web pages serve, as in our study, as an implicit feedback source from which user interests and preferences can be inferred. These user clicks, and the inferred popularity, can be influenced by more than just the content of an article: the physical position of the content on web pages has an impact [11], as do additional media or information, along with the novelty of data and the time it needs to become outdated [12]. Our assumption is, however, that content is the strongest factor for user clicks, which allows us to focus on it in this study. The rest of this paper is organised as follows: In Sect. 2 we briefly present the methods we apply in our experiments. In Sect. 3 we present the results of applying the methods to our real-world data. Section 4 discusses and analyses the findings.
2
Methods
In this section we first introduce binary Support Vector Machines. Next we show how the task of learning pairwise preference relationships can be transformed into a classification problem, and how this can be solved using Ranking SVM. We
focus on SVMs only, since it has been shown that SVMs outperform other conventional learning methods for the task of text classification [13][14][15][16][17].
2.1
Classification with SVMs
SVMs are a well-studied method for binary classification [18][14]. An SVM realises a maximal margin classifier that distinguishes two data classes by learning a separating hyperplane, described by a linear function $\langle w, x \rangle + b$. A data item $x_i \in \mathbb{R}^n$ is assigned the class label $y = +1$ if $\langle w, x_i \rangle + b \geq 1$, and the negative class label otherwise. For the case that the data is not linearly separable, slack variables $\xi$ are introduced that allow handling of misclassified training examples. The problem of finding the best separating hyperplane for $l$ training items is formulated as a quadratic optimisation problem of the form:

$$\text{minimise}_{\xi, w, b} \quad \langle w, w \rangle + C \sum_{i=1}^{l} \xi_i^2 \qquad (1)$$
$$\text{subject to} \quad y_i (\langle w, x_i \rangle + b) \geq 1 - \xi_i, \quad i = 1, \ldots, l, \qquad (2)$$
$$\xi_i \geq 0, \quad i = 1, \ldots, l. \qquad (3)$$

The resulting values for $w$ and $b$ allow the class label of a data item $x_m$ to be predicted via $\langle w, x_m \rangle + b$. Depending on the problem under study, the kernel trick can be used to improve performance for classes that cannot be separated by a linear hyperplane [19][20]. Examples of popular kernels include the Gaussian, Polynomial and Sigmoid kernels. However, in the case of text classification the Linear kernel achieves the best results, because the high dimensionality of the data makes it linearly separable.
2.2
Ranking Pairs with SVMs
The task is to learn the ranking of items $x \in \mathbb{R}^n$, based only on partial information about the preference relationship between pairs of items $(x_i, x_j)$. We denote that item $x_i$ is preferred to $x_j$ by $x_i \succ x_j$, and assume a linear utility function $u : \mathbb{R}^n \to \mathbb{R}$ of the form $\langle w, x \rangle + b$ that can capture this preference relationship. That is:

$$x_i \succ x_j \iff u(x_i) > u(x_j) \qquad (4)$$

Replacing $u(x)$ with the linear function leads to:

$$\langle w, x_i \rangle + b > \langle w, x_j \rangle + b \qquad (5)$$
$$\iff \langle w, x_i \rangle + b - (\langle w, x_j \rangle + b) > 0 \qquad (6)$$
$$\iff \langle w, (x_i - x_j) \rangle > 0 \qquad (7)$$

Learning the relationship between two items $x_i$ and $x_j$ can now be expressed as a binary classification problem on the vector of their difference $x_{(i,j)} = x_i - x_j$. We assign the class label $y_{(i,j)} = +1$ if $\langle w, x_{(i,j)} \rangle \geq 0$, and $y_{(i,j)} = -1$ otherwise.
The SVM formulation has been extended to learn a ranking of items from pairwise preference relationships through the Ranking SVM in [2]: as in binary SVMs, slack variables $\xi_{(i,j)}$ make it possible to deal with linearly non-separable data, and the entire task for $l$ training item pairs of the form $x_{(i,j)}$ is expressed as the quadratic optimisation problem:

$$\text{minimise}_{\xi, w} \quad \langle w, w \rangle + C \sum_{\text{all pairs } x_{(i,j)}} \xi_{(i,j)} \qquad (8)$$
$$\text{subject to} \quad y_{(i,j)} \langle w, x_{(i,j)} \rangle \geq 1 - \xi_{(i,j)}, \qquad (9)$$
$$\xi_{(i,j)} \geq 0 \quad \forall \text{ pairs } x_{(i,j)}. \qquad (10)$$

The solution is a weight vector $w$ that can be used to predict the preference relationship between two items $x_i$ and $x_j$, as well as to assign a rank, i.e. to compute the utility value for an item $x_i$ via $u(x_i) = \langle w, x_i \rangle$.
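The reduction from ranking to classification on difference vectors can be sketched in plain Python. This is a minimal illustration, not the solver used in the paper: it replaces the quadratic program by simple subgradient descent on the equivalent hinge-loss objective, and the toy feature vectors, the mirrored negative pairs and all function names are assumptions made for the example.

```python
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def difference_pairs(preferred, other):
    """Difference vectors x_(i,j) = x_i - x_j with label +1, plus mirrored pairs."""
    pairs = []
    for xi in preferred:
        for xj in other:
            d = [a - b for a, b in zip(xi, xj)]
            pairs.append((d, +1))
            pairs.append(([-v for v in d], -1))  # mirrored pair (assumption)
    return pairs


def train_ranking_svm(pairs, C=1.0, epochs=200, learning_rate=0.01, seed=0):
    """Subgradient descent on <w, w> + C * sum(max(0, 1 - y * <w, x>))."""
    pairs = list(pairs)                  # do not reorder the caller's list
    rng = random.Random(seed)
    w = [0.0] * len(pairs[0][0])
    for _ in range(epochs):
        rng.shuffle(pairs)
        for x, y in pairs:
            # Regulariser gradient, shared evenly among the pairs.
            grad = [2.0 * wi / len(pairs) for wi in w]
            if y * dot(w, x) < 1.0:      # margin violated: hinge term is active
                grad = [g - C * y * xi for g, xi in zip(grad, x)]
            w = [wi - learning_rate * g for wi, g in zip(w, grad)]
    return w


# Toy items with two hypothetical features; the first one drives preference.
preferred = [[1.0, 0.2], [0.9, 0.1]]
other = [[0.1, 0.3], [0.2, 0.9]]

w = train_ranking_svm(difference_pairs(preferred, other))


def utility(x):
    """u(x) = <w, x>, the rank score of a single item."""
    return dot(w, x)
```

Once `w` is learnt, a single call to `utility` scores an individual item, so a full ranking follows from sorting items by their utility even though training only ever saw pairs.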
3
Experiments
3.1
Datasets
Our data cover a 16-week interval between 29 June 2009 and 18 October 2009 for nine mainstream news outlets of a variety of media types, namely broadcasting media (“CBS News”, “CNN”), newspapers (“The New York Times”, “Florida Times Union”, “The Seattle Post-Intelligencer”, “The Seattle Times”), the magazine “Time”, the newswire “Reuters” and the news aggregator “Yahoo! News”. The outlets we track offer their content online in RSS feed format. A feed contains articles in a structured format. Each article comprises its publication date, a title and a summary of the full text of the article. In this study, we focus on English-language news only, which we collect on a daily basis. Detailed information on the processing system is available in [21].

Outlets offer their content organised into different feeds, such as “World News”, “Business” or “Sports”. In order to analyse outlets, one needs to capture all articles published in all feeds over the desired time period, coping with a variety of feed names and changing categories across outlets. To ensure comparable datasets for our study, we focus only on a small and well-defined subset of the total content available, collecting two specific feeds: the “Top stories” feed and the “Most popular” feed. The first one carries articles that appear on the main page of an outlet. The second offers a list of articles that the audience of the outlet found interesting, as shown by clicking on them in order to read them. The exact methodology used to mark a story as “popular” is defined internally by each outlet.

To learn what makes an article popular, we create preference data sets that contrast articles that occurred on the main page of an outlet and also became popular with articles that were published on the same day, in the same outlet and on the same front page, but did not rise in popularity. Given the same starting conditions per article, we want to explore what makes up their difference in popularity. Therefore, each
article pair captures the relative relationship “more popular than”, rather than providing information on a universal popularity value. For Ranking SVM, we created difference vectors for all article pairs. For our experiments we apply, to each article title and summary, standard text mining pre-processing techniques, namely stop word removal, stemming [22], and transformation into the bag-of-words (TF-IDF) space [23]. The entire vocabulary of 179239 words is used as features and no further feature selection is applied.
3.2
Experiment 1: Predicting Popularity with Binary SVMs
We conducted an initial experiment to confirm our assumption that article popularity cannot simply be captured by treating the problem as a binary classification task. We created ten training and testing sets, each consisting of six consecutive weeks and the one following week, respectively, for the time under observation. We considered articles that became popular as the positive class, and the others as the negative class. Each training set contained on average 1685 articles per outlet. The two classes have different numbers of samples: only a few articles from the main page also become popular, on average as few as 1.6 per day, compared to 31 non-popular ones. We measured class prediction performance by accuracy, and thus we used an equal number of positive and negative items in the testing sets: for each positive example, one negative example from the same day and outlet was chosen randomly, resulting in an average of 119 test items per dataset. The experiment was conducted using the SVMlight implementation of the binary SVM classifier [24]. The parameter C in the SVM has to be adjusted empirically since it is data-dependent. We tested five different values, namely C = {0.1, 0.5, 1, 5, 10}, and we report the best results per outlet. Figure 1 shows the averaged accuracies per outlet for predicting the class of articles: “popular” or “non-popular”. Most values reside around 50%, with 55.79% as the average. Only for two outlets can popular items be predicted satisfactorily, with marginally more than 60% accuracy. In this setting, the binary SVMs are based on the assumption that there exists a global “popularity” pattern in the data to be learnt, which can be observed universally for all articles. The experimental results, however, indicate that this is not the case, which motivates exploring a pairwise and relative popularity measure.
3.3
Experiment 2: Predicting Popularity with Ranking SVMs
As mentioned before, the information carried in each news item pair, and in the corresponding difference vector, is the “better than” relationship. It is a relative measure which reflects the reality that on different days, articles of very different quality can be present: some days might bring not very exciting news overall, others several sensational stories. In both cases, the readers will make some articles become more popular than others. On an absolute scale, these articles will be incomparable, but their popularity might be modelled on a relative scale.

Fig. 1. Averaged class prediction accuracy and standard deviation of separating “popular” and “non-popular” articles for binary SVMs. It confirms our assumption that the “popularity” concept cannot be captured satisfactorily as an absolute pattern that resides in the content of articles.

Based on this assumption, we conducted experiments with Ranking SVM, on the same set of outlets and time intervals as for the binary SVM case. The training and testing data consisted of difference vectors of pairwise items, with one popular and one non-popular article, both published on the main page of the same outlet on the same day. As before, the training sets contained six weeks of data, with 13427 pairs on average over all outlets and datasets. Testing data contained the one following week of data, with 2264 pairs on average. The increased size of the data, compared to the binary SVM case, is due to the fact that we paired each positive article with all possible negative articles that were published in the same outlet on the same day. We used the SVMrank implementation [2] and tried different settings for its parameter C, namely {1, 5, 10, 20}, adapting the choice of values to this code and data. We measure pairwise popularity prediction accuracy, i.e. the accuracy for correctly predicting the orientation of testing pairs, each consisting of one popular and one non-popular item. We report the best resulting accuracies over the different C values tested, per outlet, averaged over all used datasets, in Fig. 2. In seven out of nine cases the average accuracies were above 60%, and in three cases better than 70%. These results confirm our assumption that popularity can be better modelled as a relative concept between articles, rather than an absolute one.
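The pairing scheme and the pairwise accuracy measure described above can be sketched as follows. The article records, the field names and the `score` value (which stands in for a learnt utility ⟨w, x⟩) are hypothetical and serve only to illustrate the bookkeeping.

```python
from collections import defaultdict


def build_pairs(articles):
    """Pair every popular article with every non-popular one
    from the same outlet and the same day."""
    groups = defaultdict(lambda: {"popular": [], "other": []})
    for article in articles:
        key = (article["outlet"], article["day"])
        side = "popular" if article["popular"] else "other"
        groups[key][side].append(article)
    pairs = []
    for group in groups.values():
        for pop in group["popular"]:
            for non in group["other"]:
                pairs.append((pop, non))
    return pairs


def pairwise_accuracy(pairs, utility):
    """Fraction of pairs in which the popular article gets the higher utility."""
    correct = sum(1 for pop, non in pairs if utility(pop) > utility(non))
    return correct / len(pairs)


# Hypothetical articles; "score" stands in for a learnt utility value.
articles = [
    {"outlet": "A", "day": 1, "popular": True, "score": 0.9},
    {"outlet": "A", "day": 1, "popular": False, "score": 0.4},
    {"outlet": "A", "day": 1, "popular": False, "score": 1.2},
    {"outlet": "B", "day": 1, "popular": True, "score": 0.7},
    {"outlet": "B", "day": 1, "popular": False, "score": 0.1},
    {"outlet": "A", "day": 2, "popular": False, "score": 0.3},
]

pairs = build_pairs(articles)
accuracy = pairwise_accuracy(pairs, lambda a: a["score"])
```

Because every popular article is paired with all non-popular articles of its outlet and day, the number of pairs grows much faster than the number of articles, which is consistent with the larger data sizes reported for this experiment.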
Fig. 2. Averaged prediction accuracy and standard deviation for pairs of “popular” and “non-popular” articles using Ranking SVMs. In seven out of nine cases the average accuracies were above 60%.
3.4
Experiment 3: Exploring Tags as Popularity Features
For the final experiment we wanted to explore whether and how prediction accuracy can be improved over using only words as features by incorporating features based on topic tags. Tags were assigned automatically to an article in two different ways: a) through the annotation of the RSS feeds, as advertised on the outlet web pages, and b) via topic classifiers. For the first category, we used 24 tags, namely World news, Local news, Business, Politics, Science, Technology, Entertainment, Space, Government, Nuclear weapons, Religion, Terrorism, Sport, Events, Physics, Accident and disasters, Biology, Life sciences, Maths, Chemistry, War and conflict, Crime, Environment, Health. If an article appeared in the “Main” feed and in one or more of the specialised feeds above, we annotated it accordingly. For the second category we used SVMs to categorise articles according to their topic. We recognise 14 topics using the methodology described in [25]: Accident and disasters, Business, Crime, Environmental issues, General science, Politics, ScienceEnvironment, Health, Life sciences, Physics, Space, Technology, US-elections, War and conflict. Results for the learning-to-rank task using only words as features and using the combination of words and tags are illustrated in Fig. 3. The combination of tags and words as features leads to an improvement of pairwise preference prediction accuracy in most cases, in the best case, for “CNN”, by 19.66%. Overall, we achieve a maximum average accuracy of 83.95% for “New York Times”. A confident conclusion on how much the addition of tags can contribute to popularity prediction cannot yet be drawn. It is worth noting that using only the topic tags as features does not lead to reliable prediction of the popular articles.

Fig. 3. Preference pair prediction accuracy and standard deviation by Ranking SVMs, for different sets of features. For most outlets, word-based features combined with tag-based ones can improve the overall prediction accuracy. However, the error bars do not allow for a definite conclusion on the impact of these additional features.
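One simple way to combine word-based and tag-based features is to append tag indicators to the word vector before the difference vectors are formed. The text does not state how the tags were encoded, so the binary indicator encoding, the function name `add_tag_features` and the numbers below are assumptions made purely for illustration.

```python
def add_tag_features(word_vector, article_tags, tag_vocabulary):
    """Append one binary indicator per known tag to a word-feature vector.
    The binary encoding is an assumption, not taken from the paper."""
    indicators = [1.0 if tag in article_tags else 0.0 for tag in tag_vocabulary]
    return list(word_vector) + indicators


# A small subset of the tags named above, used as a hypothetical vocabulary.
tag_vocabulary = ["Business", "Politics", "Technology", "Health"]

word_vector = [0.0, 0.31, 0.0, 0.12]      # made-up TF-IDF weights
combined = add_tag_features(word_vector, {"Politics", "Health"}, tag_vocabulary)
```

The combined vector keeps the original word features unchanged, so a ranking model trained on it can fall back on the words alone when an article carries no tags.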
4
Discussion
In this work, we reported on an exploration of the concept of news article popularity. We created a framework that can learn what is popular and managed to separate popular from non-popular articles. This was achieved by modelling the problem as a pairwise ranking task which captures the “better than” relationship between articles. We achieve accuracies ranging from 51.8% to 76.5% among nine different media, using as features only the content, i.e. the words in the articles. These accuracies could be further improved in several cases by introducing topic tags as an extra set of features, leading to up to 84% accuracy. On the other hand, we found that treating popular and non-popular stories as two distinct classes and trying to separate them under a binary classification framework does not lead to satisfactory performance. Therefore, we treat popularity as a relative concept among articles published in the same period of time. Future work includes the investigation of different semantic information, such as the presence of celebrities in the articles, the positive/negative sentiment of the words used in articles, the presence of controversial topics, etc. Other factors, such
as novelty of stories or number of accompanying pictures, can be used as extra features.

Acknowledgements. I. Flaounas is supported by the A. S. Onassis Public Benefit Foundation; N. Cristianini is supported by a Royal Society Wolfson Merit Award. This research was partly supported by the “University of Bristol Bridging the Gaps Cross-Disciplinary Feasibility Account” (EP/H024786/1). Group activities are supported by PASCAL2 Network of Excellence.
References
1. Fürnkranz, J., Hüllermeier, E.: Preference learning: An introduction. In: Preference Learning. Springer, Heidelberg (2010)
2. Joachims, T.: Optimizing search engines using clickthrough data. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 133–142. ACM, New York (2002)
3. Joachims, T., Radlinski, F.: Search engines that learn from implicit feedback. IEEE Computer 40(8), 34–40 (2007)
4. Jiang, X., Hu, Y., Li, H.: A ranking approach to keyphrase extraction. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009), pp. 756–757. ACM, New York (2009)
5. Center, P.R.: When technology makes headlines: The media’s double vision about the digital age. Technical report, Pew Research Center’s Project for Excellence in Journalism (2010)
6. Gans, H.J.: Deciding What’s News: A Study of CBS Evening News, NBC Nightly News, Newsweek, and Time, 25th anniversary edn. Northwestern University Press (2004)
7. Steinberger, R., Pouliquen, B., Van der Goot, E.: An introduction to the Europe Media Monitor family of applications. In: Information Access in a Multilingual World - Proceedings of the SIGIR 2009 Workshop (SIGIR-CLIR 2009), pp. 1–8 (2009)
8. Bautin, M., Ward, C., Patil, A., Skiena, S.: Access: News and blog analysis for the social sciences. In: Proceedings of the 19th International Conference on World Wide Web (WWW), pp. 1229–1232 (2010)
9. Flaounas, I., Turchi, M., Ali, O., Fyson, N., De Bie, T., Mosdell, N., Lewis, J., Cristianini, N.: The structure of EU mediasphere. PLoS ONE 5, e14243 (2010)
10. Liu, J., Dolan, P., Pedersen, E.R.: Personalized news recommendation based on click behavior. In: Proceedings of the 2010 International Conference on Intelligent User Interfaces (IUI), pp. 31–40 (2010)
11. Wu, F., Huberman, B.A.: Popularity, novelty and attention. In: Proceedings 9th ACM Conference on Electronic Commerce (EC 2008), pp. 240–245 (2008)
12. Szabó, G., Huberman, B.A.: Predicting the popularity of online content. Commun. ACM 53(8), 80–88 (2010)
13. Yang, Y., Liu, X.: A re-examination of text categorization methods. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 42–49. ACM Press, New York (1999)
14. Cristianini, N., Shawe-Taylor, J.: An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge (2000)
15. Dumais, S., Platt, J., Heckerman, D., Sahami, M.: Inductive learning algorithms and representation for text categorization. In: 7th International Conference on Information and Knowledge Management (CIKM), pp. 148–155 (1998)
16. Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34(1), 1–47 (2002)
17. Joachims, T.: Learning to Classify Text Using Support Vector Machines. Kluwer, Dordrecht (2002)
18. Boser, B., Guyon, I., Vapnik, V.: A training algorithm for optimal margin classifiers. In: Proceedings of the 5th Conference on Computational Learning Theory (COLT), pp. 144–152 (1992)
19. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)
20. Scholkopf, B., Smola, A.: Learning with Kernels. MIT Press, Cambridge Mass (2002)
21. Turchi, M., Flaounas, I., Ali, O., De Bie, T., Snowsill, T., Cristianini, N.: Found in translation. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.) ECML PKDD 2009. LNCS, vol. 5782, pp. 746–749. Springer, Heidelberg (2009)
22. Porter, M.F.: An algorithm for suffix stripping. Program 14, 130–137 (1980)
23. Liu, B.: Web Data Mining, Exploring Hyperlinks, Contents, and Usage Data. Springer, Heidelberg (2007)
24. Joachims, T.: Making large-scale SVM learning practical. In: Advances in Kernel Methods: Support Vector Learning, pp. 169–184. MIT Press, Cambridge (1999)
25. Flaounas, I.N., Turchi, M., Cristianini, N.: Detecting macro-patterns in the European mediasphere. In: Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and International Conference on Intelligent Agent Technology - Workshops, pp. 527–530 (2009)
Incorporating a Priori Knowledge from Detractor Points into Support Vector Classification
Marcin Orchel
AGH University of Science and Technology, Mickiewicza Av. 30, 30-059 Kraków, Poland
[email protected]
Abstract. In this article, we extend the idea of a priori knowledge in the form of detractor points presented recently for Support Vector Classification. We show that detractor points can belong to a new type of support vectors: training samples which lie outside a margin bounded region. We present a new application for a priori knowledge from detractor points: improving the generalization performance of Support Vector Classification while reducing the complexity of a model by removing a number of support vectors. The experiments show that the new type of a priori knowledge indeed improves the generalization performance of reduced models. The tests were performed on selected classification data sets, and on stock price data from public domain repositories.
Keywords: Support Vector Machines, a priori knowledge.
1
Introduction
This article is a major extension of the idea of a priori knowledge in the form of detractor points, first presented in [12] and incorporated into Support Vector Classification (SVC). The SVC method belongs to the group of methods called Support Vector Machines, invented by Vapnik [14]. A priori knowledge in machine learning (ML) is defined as additional knowledge for an existing sample set. When it is defined in terms of a particular domain, it is called domain dependent a priori knowledge; otherwise it is domain independent a priori knowledge. An example of the latter for a classification problem is the information about proper classification in knowledge sets (defined sets of points), particularly in continuous areas of an input space. Various types of areas have been investigated recently for SVC: polyhedral sets [3] [2], ellipsoidal sets including spheroidal sets [4], and nonlinear sets [11]. A polyhedral set is defined in the form of a set of linear equations, a spheroidal set with a center and a radius, and an ellipsoidal set with a center and a matrix. Additionally, every set must have a classification value. Of these sets, the spheroidal set generally has the simplest formulation, being defined by a point, a number and a classification value.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 332–341, 2011. © Springer-Verlag Berlin Heidelberg 2011

An important aspect of a priori knowledge is its efficient incorporation into the ML method. There are three methods of incorporation: either modify input data
like a set of features or some input parameters, modify the ML algorithm, or modify the ML method output. For SVC, the second option leads to a modification of the optimization problem, particularly a modification of a kernel function. For example, a priori knowledge in a form of classification of a finite set of points could be directly incorporated by enlarging a training set, the method is called a sample method [7]. Polyhedral sets were incorporated to SVC by modifying the optimization problem – by adding linear constraints [3], although an alternative incorporation scheme was proposed [8]. Domain independent a priori knowledge proposed in this article is based on incorporating lower bounds on distances from particular points, called detractor points to a decision surface. A detractor is a detractor point with a classification value and a number, called a detractor parameter, which is a lower bound on that distance. A detractor can be interpreted as a knowledge hypersphere with variable radius dependent on a decision function. Thus, the one of differences between detractors and other mentioned earlier knowledge sets is that a detractor knowledge sphere is defined dynamically, while the others are defined statically with all parameters known before running the ML method. Additionally, for the case of a soft margin classifier type of the SVC, detractors could be treated as recommendations, which means that influential power of detractors on a decision boundary depends on other factors, here on slack variables. A detractor can belong to the new type of support vectors – training samples which lie outside a margin bounded region. This means that it is possible to include a new type of support vectors in a final decision function. Complexity of a specification of a detractor is similar to a spheroidal knowledge set, since there are only three parameters a vector, a classification value and a number. 
We use a priori knowledge in the form of detractor points in reduced models, which are created by removing a subset of the support vectors. Reduced models were presented for the regression case in [6]. The goal of creating such models is to reduce model complexity while preserving good performance of the classifier. Reduced models are more suitable for further processing, such as testing new samples. As for incorporation into the SVC method, detractors are incorporated by adding detractor points to an input space and modifying the SVC optimization problem by adding special weights to the inequality constraints. There have been multiple attempts to incorporate spheroidal sets [4]. The incorporation of polyhedral sets proposed in [3] is based on defining additional constraints for the SVC optimization problem, and the method needs an optimization library to solve the new subproblems. For detractors, a modification of the Sequential Minimal Optimization (SMO) method [13], which analytically solves two-parameter subproblems, was proposed.
2
Detractors
A detractor for the classification case is defined as a point, called a detractor point, with a classification value and an additional parameter $d$, called a detractor parameter, which is a lower bound on the distance from the detractor point to the decision surface, measured in functional margin units. The incorporation of detractors into SVC consists of two steps: adding a detractor point with a classification value to a training set, and modifying the SVC primal optimization problem. If the training set already contains the detractor point, the first step is skipped. We now take a closer look at the modification of the optimization problem. We use the formulation of the SVC optimization problem with sample weights, investigated for C-SVC in [17][15][5][10] and for ν-SVC in [16]. In this article, we consider incorporating detractors into C-SVC. The 1-norm soft margin SVC optimization problem for samples $a_i$ with sample weights $C_i$ is

OP 1. Minimization of:
$$f(w, b, \xi) = \frac{1}{2}\|w\|^2 + C \cdot \xi$$
with constraints: $y_i g(a_i) \ge 1 - \xi_i$, $\xi \ge 0$ for $i \in \{1..n\}$, where $C \gg 0$, $g(a_i) = w \cdot a_i + b$.

The $i$-th sample for which $y_i g^*(a_i) = 1$ is called a margin sample. Margin boundaries are defined as the two hyperplanes $g(x) = -1$ and $g(x) = 1$. We introduce the SVC optimization problem with additional weights $\varphi$, for which $d = 1 + \varphi$:

OP 2. Minimization of:
$$f(w, b, \xi) = \frac{1}{2}\|w\|^2 + C \cdot \xi$$
with constraints: $y_i g(a_i) \ge 1 - \xi_i + \varphi_i$, $\xi \ge 0$ for $i \in \{1..n\}$, where $C \gg 0$, $\varphi \ge 0$, $g(a_i) = w \cdot a_i + b$.

The new weights $\varphi$ are present only in the constraints. When $\varphi = 0$, OP 2 is equivalent to OP 1. The functional margin of a point $p$ is defined as the value $y_p g(p)$. A value $v$ expressed in functional margin units corresponds to the distance $v / \|w\|$. We can see that a detractor parameter is a lower bound on the distance from a detractor sample to the decision boundary, measured in functional margin units: when we omit $\xi_i$ in the constraints for simplicity, we get $y_i g^*(a_i) \ge d_i$, and dividing both sides by $\|w\|$ gives $y_i g^*(a_i) / \|w\| \ge d_i / \|w\|$. When we take $\xi_i$ into account, detractors can be treated as recommendations, and their influence depends on the slack variables.
Note that modifying a detractor parameter does not always lead to a new decision boundary. Assume that we modify only one sample $p$ and that $\varphi_p$ is equal to zero before the modification. When $y_p g^*(p) > 1$, setting $0 < \varphi_p \le y_p g^*(p) - 1$ does not affect the solution. When $\varphi_p > y_p g^*(p) - 1$, the solution will be different, but the decision boundary will not necessarily change. In particular, when the value of $C_p$ is small, setting $\varphi_p > 0$ may only increase the slack variable while the decision boundary remains the same.
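As a small numerical illustration of the constraint in OP 2, the following sketch evaluates the functional margin $y_p g(p)$ of a point, the smallest slack that satisfies the constraint for a given $\varphi_p$, and the geometric distance that corresponds to $d = 1 + \varphi_p$. The hyperplane, the sample and all function names are hypothetical and were chosen only for illustration; they are not taken from the author's implementation.

```python
import math

def functional_margin(w, b, a, y):
    """Return y * g(a) for the linear decision function g(a) = w . a + b."""
    return y * (sum(wi * ai for wi, ai in zip(w, a)) + b)

def required_slack(w, b, a, y, phi):
    """Smallest slack xi >= 0 that satisfies y * g(a) >= 1 - xi + phi."""
    return max(0.0, 1.0 + phi - functional_margin(w, b, a, y))

def distance_lower_bound(w, phi):
    """Detractor parameter d = 1 + phi converted to a geometric distance d / ||w||."""
    return (1.0 + phi) / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane g(x) = 3*x1 + 4*x2 - 5 and a positive sample p.
w, b = [3.0, 4.0], -5.0
p, y_p = [2.0, 1.0], 1

print(functional_margin(w, b, p, y_p))    # 3*2 + 4*1 - 5 = 5.0
print(required_slack(w, b, p, y_p, 2.0))  # phi <= 5 - 1, so no slack is needed: 0.0
print(required_slack(w, b, p, y_p, 6.0))  # phi > 5 - 1, slack 1 + 6 - 5 = 2.0
print(distance_lower_bound(w, 2.0))       # (1 + 2) / 5 = 0.6
```

For this fixed hyperplane, any $\varphi_p$ up to $y_p g(p) - 1 = 4$ is satisfied without slack, which matches the observation that such values do not affect the solution.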
2.1
Interpretation of Detractors as Dynamic Hyperspheres
A detractor sample $p$ can be interpreted as a hypersphere with a radius equal to $\varphi_p$ in functional margin units. It is therefore a dynamic hypersphere with a variable radius which depends on the decision function. The hypersphere must not intersect the margin boundary $y_p g(x) = 1$ in more than one point. The radius is expressed in functional margin units, so its absolute value varies among solution candidates. For two solution candidates $g_1(x) = 0$ and $g_2(x) = 0$, where $g_2(x) = a g_1(x)$ and $a \neq 0$ (both hyperplanes have the same geometric location), the hyperspheres are $S_1(p, r)$ and $S_2(p, r/a)$, respectively (Fig. 1).
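The dependence of the radius on the solution candidate can be checked numerically. The sketch below assumes a linear decision function and a positive scaling factor $a$; the weight vector, the value of $\varphi_p$ and the function name are hypothetical.

```python
import math

def detractor_radius(w, phi):
    """Geometric radius of the detractor hypersphere: phi / ||w||."""
    return phi / math.sqrt(sum(wi * wi for wi in w))

# Two candidates with the same geometric hyperplane: g2(x) = a * g1(x).
w1 = [3.0, 4.0]             # hypothetical weight vector of g1, ||w1|| = 5
a = 2.0                     # hypothetical positive scaling factor
w2 = [a * wi for wi in w1]  # weight vector of g2, ||w2|| = 10

phi_p = 1.5
r1 = detractor_radius(w1, phi_p)
r2 = detractor_radius(w2, phi_p)

# The same phi_p gives a radius that shrinks by the factor a.
print(r1, r2, r1 / a)
```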
Fig. 1. Interpretation of detractors as dynamic hyperspheres. Two solution candidates for particular data are shown ($g_1(x)$ on the left and $g_2(x)$ on the right), with detractors visualized by circles. In the right panel, the radii of the detractor circles differ from those in the left panel in proportion to the changes of the functional margins for the detractors.
2.2
An Efficient Solution of the SVC Optimization Problem with Detractors
In order to construct an efficient algorithm for OP 2, its dual form was derived. The final form of the dual problem is

OP 3. Maximization of:
$$d(\alpha) = \alpha \cdot (1 + \varphi) - \frac{1}{2}\alpha^T Q \alpha$$
with constraints: $\alpha \cdot y = 0$, $0 \le \alpha \le C$, where $Q_{ij} = y_i y_j (a_i \cdot a_j)$ for all $i, j \in \{1..n\}$.
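A minimal sketch of how the dual objective of OP 3 and its constraints could be evaluated is given below. It uses plain Python lists and a precomputed kernel matrix; the two-sample data set and the function names are hypothetical and serve only to show the effect of the weights $\varphi$ on the linear term.

```python
def dual_objective(alpha, phi, y, K):
    """d(alpha) = sum_i alpha_i (1 + phi_i) - 0.5 * sum_ij alpha_i alpha_j y_i y_j K_ij."""
    n = len(alpha)
    linear = sum(alpha[i] * (1.0 + phi[i]) for i in range(n))
    quadratic = sum(alpha[i] * alpha[j] * y[i] * y[j] * K[i][j]
                    for i in range(n) for j in range(n))
    return linear - 0.5 * quadratic

def is_feasible(alpha, y, C, tol=1e-9):
    """Check alpha . y = 0 and 0 <= alpha_i <= C_i for every sample."""
    balanced = abs(sum(a * yi for a, yi in zip(alpha, y))) <= tol
    boxed = all(-tol <= a <= c + tol for a, c in zip(alpha, C))
    return balanced and boxed

# Hypothetical samples a1 = (1, 0) with y = +1 and a2 = (-1, 0) with y = -1,
# linear kernel K_ij = a_i . a_j.
K = [[1.0, -1.0], [-1.0, 1.0]]
y = [1, -1]
alpha = [0.5, 0.5]

without_detractors = dual_objective(alpha, [0.0, 0.0], y, K)
with_detractor = dual_objective(alpha, [0.0, 1.0], y, K)

print(is_feasible(alpha, y, [10.0, 10.0]))
# The two values differ exactly by alpha . phi = 0.5 * 1.0.
print(without_detractors, with_detractor)
```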
OP 3 differs from the original SVC dual form only by the term $\alpha \cdot \varphi$. As for the original SVC, nonlinear decision functions can be introduced in this formulation by using a kernel function instead of a scalar product. The final decision boundary has the form
$$g^*(x) = \sum_{i=1}^{n} y_i \alpha_i^* K(a_i, x) + b^* = 0,$$
where $K(\cdot, \cdot)$ is a kernel function. A sample $i$ is a support vector when $\alpha_i^* \neq 0$. Based on the Karush-Kuhn-Tucker complementary conditions
$$\alpha_i \left(y_i g(a_i) - 1 - \varphi_i + \xi_i\right) = 0, \qquad (C_i - \alpha_i)\,\xi_i = 0$$
we can conclude which samples can be support vectors. For the original SVC, only a sample which lies on or inside the margin boundaries can be a support vector. In the SVC with detractors, a sample fulfilling $\varphi_i > 0$ and lying outside the margin boundaries can also be a support vector. Such a sample is called a detractor support vector. An output model is defined in terms of support vectors. Introducing the new type of support vectors leads to richer models, in which additional samples lying outside the margin boundaries can participate in defining the decision function.
In order to solve OP 3, a decomposition method was derived, similar to SMO [13], which solves the original SVC dual optimization problem. For two chosen parameters $i_1$ and $i_2$, the solution without clipping is
$$\alpha_{i_2}^{new} = \alpha_{i_2} + \frac{y_{i_2}\left(E_{i_1} - E_{i_2}\right)}{\kappa},$$
where $\kappa = K_{i_1 i_1} + K_{i_2 i_2} - 2K_{i_1 i_2}$ and
$$E_i = \sum_{j=1}^{n} y_j \alpha_j K_{ij} - y_i - y_i \varphi_i. \qquad (1)$$
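After that, $\alpha_{i_2}$ is clipped in the same way as for SMO, but with variable weights $C_i$:
$$U \le \alpha_{i_2}^{clipped} \le V,$$
where for $y_{i_1} \neq y_{i_2}$:
$$U = \max\left(0,\ \alpha_{i_2}^{old} - \alpha_{i_1}^{old}\right), \qquad V = \min\left(C_{i_2},\ C_{i_1} - \alpha_{i_1}^{old} + \alpha_{i_2}^{old}\right),$$
and for $y_{i_1} = y_{i_2}$:
$$U = \max\left(0,\ \alpha_{i_1}^{old} + \alpha_{i_2}^{old} - C_{i_1}\right), \qquad V = \min\left(C_{i_2},\ \alpha_{i_1}^{old} + \alpha_{i_2}^{old}\right).$$
The parameter $\alpha_{i_1}$ is then $\alpha_{i_1}^{new} = \gamma - y_{i_1} y_{i_2} \alpha_{i_2}^{clipped}$, where $\gamma = \alpha_{i_1}^{old} + y_{i_1} y_{i_2} \alpha_{i_2}^{old}$. Based on the KKT complementary conditions, it is possible to derive equations for the SVC heuristic and the SVC stopping criteria. After incorporating the weights $\varphi$, the heuristic and the stopping criteria are almost the same, with the one difference that the values of $E_i$ are computed as stated in (1).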
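The two-parameter update described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the working pair is given rather than selected by a heuristic, the kernel matrix is precomputed, degenerate pairs with $\kappa \le 0$ are simply skipped, and the function names and the toy data are hypothetical.

```python
def error(i, alpha, y, phi, K):
    """E_i from equation (1): sum_j y_j alpha_j K_ij - y_i - y_i phi_i."""
    return (sum(y[j] * alpha[j] * K[i][j] for j in range(len(alpha)))
            - y[i] - y[i] * phi[i])

def smo_step(i1, i2, alpha, y, phi, C, K):
    """Return the updated pair (alpha_i1, alpha_i2) for one subproblem."""
    kappa = K[i1][i1] + K[i2][i2] - 2.0 * K[i1][i2]
    if kappa <= 0.0:
        return alpha[i1], alpha[i2]  # degenerate pair, skipped in this sketch
    e1 = error(i1, alpha, y, phi, K)
    e2 = error(i2, alpha, y, phi, K)
    unclipped = alpha[i2] + y[i2] * (e1 - e2) / kappa
    # Clipping bounds with per-sample weights C_i.
    if y[i1] != y[i2]:
        lower = max(0.0, alpha[i2] - alpha[i1])
        upper = min(C[i2], C[i1] - alpha[i1] + alpha[i2])
    else:
        lower = max(0.0, alpha[i1] + alpha[i2] - C[i1])
        upper = min(C[i2], alpha[i1] + alpha[i2])
    new_a2 = min(max(unclipped, lower), upper)
    # alpha_i1 follows from the equality constraint alpha . y = 0.
    gamma = alpha[i1] + y[i1] * y[i2] * alpha[i2]
    new_a1 = gamma - y[i1] * y[i2] * new_a2
    return new_a1, new_a2

# Hypothetical samples a1 = (1, 0), y = +1 and a2 = (-1, 0), y = -1, linear kernel.
K = [[1.0, -1.0], [-1.0, 1.0]]
y = [1, -1]
start = [0.0, 0.0]
C = [10.0, 10.0]

print(smo_step(0, 1, start, y, [0.0, 0.0], C, K))  # plain SVC
print(smo_step(0, 1, start, y, [1.0, 0.0], C, K))  # first sample is a detractor, d = 2
```

In this toy problem a single step already reaches the optimum: without detractors both multipliers equal 0.5, and with $\varphi_1 = 1$ they grow to 0.75, which corresponds to a larger weight vector and therefore a larger functional margin for the detractor.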
2.3
Reduce a Model by Removing Support Vectors
We use the method of removing support vectors to decrease the complexity of the SVC model. Reduced models are more suitable for further processing, e.g. for testing new samples. However, reduced models have the disadvantage that their generalization performance can be worse than that of the original full model. Reduced models were recently proposed for Support Vector Regression [6], which solves a regression problem. We propose a new method which generates reduced models for classification problems. The proposed method generates reduced models from the original full model with incorporated a priori knowledge in the form of detractors. Reduced models with the additional a priori knowledge have better generalization performance than reduced models without it.
The procedure is as follows. First, detractors are generated automatically from an existing solution by setting $\varphi_i = y_i g^*(a_i) - 1$ for the training samples for which this value is positive, i.e. $\varphi_i > 0$. Note that the number of detractors depends on the data. It is possible that no detractors are generated, namely for solutions in which all training samples are support vectors. In this situation, detractors could be generated automatically by adding new samples with functional margins greater than one, although this special case was not tested in this article. After that, a reduced model is generated by removing randomly selected support vectors, with a maximal removal ratio of p% of all training vectors, where p is a configurable parameter. Finally, we run the SVC method with the reduced data.
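The first two steps of this procedure can be sketched as below. The sketch assumes that the functional margins $y_i g^*(a_i)$ of the full model and the indices of its support vectors are already available; it removes as many support vectors as the limit of p% allows, which is one possible reading of the maximal removal ratio. All names and numbers are hypothetical.

```python
import random

def generate_detractors(margins):
    """phi_i = y_i g*(a_i) - 1 where this value is positive, otherwise 0."""
    return [max(0.0, m - 1.0) for m in margins]

def remove_support_vectors(n_samples, support_idx, p, rng):
    """Randomly drop support vectors, at most p% of all training vectors.

    Returns the indices of the training samples that are kept.
    """
    max_removed = int(n_samples * p / 100.0)
    count = min(max_removed, len(support_idx))
    removed = set(rng.sample(support_idx, count))
    return [i for i in range(n_samples) if i not in removed]

# Hypothetical functional margins of five training samples in the full model.
margins = [1.0, 2.5, 0.4, 1.0, 3.0]
support_idx = [0, 2, 3]  # samples on or inside the margin boundaries

phi = generate_detractors(margins)
kept = remove_support_vectors(len(margins), support_idx, p=40, rng=random.Random(0))

print(phi)   # samples 1 and 4 become detractors
print(kept)  # three samples remain; the detractors are never removed
```

The reduced training set `kept`, together with the weights `phi`, would then be passed to the SVC solver with detractors.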
3
Experiments
In the experiments, we show that reduced models with knowledge in the form of detractors perform better than reduced models without the additional knowledge. The first method does not use knowledge in the form of detractors in reduced models; the second one uses the additional knowledge. In the first experiment, we arbitrarily set p = 70. Note that, for comparison purposes, the reduced model is the same for both methods. We use the author's implementation of SVC for both methods. In the second experiment, we show that the proposed method performs better for variable p. For all data sets, every feature is scaled linearly to [0, 1], including the output. For variable parameters such as C, σ for the RBF kernel, ϕ for SVCR, and ε for ε-SVR, we use a grid search to find the best values. The number of values searched by the grid method is a trade-off between the accuracy and the speed of simulations. Note that for a particular data set it is possible to use more accurate grid searches than for massive tests with a large number of simulations.
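The preprocessing and the parameter search mentioned above can be sketched generically. The scoring function below is only a placeholder for the number of correctly classified samples obtained with a given parameter setting; the grid values and all names are hypothetical.

```python
from itertools import product

def scale_column(values):
    """Scale one feature (or the output) linearly to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column, mapped to 0 in this sketch
    return [(v - lo) / (hi - lo) for v in values]

def grid_search(score, grid):
    """Return the parameter combination with the highest score."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        value = score(params)
        if value > best_score:
            best_params, best_score = params, value
    return best_params, best_score

def toy_score(params):
    """Placeholder score: peaks at C = 10, sigma = 0.5."""
    return -abs(params["C"] - 10) - abs(params["sigma"] - 0.5)

print(scale_column([2.0, 4.0, 6.0]))
print(grid_search(toy_score, {"C": [1, 10, 100], "sigma": [0.1, 0.5, 1.0]}))
```

A finer grid increases the chance of finding good parameter values but multiplies the number of training runs, which is the trade-off between accuracy and speed noted above.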
Table 1. Description of test cases with results for synthetic data for generating reduced models by removing support vectors. Column descriptions: function – a function used for generating data, where $y_1 = \sum_{i=1}^{dim-1} x_i$, $y_4 = \left(\sum_{i=1}^{dim-1} x_i\right)^{kerP}$ and $y_5 = 0.5 \sum_{i=1}^{dim-1} \sin 10x_i + 0.5$; p – the parameter p, a value set means that results are averaged for two values of p, 20 and 50; simC – a number of simulations, results are averaged; σ – a standard deviation used for generating noise in output; ker – a kernel (pol – a polynomial kernel); kerP – a kernel parameter (for a polynomial kernel it is a dimension, for the RBF kernel it is σ); trs – a training set size; tes – a testing set size; dim – a dimension of the problem; tr12M – a percentage average difference in correctly classified samples for training data, if greater than 0 then a method with detractors is better; te12M – the same as tr12M, but for testing data; tr12MC – a percentage average difference in number of tests for training data in which a method with detractors is better (the method with detractors is better when a value is greater than 50%); te12MC – the same as tr12MC, but for testing data; s1 – an average number of support vectors for a method without detractors; s2 – an average number of support vectors for a method with detractors; d – an average number of detractors. The value 'var' means that we search for the best value by comparing the number of correctly classified samples for the training data.
function     p   simC  σ     ker  kerP  trs  tes  dim  tr12M  te12M  tr12MC  te12MC  s1  s2  d
y1           70  100   0.04  lin  –     90   300  5    64%    51%    94%     90%     14  18  17
y2 = 3y1     70  100   0.04  lin  –     90   300  5    5%     4%     50%     50%     20  21  10
y3 = 1/3y1   70  100   0.04  lin  –     90   300  5    70%    50%    95%     90%     12  15  20
y4           70  100   0.04  pol  3     90   300  5    1.5%   0%     50%     45%     20  20  5
y5           70  50    0.04  rbf  var   90   300  5    10%    6%     80%     60%     21  20  10
y6           70  10    0.04  rbf  var   90   300  5    3%     2%     30%     30%     25  25  5

3.1
Synthetic Data Tests
We compare both methods on data generated from particular functions with Gaussian noise added to the output values. We perform tests with a linear kernel on linear functions, with a polynomial kernel on the polynomial function, and with the RBF kernel on the sine function. The tests with results are presented in Tab. 1. The method with knowledge in the form of detractors has better performance for every kernel, while the number of support vectors is comparable. The testing performance gain varies from 0% to 51%.
3.2
Real World Data Sets
The real world data sets were taken from the LibSVM site [1] [9], except for the stock price data. The stock price data consist of monthly prices of the DJIA index from 1898 up to 2010. We generated the sample data set as follows: for every month, the output value is the growth or fall relative to the next month. Every feature i is the percentage price change between the month and the i-th previous month.
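The construction of the stock price samples can be sketched as follows. The sketch assumes that the percentage change is taken relative to the earlier month and that growth and fall are encoded as +1 and -1; neither detail is stated above, and the price series and the function name are hypothetical.

```python
def build_samples(prices, n_features):
    """Turn a monthly price series into (features, label) pairs.

    Feature i (1-based) is the percentage change between month t and
    month t - i. The label is +1 when the price grows in month t + 1
    and -1 otherwise.
    """
    samples = []
    for t in range(n_features, len(prices) - 1):
        features = [100.0 * (prices[t] - prices[t - i]) / prices[t - i]
                    for i in range(1, n_features + 1)]
        label = 1 if prices[t + 1] > prices[t] else -1
        samples.append((features, label))
    return samples

# Hypothetical monthly closing prices.
prices = [100.0, 110.0, 99.0, 108.9]
for features, label in build_samples(prices, n_features=2):
    print(features, label)
```

With two features, only the third month has both a full history and a following month, so the example yields a single sample.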
Table 2. Description of test cases with results for real world data for generating reduced models by removing support vectors. Column descriptions: name – a name of the test, breastCr – breastCancer test; p – the parameter p, a value set means that results are averaged for two values of p, 20 and 50; simT – a number of random simulations, where training data are randomly selected, results are averaged; ker – a kernel (pol – a polynomial kernel); kerP – a kernel parameter (for a polynomial kernel it is a dimension, for the RBF kernel it is σ); trs – a training set size; all – a number of all data, it is a sum of training and testing data; dim – a dimension of the problem; tr12M – a percentage average difference in correctly classified samples for training data, if greater than 0 then a method with detractors is better; te12M – the same as tr12M, but for testing data; tr12MC – a percentage average difference in number of tests for training data in which a method with detractors is better (a method with detractors is better when a value is greater than 50%); te12MC – the same as tr12MC, but for testing data; s1 – an average number of support vectors for a method without detractors; s2 – an average number of support vectors for a method with detractors; d – an average number of detractors. The value 'var' means that we search for the best value by comparing the number of correctly classified samples for the training data.

name       p   simT  ker  kerP  trs  all   dim  tr12M  te12M  tr12MC  te12MC  s1  s2  d
a1a1       70  100   lin  –     90   5835  123  20%    6%     85%     80%     17  16  16
a1a2       70  100   pol  3     90   5835  123  3%     1%     40%     75%     25  25  8
a1a3       70  20    rbf  var   90   5835  123  35%    5%     100%    70%     23  33  33
breastCr1  70  100   lin  –     90   639   10   40%    11%    65%     70%     7   7   25
breastCr2  70  100   pol  3     90   639   10   45%    27%    73%     82%     6   7   24
breastCr3  70  20    rbf  var   90   639   10   23%    10%    45%     60%     13  13  20
diabetes1  70  100   lin  –     90   768   8    8%     4%     65%     70%     15  15  13
diabetes2  70  100   pol  3     90   768   8    10%    2%     75%     60%     15  14  15
diabetes3  70  20    rbf  var   90   768   8    10%    5%     80%     60%     19  18  12
djia1      70  100   lin  –     90   1351  10   2%     0%     40%     40%     20  20  5
djia2      70  100   pol  3     90   1351  10   3%     0%     55%     50%     19  19  7
djia3      70  20    rbf  var   90   1351  10   4%     0%     50%     40%     25  23  4
In every simulation, training data are randomly chosen and the remaining samples become test data. The tests with results are presented in Tab. 2. The method with knowledge in the form of detractors has better performance for all data sets and for all kernels, with a similar number of support vectors. The testing performance gain varies from 0% to 27%. For the djia data set, results are comparable.
3.3
Variable p
In the second experiment, we show that for variable p the second method has better performance. The example results for the first test case from synthetic tests are depicted in Fig. 2.
Fig. 2. A comparison of two methods of removing support vectors for the first test case from Tab. 1. The x axis shows the parameter p in percent; the y axis shows the percentage difference in misclassified training samples (left panel) and misclassified testing samples (right panel) between the original method without removing support vectors and the method with the removal procedure applied. The line with + points represents the random removal method, while the line with x points represents the proposed removal method with knowledge in the form of detractors.
4
Conclusions
We show experimentally that knowledge in the form of detractors makes it possible to construct reduced SVC models from existing ones with better performance than reduced models without that knowledge. Removing support vectors generates new models with far fewer support vectors. When detractors are additionally used, the new models preserve good generalization performance.
Acknowledgments. The research is financed by the Polish Ministry of Science and Higher Education project No NN519579338. I would like to express my sincere gratitude to Professor Witold Dzwinel (AGH University of Science and Technology, Department of Computer Science) for contributing ideas, discussion and useful suggestions.
References
1. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001), software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
2. Fung, G.M., Mangasarian, O.L., Shavlik, J.: Knowledge-based nonlinear kernel classifiers. In: Schölkopf, B., Warmuth, M.K. (eds.) COLT/Kernel 2003. LNCS (LNAI), vol. 2777, pp. 102–113. Springer, Heidelberg (2003)
3. Fung, G.M., Mangasarian, O.L., Shavlik, J.W.: Knowledge-based support vector machine classifiers. In: Becker, S., Thrun, S., Obermayer, K. (eds.) Advances in Neural Information Processing Systems, vol. 15, pp. 521–528. MIT Press, Cambridge (2003)
4. Jean-Baptiste Pothin, C.R.: Incorporating prior information into support vector machines in the form of ellipsoidal knowledge sets (2006)
5. Joachims, T.: Transductive inference for text classification using support vector machines. In: ICML 1999: Proceedings of the Sixteenth International Conference on Machine Learning, pp. 200–209. Morgan Kaufmann Publishers Inc., San Francisco (1999)
6. Karasuyama, M., Takeuchi, I., Nakano, R.: Reducing svr support vectors by using backward deletion. In: Lovrek, I., Howlett, R.J., Jain, L.C. (eds.) KES 2008, Part III. LNCS (LNAI), vol. 5179, pp. 76–83. Springer, Heidelberg (2008)
7. Lauer, F., Bloch, G.: Incorporating prior knowledge in support vector machines for classification: A review. Neurocomput. 71(7-9), 1578–1594 (2008)
8. Le, Q.V., Smola, A.J., Gärtner, T.: Simpler knowledge-based support vector machines. In: ICML 2006: Proceedings of the 23rd International Conference on Machine Learning, pp. 521–528. ACM, New York (2006)
9. Libsvm data sets, http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
10. Lin, C.F., Wang, S.D.: Fuzzy support vector machines. IEEE Transaction on Neural Networks 13(2), 464–471 (2002)
11. Mangasarian, O.L., Wild, E.W.: Nonlinear knowledge-based classification. IEEE Transactions on Neural Networks 19(10), 1826–1832 (2008)
12. Orchel, M.: Support vector machines: Sequential multidimensional subsolver (sms). In: Dabrowski, A. (ed.) Signal Processing: Algorithms, Architectures, Arrangements, and Applications SPA 2007. IEEE - The Institute of Electrical and Electronics Engineers Inc. Region 8 - Europe, Middle East and Africa. Chapter Circuits and Systems. Poland Section. Poznan University of Technology. Faculty of Computing Science and Management. Division of Signal Processing and Electronic Systems, pp. 135–140 (September 2007)
13. Platt, J.C.: Fast training of support vector machines using sequential minimal optimization, pp. 185–208. MIT Press, Cambridge (1999)
14. Vapnik, V.N.: Statistical Learning Theory. Wiley-Interscience, Hoboken (1998)
15. Wang, L., Xue, P., Chan, K.L.: Incorporating prior knowledge into svm for image retrieval. In: ICPR 2004: 17th International Conference on Proceedings of the Pattern Recognition, vol. 2, pp. 981–984. IEEE Computer Society, Washington, DC (2004)
16. Wang, M., Yang, J., Liu, G.P., Xu, Z.J., Chou, K.C.: Weighted-support vector machines for predicting membrane protein types based on pseudo amino acid composition. Protein Engineering, Design & Selection 17(6), 509–516 (2004)
17. Wu, X., Srihari, R.: Incorporating prior knowledge with weighted margin support vector machines. In: KDD 2004: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 326–333. ACM, New York (2004)
A Hybrid AIS-SVM Ensemble Approach for Text Classification
Mário Antunes1,3, Catarina Silva1,2, Bernardete Ribeiro2, and Manuel Correia3
1 Computer Science Communication and Research Centre, School of Technology and Management, Polytechnic Institute of Leiria, Portugal {mario.antunes,catarina}@ipleiria.pt
2 Department of Informatics Engineering, Center for Informatics and Systems of the University of Coimbra (CISUC), Portugal {catarina,bribeiro}@dei.uc.pt
3 Faculty of Science, University of Porto, Center for Research in Advanced Computing Systems (CRACS), Portugal [email protected]
Abstract. In this paper we propose and analyse methods for improving on state-of-the-art performance in text classification. We put forward an ensemble-based structure that includes Support Vector Machines (SVM) and Artificial Immune Systems (AIS). The underpinning idea is that SVM-like approaches can be enhanced with AIS approaches, which can capture dynamics in models. While having radically different genesis, and probably because of that, SVM and AIS can cooperate in a committee setting, using a heterogeneous ensemble to improve overall performance, with the confidence of each system in its classification as the differentiating factor. Results on the well-known Reuters-21578 benchmark are presented, showing promising classification performance gains, resulting in a classification that improves upon all baseline contributors of the ensemble committee.
Keywords: Artificial Immune System, Support Vector Machine, Text Classification, Tunable Activation Threshold, Ensembles, Hybrid System.
1
Introduction
In the last decades the production of textual documents in digital form has increased exponentially, due to the increased availability of hardware and software [1]. As a consequence, there is an ever-increasing need for automated solutions to organize the huge amount of digital texts produced, in applications such as document processing and visualization, Web mining, digital information search and patent analysis. The task in text classification is often defined as assigning previously defined classes to documents (natural language texts) by analysing their content. While many techniques have been used successfully to tackle the problem of text classification, current research is focused on kernel-based algorithms, mainly due to their accuracy and the sparsity of the final solution. Examples are Vapnik's Support Vector Machine (SVM) [2], which implements the principle of structural risk minimization, and different solutions based on committees of kernel-based machines, such as boosting.
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 342–352, 2011. © Springer-Verlag Berlin Heidelberg 2011
Artificial Immune Systems (AIS) [3], on the other hand, are a very active field of research. AIS take advantage of the cognitive features that the vertebrate Immune System (IS) uses to defend the body from external agents (pathogens). These features are expressed on two temporal scales: one corresponding to the somatic experience of each individual throughout their life and another related to the germ-line history of the species [4]. The former is related to the fact that each one of us is continuously exposed to a myriad of unseen pathogens, relying on our own IS to distinguish, at each given moment in time, elements that belong to the organism's own healthy cells and tissues (self) from those that may correspond to a harmful pathogen (non-self). The latter assumes that the capacity to detect open-ended abnormal behavior (anomalies) has been developed by natural selection during the evolution of the IS, tuning its innate functions of defence to appropriate values, similar in all individuals of the same species. The IS thus provides a very appealing and rich source of inspiration for the development of innovative detection systems applied to dynamic real world environments, like network intrusion detection [5] and spam filtering [6,7]. These are clear examples in which the detection system is obliged to continuously adjust itself according to the temporal events it processes. There are also some examples of AIS applied to text classification [8, 9, 10]. In [10] an artificial immune system approach to semantic document classification is presented, centering the goals on semantic interpretation rather than text classification. In [8] an agent-based model to classify biomedical articles is introduced, but results are still far from state-of-the-art.
In [9] a statistical model is described to detect anomalies based on self/non-self discrimination in strings.
In this work, the underpinning idea for the proposed framework is that SVM-like approaches and AIS approaches, while having radically different genesis, and probably because of that, can cooperate in a committee setting, using a heterogeneous ensemble to improve overall performance. The cutting-edge performance of SVM is enhanced with the AIS capability of grasping the dynamics of concepts present in real data sets. We introduce a framework where SVM and AIS share data and participate as equal partners, providing classifications and confidence levels to obtain a resulting classification that improves on all baseline contributors of the ensemble committee.
The rest of the paper is organized as follows. We start by presenting in Section 2 the fundamentals of the baseline AIS and SVM learning systems. We then proceed in Section 3 by describing the proposed hybrid AIS-SVM ensemble framework. Then, we show and discuss the results obtained on processing the Reuters-21578 data set. Finally, in Section 5 we present the conclusions of our work and outline some future work.
2
Background
Here we describe the fundamentals of AIS, SVM and committee-based learning.
2.1
An Immune Model Inspired by Tunable Activation Thresholds
The two most popular immunological theories used in AIS deployments for anomaly detection are Negative Selection (NS) and Danger Theory (DT). Despite the promising results achieved thus far, they have proved to have some well documented difficulties in dealing with real world problems [5]. More recently, a new branch of immunological theories has been applied in new AIS deployments for anomaly detection. One such theory is the Tunable Activation Threshold (TAT), which postulates that self tolerance and non-self discrimination result from the tunable adjustment of the activation thresholds of immune cells [11, 12]. Generally speaking, in such a model, immune cells (like T-cells) tune and update their responsiveness according to the stimuli received from the environment over time. Each antigen undergoes a phagocytosis process which generates a set of corresponding peptides, each identified by a pattern representative (ligand). These peptides are presented to the T-cell repertoire by a specific kind of cell, named the Antigen Presenting Cell (APC). For each presented peptide, the stimulus, or signal, provokes a perturbation that is measured as a function of its concentration in the APC and the affinity between its ligand and the T-cell pattern representative, the T-cell Receptor (TCR). Thus, the higher the concentration of a peptide and/or its affinity with the TCR, the higher the perturbation received by the T-cell. We adopted a minimal TAT model derived from [12] in which the activation threshold of a cell is tunable by the activity of two specific enzymes that respond to antigenic signals (S): Kinase (K) and Phosphatase (P).
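To make the tuning dynamics of this minimal model concrete, the sketch below simulates the two enzyme levels with the update rule that equations (1) and (2) formalise: each level moves towards the target $(S + S_0) \cdot \tau$ by at most $\phi \cdot t$ per iteration. All numeric values are hypothetical and were chosen only so that the slopes satisfy $\phi_K > \phi_P$ and the turnover points satisfy $\tau_P > \tau_K$; the function names are not taken from the TAT simulator.

```python
def tune(previous, signal, basal_signal, tau, phi, t):
    """One update of an enzyme level (K or P) towards (S + S0) * tau."""
    target = (signal + basal_signal) * tau
    if target > previous:
        return min(target, previous + phi * t)
    return max(target, previous - phi * t)

def simulate(signals, k0, p0, s0, tau_k, tau_p, phi_k, phi_p, t=1.0):
    """Return (K, P, active) per iteration; a cell is active when K > P."""
    k, p = k0, p0
    history = []
    for s in signals:
        k = tune(k, s, s0, tau_k, phi_k, t)
        p = tune(p, s, s0, tau_p, phi_p, t)
        history.append((k, p, k > p))
    return history

# A persistent signal S = 4 applied for 30 iterations (hypothetical values).
history = simulate([4.0] * 30, k0=1.0, p0=1.5, s0=1.0,
                   tau_k=1.0, tau_p=1.5, phi_k=0.5, phi_p=0.2)
active_steps = [i + 1 for i, (_, _, active) in enumerate(history) if active]
print(active_steps)
```

With these values K overtakes P shortly after the signal starts and P overtakes K again once the signal persists, so the activation is only transient.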
Assuming $\{P_0, K_0\}$ as the basal values, for each time iteration $i$ the values of $K$ and $P$ are given by equations (1) and (2):
$$K_i = \begin{cases} \min\left((S + S_0) \cdot \tau_K,\ K_{i-1} + \phi_K \cdot t\right), & \text{if } (S + S_0) \cdot \tau_K > K_{i-1} \\ \max\left((S + S_0) \cdot \tau_K,\ K_{i-1} - \phi_K \cdot t\right), & \text{otherwise} \end{cases} \qquad (1)$$
$$P_i = \begin{cases} \min\left((S + S_0) \cdot \tau_P,\ P_{i-1} + \phi_P \cdot t\right), & \text{if } (S + S_0) \cdot \tau_P > P_{i-1} \\ \max\left((S + S_0) \cdot \tau_P,\ P_{i-1} - \phi_P \cdot t\right), & \text{otherwise} \end{cases} \qquad (2)$$
Generally, if a T-cell receives a signal ($S > 0$), $K$ and $P$ increase linearly until a turnover point ($\tau_K$ and $\tau_P$) is reached. The slopes for $K$ and $P$, as well as the rate of growth, are defined by $\phi_K$, $\phi_P$ and $t$ respectively. Similarly, in the absence of a signal, $K$ returns to the basal level at a faster rate than $P$. It is also assumed that T-cell activation is a switch-type response that requires $K$ to supersede $P$, at least transiently. Thus, for the same signal, $K$ increases faster than $P$ ($\phi_K > \phi_P$), but if the signal persists $P$ will supersede $K$ and reach a higher plateau ($\tau_P > \tau_K$). According to the TAT model, those auto-reactive T-cells that are continuously stimulated by self antigens end up adapting their level of responsiveness, which prevents them from mounting an immune response. On the other hand, those that are sporadically stimulated with a strong stimulus become activated and start an immune response [11].
In order to strengthen the recent temporal events a T-cell has been exposed to, $S$ is calculated as a function of the affinity between the TCR and the peptide ligands that exist in the APC lifespan (LS). This means that, for each T-cell, $S$ reflects not only the signal sent by the bound peptides in the APC, but also by others, such as those that have been recently processed and memorised in the APCs whose lifetime has not yet expired [13]. The immune response is population based, instead of being a simple consequence of the activation of just one single cell [11, 12]. Thus, in the TAT model, the classification of each APC is decided by a committee of T-cells that become active (with $K > P$) for each processed APC, with its threshold termed $Ct$. This parameter starts with a predefined reasonable value and is adjusted at run time, by a fixed value $Inc$, according to the observed evidence. TAT behavior is reproduced by a generic and context-independent TAT simulator for anomaly detection [13].
In order to treat text classification as an anomaly detection task, for each category of the Reuters-21578 data set we label the positive examples as “Alert” and the remaining ones as corresponding to the “normal” behaviour. In this way, a trigger should be raised in the presence of an example of the category we are looking for. In text classification an APC corresponds to a text document and its peptide ligands are the words in it. The T-cell repertoire corresponds to the list of words managed by the system, which tries to bind those presented in each document. For the sake of simplicity, the affinity between the strings representing T-cells and peptides is equal to 1 if the strings are equal and zero otherwise.
2.2
Support Vector Machines
SVMs are a learning method introduced by Vapnik [2], based on his Statistical Learning Theory and the Structural Risk Minimization principle. When using SVMs for classification, the basic idea is to find the optimal separating hyperplane between the positive and negative examples. The optimal hyperplane is defined as the one giving the maximum margin between the training examples that are closest to it. Support vectors are the examples that lie closest to the separating hyperplane. Once this hyperplane is found, new examples can be classified simply by determining on which side of the hyperplane they lie.

Although text categorization is a multi-class, multi-label problem, it can be broken into a number of binary class problems without loss of generality. This means that instead of classifying each document into all available categories, for each pair {document, category} we have a two-class problem: the document either belongs to the category or it does not. Although there are several linear classifiers that can separate both classes, only one, the Optimal Separating Hyperplane, maximizes the margin, i.e., the distance to the nearest data point of each class, thus presenting better generalization potential. The output of a linear SVM is $u = w \cdot x + b$, where $w$ is the weight vector normal to the hyperplane and $x$ is the input vector. Maximizing the margin can be seen as an optimization problem:

\[
\text{minimize } \frac{1}{2}\|w\|^2 \quad \text{subject to } y_i (w \cdot x_i + b) \ge 1,\ \forall i,
\tag{3}
\]

where $x_i$ is the $i$-th training example and $y_i$ is its correct output. Intuitively, the classifier with the largest margin will give low expected risk, and hence better generalization. To deal with the constrained optimization problem in (3), Lagrange multipliers $\alpha_i \ge 0$ and the Lagrangian (4) can be introduced:

\[
L_P \equiv \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i \big( y_i (w \cdot x_i + b) - 1 \big).
\tag{4}
\]
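As a short worked step, assuming the standard derivation for the hard-margin case (it is not spelled out in the text), setting the derivatives of $L_P$ in (4) with respect to $w$ and $b$ to zero gives

```latex
\[
\frac{\partial L_P}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{l} \alpha_i y_i x_i,
\qquad
\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0,
\]
\[
L_D = \sum_{i=1}^{l} \alpha_i
      - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \,(x_i \cdot x_j),
\qquad \alpha_i \ge 0 .
\]
```

Here $L_D$ is what remains of (4) after substituting the two conditions back in; it depends on the training examples only through the dot products $x_i \cdot x_j$, which is where a kernel function can later be substituted.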
The Lagrangian has to be minimized with respect to the primal variables $w$ and $b$ and maximized with respect to the dual variables $\alpha_i$ (i.e. a saddle point has to be found) [14]. SVMs are universal learners. In their basic form, SVMs learn linear threshold functions. However, using an appropriate kernel function, they can be used to learn polynomial classifiers, radial-basis function networks and three-layer sigmoid neural networks.

2.3 Committee Classification Approaches
Classifier committees or ensembles are based on the idea that, given a task that requires expert knowledge, k experts may perform better than one, if their individual judgments are appropriately combined. A classifier committee is then characterized by (i) a choice of k classifiers, and (ii) a choice of a combination function, usually denominated a voting algorithm. The classifiers should be as independent as possible to guarantee a large number of inductions on the data. By using different classifiers to exploit diverse patterns of errors to make the ensemble better than just the sum (or average) of the parts, we can obtain a gain from potential synergies existing between the different ensemble classifiers [15].
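The combination function described above can be sketched as a plain majority vote. This is a minimal illustration only: the function name `majority_vote` and the convention of returning `None` on a tie are assumptions, not part of the paper.

```python
from collections import Counter


def majority_vote(votes):
    """Combine the labels predicted by k classifiers.

    Returns the most frequent label, or None when the vote is empty or
    the two most frequent labels are tied (hypothetical convention).
    """
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: the committee cannot decide on its own
    return ranked[0][0]
```

With an odd number of binary classifiers a tie cannot occur; with an even number the caller needs a separate rule for the `None` case.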
3 Proposed Approach
This section presents the proposed AIS-SVM ensemble structure. There are several methods to create the set of elements in an ensemble, such as using different training samples, applying diverse preprocessing methods or using various learning parameters. Their results can also be combined in a number of ways, such as weighted averaging or majority voting. Having in this case two radically different approaches with which to structure an ensemble framework, we defined a two-level hybrid model, illustrated in Figure 1, that joins the predictions of both SVM and TAT-based models. During the training phase the models are dealt with separately, i.e. a number n of classifiers is generated by varying the SVM parameters and a number m of classifiers is generated by varying the TAT parameters. In the testing phase, each model is first called to independently classify a testing example, and then two sets are constructed, one for each type of model (SVM and TAT). We then apply a majority voting strategy to each set to define its decision, i.e. whether the document is a positive or a negative example of the class.
Fig. 1. TAT based and SVM hybrid model for text classification
When both the SVM and TAT sets agree on the classification of the testing example, the two-level model directly outputs their consensus decision. However, if the majority votes of the two sets disagree or tie (ties can happen when n or m is even), a different algorithm must be in place. We defined a heuristic voting rule based on a maximum confidence factor, D, of each set decision, as described in Algorithm 1. The set with the higher confidence defines the output of the two-level hybrid model in Figure 1. To linearly scale the confidence, D must be the same for both sets of models. In our experiments, detailed in Section 4, we used n = 3, m = 4 and D = 4.
Algorithm 1. Heuristic voting rule (statements recovered from the source; the branch conditions are not recoverable)
  sum = Σ_{i=1}^{n} SVM_i
  base = 1 | base = 0.5
  pred = 0.5 | pred = 1
  SVMConfidence = base * pred
  linearscale(SVMConfidence, 0, D)
  TATConfidence = 1 | TATConfidence = 0 | TATConfidence = linearscale(agree/disagree, 0, 1)
  linearscale(TATConfidence, 0, D)
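The two-level decision can be sketched as follows. This is not Algorithm 1 itself: the confidence of each set is assumed here to be its normalized vote margin, and ties between the two sets are assumed to favour the SVM set. Both choices, and all function names, are hypothetical stand-ins used only to show the control flow (consensus first, scaled confidence otherwise).

```python
def set_decision(votes):
    """Majority decision of one set of binary (+1 / -1) classifiers.

    Returns (label, confidence) with label 0 on a tie. Confidence is the
    normalized vote margin in [0, 1]; this is an assumed stand-in for
    the confidence computed in Algorithm 1.
    """
    pos = sum(1 for v in votes if v == 1)
    neg = len(votes) - pos
    if pos == neg:
        return 0, 0.0
    label = 1 if pos > neg else -1
    return label, abs(pos - neg) / len(votes)


def linear_scale(value, low, high):
    """Map a value in [0, 1] linearly onto [low, high]."""
    return low + value * (high - low)


def two_level_decision(svm_votes, tat_votes, d=4):
    """Output the consensus if both sets agree, else the more confident set."""
    svm_label, svm_conf = set_decision(svm_votes)
    tat_label, tat_conf = set_decision(tat_votes)
    if svm_label == tat_label and svm_label != 0:
        return svm_label
    # Disagreement or tie: compare confidences on the common scale [0, D].
    svm_score = linear_scale(svm_conf, 0, d)
    tat_score = linear_scale(tat_conf, 0, d)
    # Assumption: equal scores are resolved in favour of the SVM set.
    return svm_label if svm_score >= tat_score else tat_label
```

With n = 3 and m = 4 as in the experiments, only the TAT set can tie, in which case its confidence is 0 and the SVM set decides.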
4 Experimental Evaluation and Results

4.1 Reuters-21578 Benchmark
The widely used Reuters-21578 benchmark was used in the experiments. It is a financial corpus of news articles averaging 200 words each. Reuters-21578 is publicly available¹ and its corpus has 21,578 documents classified into 118 categories. It is a very heterogeneous corpus: there are documents not assigned to any of the categories and documents assigned to more than 10 categories. The number of documents assigned to each category also varies widely, with some categories having only one assigned document and others having thousands. The ModApte split was used, with 75% of the articles (9603 items) for training and 25% (3299 items) for testing. Table 1 presents the 10 most frequent categories and the number of positive training and testing examples. These 10 categories are widely accepted as a benchmark, since 75% of the documents belong to at least one of them.

¹ http://kdd.ics.uci.edu/databases/reuters-21578/reuters21578.html

Table 1. Number of positive training and testing documents for the Reuters-21578 most frequent categories

Category   Train  Test    Category  Train  Test
Earn       2715   1044    Trade     346    113
Acq        1547   680     Interest  313    121
Money-fx   496    161     Ship      186    89
Grain      395    138     Wheat     194    66
Crude      358    176     Corn      164    52

4.2 Data Set Analysis for TAT Processing
In the TAT model the activation threshold of each T-cell is adjusted on a temporal basis and its value reflects the history of interactions with the environment, measured by signal intensity. When applied to text classification, this signal intensity reflects the concentration of words in each document presented in a time-ordered data set. Thus, a data set for which we may expect a good performance with TAT should have two properties. It has to have a comprehensive set of words that appear recurrently through time, thus inducing a subset of the T-cell repertoire to become quiescent; and it also has to have another set of words that appear sporadically but with a high concentration, thus allowing a group of T-cells in the repertoire to be activated in the presence of such a strong signal. Figure 2 illustrates the distribution of peptides among the classes of documents in the data set. Of the ten data sets of Reuters-21578, only in the data set related to the earn category are we able to find a clear distinction between those two classes (Figure 2(a)). In the remaining data sets the shape is similar to those shown in Figures 2(b) and 2(c). In these cases the normal behavior is dominant, in that its representative words appear in much larger numbers than those representative of anomalous behavior (class "Alert"). Figure 2(d) stresses this fact by depicting the occurrences of each word in both classes, for all the categories.
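The kind of per-class word count that underlies such a distribution can be sketched as below. The input format (a list of `(tokens, label)` pairs with labels "Normal" and "Alert") and the function name are assumptions made for illustration; the paper does not describe its preprocessing code.

```python
from collections import Counter


def word_counts_by_class(documents):
    """Count word occurrences separately for the Normal and Alert classes.

    `documents` is assumed to be an iterable of (tokens, label) pairs,
    where label is "Normal" or "Alert" (hypothetical format).
    """
    counts = {"Normal": Counter(), "Alert": Counter()}
    for tokens, label in documents:
        counts[label].update(tokens)
    return counts


# Toy example: "the" is recurrent in Normal documents, while "profit"
# occurs only in an Alert document.
docs = [
    (["the", "profit", "net"], "Alert"),
    (["the", "oil"], "Normal"),
    (["the", "net"], "Normal"),
]
counts = word_counts_by_class(docs)
```

Words with high counts in both classes correspond to the recurrent signal that drives T-cells to quiescence, while words concentrated in the Alert class are the candidates for activation.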
4.3 Performance Metrics
In order to evaluate a binary decision task we first define a contingency matrix representing the possible outcomes of the classification, namely the True Positive (TP - positive examples classified as positive), the True Negative (TN - negative examples classified as negative), the False Positive (FP - negative examples classified as positive) and the False Negative (FN - positive examples classified as negative).
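The contingency counts above can be computed and turned into the usual scores as in the following sketch. It uses the standard definitions of precision, recall and the van Rijsbergen F-measure; the function names and the convention of returning 0.0 for an undefined ratio are assumptions.

```python
def contingency(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary decision task."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def f_beta(tp, fp, fn, beta=1.0):
    """F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0  # assumed convention when the score is undefined
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


def macro_f1(per_category_counts):
    """Average F1 over categories; each item is a (TP, FP, FN) triple."""
    scores = [f_beta(tp, fp, fn) for tp, fp, fn in per_category_counts]
    return sum(scores) / len(scores)
```

Macro-averaging gives every category the same weight regardless of its size, which matters for a corpus as unbalanced as Reuters-21578.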
[Figure 2 panels, each titled "Distribution of words in normal and alert classes": (a) Category=earn, (b) Category=grain, (c) Category=wheat, (d) All categories (acq, crude, corn, earn, grain, interest, money-fx, ship, trade, wheat). Axes: words (x) versus frequency (y); panels (a)-(c) show the Normal and Alert classes.]
Fig. 2. Words distribution by class in the Reuters-21578 data set
Several measures have been defined based on this contingency table, such as the error rate, $\frac{FP + FN}{TP + FP + TN + FN}$, recall, $R = \frac{TP}{TP + FN}$, and precision, $P = \frac{TP}{TP + FP}$, as well as combined measures such as the van Rijsbergen $F_\beta$ measure, which combines recall and precision in a single score:

\[
F_\beta = \frac{(\beta^2 + 1)\, P\, R}{\beta^2 P + R}.
\]

The latter is one of the best suited measures for text classification when used with $\beta = 1$, i.e. $F_1$, and thus the results reported in this paper are macro-averaged $F_1$ values.

4.4 Results and Analysis
Our working hypothesis is that an AIS-SVM ensemble model is able to produce a better text classification than each model in isolation. According to TAT, this is achieved by a self/non-self distinction process based on the temporal historic frequencies of patterns presented in past documents. Through time, the T-cells that recognise frequent patterns become inactive and evolve to a quiescent state, while those that detect sporadic patterns within APCs with a reasonable concentration become reactive, thus initiating an immune response. We conducted experiments with the earn data set using the processing parameters and criteria described below. For SVM we also explored different parameters², resulting in three different learning machines:

• SVM1: Linear default kernel
• SVM2: Linear kernel with the trade-off C between training error and margin set to 100
• SVM3: Linear kernel with the cost-factor (by which training errors in positive examples outweigh errors in negative examples) set to 2

² http://svmlight.joachims.org

For TAT we used a set of fixed values for LS, Ct and Inc, together with a Latin Hypercube (LHC) sampling generator to obtain the multidimensional squares for the remaining parameters φ, τ and t. The TAT training phase uses two distinct data sets: the validation data set, which has only examples of the earn class, and the calibration data set, which contains examples of all the classes and is used to test the parameter sets suggested by the LHC sampling generator. We then ran each parameter set against the training data set; the following achieved the best performance:

• TAT1: φ = 0.038; τ = 0.939; t = 0.00774; LS = 5; Ct = 0.05; Inc = 0.005
• TAT2: φ = 0.038; τ = 0.939; t = 0.00774; LS = 15; Ct = 0.05; Inc = 0.005
• TAT3: φ = 0.031; τ = 0.921; t = 0.00890; LS = 5; Ct = 0.05; Inc = 0.005
• TAT4: φ = 0.062; τ = 0.942; t = 0.00730; LS = 5; Ct = 0.05; Inc = 0.005
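To make the roles of φ, τ and t concrete, the update rule of equations (1) and (2) in Section 2.1 can be sketched as below. The numeric values are illustrative only and are not the tuned values listed above; separate slopes and turnover factors for K and P are used, as in Section 2.1, and the basal values are assumed to be the no-signal fixed points S0·τ.

```python
def tat_step(prev, signal, s0, tau, phi, t):
    """One update of K or P following the form of equations (1)-(2)."""
    target = (signal + s0) * tau
    if target > prev:
        return min(target, prev + phi * t)  # rise linearly, capped at target
    return max(target, prev - phi * t)      # decay linearly, floored at target


def simulate(signals, s0=1.0, tau_k=1.0, tau_p=1.2,
             phi_k=0.5, phi_p=0.1, t=1.0):
    """Trace (K, P, active) over a signal sequence; active means K > P.

    All default values are hypothetical, chosen so that phi_k > phi_p
    and tau_p > tau_k as the model requires.
    """
    k, p = s0 * tau_k, s0 * tau_p  # assumed basal values K0, P0
    trace = []
    for s in signals:
        k = tat_step(k, s, s0, tau_k, phi_k, t)
        p = tat_step(p, s, s0, tau_p, phi_p, t)
        trace.append((k, p, k > p))
    return trace


# A persistent signal: K overtakes P at first, then P reaches the
# higher plateau and the cell stops being active.
trace = simulate([4.0] * 60)
```

Under a persistent signal the cell is active only transiently, which is the adaptation behaviour the model attributes to continuously stimulated T-cells.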
Table 2 shows the results obtained with the AIS-SVM hybrid model described in Section 3. The performance attained by each model is presented, as well as the combined performance obtained with the ensemble model. From the experimental results we observe an improvement over the results previously achieved by the standalone processing of the ensemble's member models. Although by a slight margin, the ensemble model was able to outperform the previous global F1 results achieved with SVM processing alone, mainly due to the decrease in false positives. Despite their differences, we also observed that the union of these paradigms may bring substantial benefits to the final classification decision, by taking advantage of the individual features of each approach. On the one hand, SVM is currently the state-of-the-art algorithm for text classification. On the other hand, the temporal self/non-self discrimination carried out by the immune system strongly inspires the use of AIS for dynamic environments where the meaning of self and non-self changes over time, like text classification and spam detection.

Table 2. Results obtained with immune-SVM hybrid model
[Table body garbled in the source: the row labels TAT1-TAT4 and SVM1-SVM3 survive, but the numeric entries are not recoverable.]
5 Conclusions
We presented a hybrid approach for text classification, based on the ensemble of two rather different classification paradigms: a non-adaptive machine learning SVM implementation and an immune-inspired approach based on the tunable activation thresholds of immune cells. Although they are grounded on different learning fundamentals, both approaches individually revealed distinctive features suitable for text classification. Regarding the generic TAT-based AIS framework previously deployed [13], it was also possible to confirm its flexibility in processing the Reuters-21578 training and testing data sets, by converting the text classification into a binary classification problem. The preliminary results obtained thus far with this ensemble approach are encouraging enough to proceed with this line of research. Further developments will be directed towards enhancing the preprocessing phase, since we are confident that this hybrid model may also produce satisfactory results in the classification of the other, not yet covered, Reuters-21578 document classes. We also intend to apply this hybrid model to other contextual environments, for example those related to spam filtering.
References
1. Sebastiani, F.: Classification of text, automatic. In: Brown, K. (ed.) The Encyclopedia of Language and Linguistics, vol. 14, pp. 457–462. Elsevier, Amsterdam (2006)
2. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1999)
3. de Castro, L., Timmis, J.: Artificial Immune Systems: A New Computational Intelligence Approach. Springer, Heidelberg (2002)
4. Cohen, I.: Tending Adam's Garden: evolving the cognitive immune self. Academic Press, San Diego (2004)
5. Kim, J., Bentley, P., Aickelin, U., Greensmith, J., Tedesco, G., Twycross, J.: Immune system approaches to intrusion detection - a review. Natural Computing 6(4), 413–466 (2007)
6. Abi-Haidar, A., Rocha, L.: Adaptive Spam Detection Inspired by the Immune System. In: Proc. of the 11th Int. Conference on the Simulation and Synthesis of Living Systems, vol. 11, pp. 1–8 (2008)
7. Oda, T., White, T.: Immunity from spam: An analysis of an artificial immune system for junk email detection. In: Jacob, C., Pilat, M.L., Bentley, P.J., Timmis, J.I. (eds.) ICARIS 2005. LNCS, vol. 3627, pp. 276–289. Springer, Heidelberg (2005)
8. Abi-Haidar, A., Rocha, L.: Biomedical article classification using an agent-based model of T-cell cross-regulation. In: Hart, E., McEwan, C., Timmis, J., Hone, A. (eds.) ICARIS 2010. LNCS, vol. 6209, pp. 237–249. Springer, Heidelberg (2010)
9. Pöllä, M.: A generative model for self/non-self discrimination in strings. In: Kolehmainen, M., Toivanen, P., Beliczynski, B. (eds.) ICANNGA 2009. LNCS, vol. 5495, pp. 293–302. Springer, Heidelberg (2009)
10. Greensmith, J., Cayzer, S.: An artificial immune system approach to semantic document classification. In: Timmis, J., Bentley, P.J., Hart, E. (eds.) ICARIS 2003. LNCS, vol. 2787, pp. 136–146. Springer, Heidelberg (2003)
11. Grossman, Z., Paul, W.: Adaptive cellular interactions in the immune system: The tunable activation threshold and the significance of subthreshold responses. Proc. National Academy of Sciences 89(21), 10365–10369 (1992)
12. Carneiro, J., Paixão, T., Milutinovic, D., Sousa, J., Leon, K., Gardner, R., Faro, J.: Immunological self-tolerance: Lessons from mathematical modeling. J. Computational and Applied Mathematics 184(1), 77–100 (2005)
13. Antunes, M., Correia, M.: Self tolerance by tuning t-cell activation: an artificial immune system for anomaly detection. In: Lnicst, S. (ed.) Bionetics (2010)
14. Schölkopf, B., Burges, C., Smola, A.: Advances in Kernel Methods: Support Vector Machines. MIT Press, Cambridge (1998)
15. Kuncheva, L.: Combining Pattern Classifiers: Methods and Algorithms. Wiley, Chichester (2004)
Regression Based on Support Vector Classification

Marcin Orchel
AGH University of Science and Technology, Mickiewicza Av. 30, 30-059 Kraków, Poland
[email protected]
Abstract. In this article, we propose a novel regression method which is based solely on Support Vector Classification. Experiments show that the new method has comparable or better generalization performance than ε-insensitive Support Vector Regression. The tests were performed on synthetic data, on various publicly available regression data sets, and on stock price data. Furthermore, we demonstrate how a priori knowledge which has already been incorporated into Support Vector Classification for predicting indicator functions can be directly used for a regression problem.

Keywords: Support Vector Machines, a priori knowledge.
1 Introduction
One of the main learning problems is regression estimation. Vapnik [6] proposed a new regression method, which is called ε-insensitive Support Vector Regression (ε-SVR). It belongs to a group of methods called Support Vector Machines (SVM). For estimating indicator functions the Support Vector Classification (SVC) method was developed. SVMs were invented on the basis of statistical learning theory. They are efficient learning methods partly because they have the following important properties: they lead to convex optimization problems, they generate sparse solutions, and kernel functions can be used for generating nonlinear solutions.

In this article, we analyze the differences between ε-SVR and SVC. We list some advantages of SVC over ε-SVR: ε-SVR has the additional free parameter ε, and in ε-SVR the minimized term $\|w\|^2$ is responsible for rewarding flat functions, while in SVC the same term has a meaning fully dependent on the training data, since it takes part in finding a maximal margin hyperplane. This is the motivation for the proposed new regression method, which is fully based on SVC. Additionally, the proposed method has an advantage when incorporating a priori knowledge into SVM. Incorporating a priori knowledge into SVM is an important task and has been extensively researched recently [3]. In practice, most a priori knowledge is first incorporated into SVC, and additional effort is needed to introduce the same a priori knowledge into ε-SVR. We show by example that a particular type of a priori knowledge already incorporated into SVC can be directly used for a regression problem by using the proposed method. Recently some attempts were made to combine SVC with ε-SVR [7]. They differ substantially from the proposed method in that the proposed method is a replacement for ε-SVR.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 353–362, 2011. © Springer-Verlag Berlin Heidelberg 2011

1.1 Introduction to ε-SVR and SVC
In regression estimation we consider a set of training samples $a_i$ for $i = 1..n$, where $a_i = (a_i^1, \ldots, a_i^m)$. The $i$-th training sample is mapped to $y_r^i \in \mathbb{R}$, and $m$ is the dimension of the problem. The ε-SVR soft case optimization problem is

OP 1. Minimization of:
\[
f(w_r, b_r, \xi_r, \xi_r^*) = \|w_r\|^2 + C_r \sum_{i=1}^{n} \left( \xi_r^i + \xi_r^{*i} \right)
\]
with constraints: $y_r^i - g(a_i) \le \varepsilon + \xi_r^i$, $g(a_i) - y_r^i \le \varepsilon + \xi_r^{*i}$, $\xi_r \ge 0$, $\xi_r^* \ge 0$ for $i \in \{1..n\}$, where $g(a_i) = w_r \cdot a_i + b_r$.

The function $g^*(x) = w_r^* \cdot x + b_r^*$ is the regression function. Optimization problem 1 is transformed to an equivalent dual optimization problem. The regression function becomes
\[
g^*(x) = \sum_{i=1}^{n} (\alpha_i^* - \beta_i^*) K(a_i, x) + b_r^*,
\tag{1}
\]
where $\alpha_i$, $\beta_i$ are Lagrange multipliers of the dual problem and $K(\cdot, \cdot)$ is a kernel function, which is incorporated into the dual problem. The most popular kernel functions are linear, polynomial, radial basis function (RBF) and sigmoid. A kernel function which is a dot product of its variables is called a simple linear kernel. The $i$-th training sample is a support vector when $\alpha_i^* - \beta_i^* \neq 0$. It can be proved that the set of support vectors contains all training samples which fall outside the tube, and some of the samples which lie on the tube. The conclusion is that the number of support vectors can be controlled by the tube height (the ε).

For indicator function estimation, we consider a set of training samples $a_i$ for $i = 1..n$, where $a_i = (a_i^1, \ldots, a_i^m)$. The $i$-th training sample is mapped to $y_c^i \in \{-1, 1\}$, and $m$ is the dimension of the problem. The SVC 1-norm soft margin case optimization problem is

OP 2. Minimization of:
\[
f(w_c, b_c, \xi_c) = \|w_c\|^2 + C_c \sum_{i=1}^{n} \xi_c^i
\]
with constraints: $y_c^i h(a_i) \ge 1 - \xi_c^i$, $\xi_c \ge 0$ for $i \in \{1..n\}$, where $h(a_i) = w_c \cdot a_i + b_c$.
The equation $h^*(x) = w_c^* \cdot x + b_c^* = 0$ defines the decision curve of the classification problem. Optimization problem 2 is transformed to an equivalent dual optimization problem. The decision curve becomes
\[
h^*(x) = \sum_{i=1}^{n} y_c^i \alpha_i^* K(a_i, x) + b_c^* = 0,
\tag{2}
\]
where $\alpha_i$ are Lagrange multipliers of the dual problem and $K(\cdot, \cdot)$ is a kernel function, which is incorporated into the dual problem. The $i$-th sample is a support vector when $\alpha_i^* \neq 0$. It can be proved that the set of support vectors contains all training samples which fall outside the optimal margin boundaries, and some of the samples which lie on the optimal margin boundaries. Margin boundaries are defined as the two hyperplanes $h(x) = -1$ and $h(x) = 1$. Optimal margin boundaries are defined as the two hyperplanes $h^*(x) = -1$ and $h^*(x) = 1$.

Comparing the number of free parameters of both methods, ε-SVR has the additional parameter ε. One of the motivations for developing a regression method based on SVC is the flatness property of ε-SVR. The minimization term $\|w_r\|^2$ in ε-SVR is related to the following property of a linear function: for two linear functions $g_1(x) = w_1 \cdot x + b_1$ and $g_2(x) = w_2 \cdot x + b_2$, whenever $\|w_2\| < \|w_1\|$ we say that $g_2(x)$ is flatter than $g_1(x)$. Flatter functions are rewarded by ε-SVR. The flatness property of a linear function is not related to the training samples. This differs from SVC, where the minimized term $\|w_c\|^2$ is related to the training samples, because it is used for finding a maximal margin hyperplane.
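The two primal objectives of this subsection can be compared numerically for a fixed linear function, as in the sketch below. It assumes that each slack variable takes its smallest feasible value (the ε-insensitive residual for OP 1, and the standard soft-margin hinge residual max(0, 1 - y h(a)) for OP 2); the function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def svr_objective(w, b, samples, outputs, c, eps):
    """OP 1 value for a fixed (w, b): ||w||^2 + C * sum(xi + xi*)."""
    slack = 0.0
    for a, y in zip(samples, outputs):
        g = dot(w, a) + b
        slack += max(0.0, y - g - eps)  # xi: sample above the tube
        slack += max(0.0, g - y - eps)  # xi*: sample below the tube
    return dot(w, w) + c * slack


def svc_objective(w, b, samples, labels, c):
    """OP 2 value for a fixed (w, b), with labels in {-1, +1}."""
    slack = 0.0
    for a, y in zip(samples, labels):
        h = dot(w, a) + b
        slack += max(0.0, 1.0 - y * h)  # margin violation
    return dot(w, w) + c * slack
```

In both objectives the first term is the same squared norm; only the slack term changes, and in OP 1 it additionally depends on the free parameter ε.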
2 Regression Based on SVC
We consider a set of training samples $a_i$ for $i = 1..n$, where $a_i = (a_i^1, \ldots, a_i^m)$. The $i$-th training sample is mapped to $y_r^i \in \mathbb{R}$. The proposed regression method (SVCR) is based on the following scheme of finding a regression function:

1. Every training sample $a_i$ is duplicated; the output value $y_r^i$ is translated by the value of a parameter $\varphi > 0$ for the original training sample, and by $-\varphi$ for the duplicated training sample.
2. Every training sample $a_i$ is converted to a classification sample by incorporating the output as an additional feature and setting class 1 for original training samples, and class −1 for duplicated training samples.
3. SVC is run with the classification mappings.
4. The solution of SVC is converted to a regression form.

The above procedure is repeated for different values of $\varphi$; for a particular $\varphi$ it is depicted in Fig. 1. The best solution among the various $\varphi$ is found based on the mean squared error (MSE) measure. The result of the first step is a set of training mappings for $i \in \{1, \ldots, 2n\}$:
\[
\begin{aligned}
b_i &= (a_{i,1}, \ldots, a_{i,m}) \to y_i + \varphi && \text{for } i \in \{1, \ldots, n\} \\
b_i &= (a_{i-n,1}, \ldots, a_{i-n,m}) \to y_{i-n} - \varphi && \text{for } i \in \{n+1, \ldots, 2n\}
\end{aligned}
\]
Fig. 1. Left: an example of regression data for the function y = 0.5 sin 10x + 0.5 with Gaussian noise. Right: the regression data translated to classification data; '+' marks the translated original samples and 'x' marks the translated duplicated samples.
for $\varphi > 0$. We call the parameter $\varphi$ the translation parameter. The result of the second step is a set of training mappings for $i \in \{1, \ldots, 2n\}$:
\[
\begin{aligned}
c_i &= (b_{i,1}, \ldots, b_{i,m}, y_i + \varphi) \to 1 && \text{for } i \in \{1, \ldots, n\} \\
c_i &= (b_{i,1}, \ldots, b_{i,m}, y_{i-n} - \varphi) \to -1 && \text{for } i \in \{n+1, \ldots, 2n\}
\end{aligned}
\]
for $\varphi > 0$. The dimension of the $c_i$ samples is equal to $m + 1$. The set of $a_i$ mappings is called a regression data setting, and the set of $c_i$ mappings is called a classification data setting. In the third step, we solve OP 2 with the $c_i$ samples. Note that $h^*(x)$ contains the last coordinate of $x$ in implicit form. In the fourth step, we have to find an explicit form of the last coordinate. The explicit form is needed, for example, for testing new samples. The $w_c$ variable of the primal problem for the simple linear kernel case is found from the solution of the dual problem in the following way:
\[
w_c = \sum_{i=1}^{2n} y_c^i \alpha_i c_i .
\]
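Steps 1 and 2 of the scheme, together with the recovery of $w_c$ from dual coefficients, can be sketched as follows. The function names are hypothetical, and the dual coefficients are taken as given inputs, since the SVC solver itself (step 3) is outside the scope of this sketch.

```python
def to_classification(samples, outputs, phi):
    """Build the classification data setting from a regression data setting.

    Originals get the output shifted by +phi and class 1; duplicates get
    the output shifted by -phi and class -1. The shifted output becomes
    the (m+1)-th feature. Originals come first, duplicates second.
    """
    if phi <= 0:
        raise ValueError("the translation parameter phi must be positive")
    points, labels = [], []
    for a, y in zip(samples, outputs):
        points.append(list(a) + [y + phi])
        labels.append(1)
    for a, y in zip(samples, outputs):
        points.append(list(a) + [y - phi])
        labels.append(-1)
    return points, labels


def primal_weights(points, labels, alphas):
    """w_c = sum_i y_c^i * alpha_i * c_i for a simple linear kernel."""
    dim = len(points[0])
    return [
        sum(y * alpha * c[j] for c, y, alpha in zip(points, labels, alphas))
        for j in range(dim)
    ]
```

For n regression samples of dimension m this yields 2n classification samples of dimension m + 1, which is the input expected by OP 2.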
For a simple linear kernel the explicit form of (2) is
\[
x^{m+1} = \frac{-\sum_{j=1}^{m} w_c^j x^j - b_c}{w_c^{m+1}} .
\]
The regression solution is $g^*(x) = w_r \cdot x + b_r$, where $w_r^i = -w_c^i / w_c^{m+1}$ for $i = 1..m$ and $b_r = -b_c / w_c^{m+1}$. For nonlinear kernels, a conversion to the explicit form has some limitations. First, a decision curve could have more than one value of the last coordinate for specific values of the remaining coordinates of $x$, and thus it cannot be converted unambiguously to a function (e.g. a polynomial kernel with a dimension equal to 2). Second, even when the conversion to a function is possible, there is no explicit analytical formula (e.g. a polynomial kernel with a dimension greater than 4), or it is not easy to find. Therefore, a numerical method for finding the last coordinate should be used, e.g. a bisection method. The disadvantage of this solution is a longer time for testing new samples. To overcome these problems, we propose a new kernel type in which the last coordinate is placed only inside a linear term. A new kernel is constructed from an original kernel by removing the last coordinate and adding a linear term with the last coordinate. For the most popular kernels, polynomial, radial basis function (RBF) and sigmoid, the conversions are respectively
\[
(x \cdot y)^d \;\to\; \Big( \sum_{i=1}^{m} x_i y_i \Big)^d + x_{m+1} y_{m+1},
\tag{3}
\]
\[
\exp\Big( -\frac{\|x - y\|^2}{2\sigma^2} \Big) \;\to\; \exp\Big( -\frac{\sum_{i=1}^{m} (x_i - y_i)^2}{2\sigma^2} \Big) + x_{m+1} y_{m+1},
\tag{4}
\]
\[
\tanh xy \;\to\; \tanh \Big( \sum_{i=1}^{m} x_i y_i \Big) + x_{m+1} y_{m+1} .
\tag{5}
\]
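The conversions (3)-(5) and the linear-kernel conversion to regression form can be sketched as below. Vectors are assumed to be plain sequences whose last element is the (m+1)-th coordinate; all function names are hypothetical.

```python
import math


def _core_dot(x, y):
    """Dot product over the first m coordinates (last one excluded)."""
    return sum(a * b for a, b in zip(x[:-1], y[:-1]))


def poly_kernel_mod(x, y, d):
    """Conversion (3): polynomial part on m coordinates plus a linear term."""
    return _core_dot(x, y) ** d + x[-1] * y[-1]


def rbf_kernel_mod(x, y, sigma):
    """Conversion (4): RBF part on m coordinates plus a linear term."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x[:-1], y[:-1]))
    return math.exp(-sq_dist / (2 * sigma ** 2)) + x[-1] * y[-1]


def sigmoid_kernel_mod(x, y):
    """Conversion (5): tanh part on m coordinates plus a linear term."""
    return math.tanh(_core_dot(x, y)) + x[-1] * y[-1]


def linear_to_regression(w_c, b_c):
    """Linear kernel case: w_r^i = -w_c^i / w_c^{m+1}, b_r = -b_c / w_c^{m+1}."""
    w_last = w_c[-1]
    if w_last == 0:
        raise ValueError("the decision curve does not depend on the output")
    return [-w / w_last for w in w_c[:-1]], -b_c / w_last
```

Because the last coordinate enters every modified kernel only through the product `x[-1] * y[-1]`, the decision curve stays linear in that coordinate and can be solved for it directly, without a bisection search.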
The proposed method of constructing new kernels always generates a function fulfilling Mercer's condition, because it generates a function which is a sum of two kernels. For the new kernel type, the explicit form of (2) is
\[
x^{m+1} = \frac{-\sum_{i=1}^{2n} y_c^i \alpha_i K_r\left(c_r^i, x_r\right) - b_c}{\sum_{i=1}^{2n} y_c^i \alpha_i c_i^{m+1}},
\]
where $c_r^i = (c_i^1, \ldots, c_i^m)$ and $x_r = (x^1, \ldots, x^m)$.

2.1 Support Vectors
SVCR runs the SVC method on a doubled number of samples. Thus, the maximal number of SVC support vectors is 2n. The SVCR algorithm is constructed in such a way that, while searching for the best value of $\varphi$, the cases for which the number of SVC support vectors is bigger than n are omitted. We prove that in this case the set of SVC support vectors does not contain any two training samples where one of them is a duplicate of the other. Therefore, the set of SVC support vectors is a subset of the $a_i$ set of training samples. We call a margin boundaries vector or an outside margin boundaries vector an essential margin vector, and the set of such vectors EMV.

Theorem 1. If the $a_i$ samples are not collinear and $|EMV| \le n$, then EMV does not contain duplicates.
Proof (Proof sketch). Let us assume that EMV contains a duplicate $a_t'$ of a sample $a_t$. Let $p(\cdot) = 0$ be a hyperplane parallel to the margin boundaries and containing $a_t$. Then the set of EMV samples for which $p(\cdot) \ge 0$ has $r$ elements, where $r \ge 1$. Let $p'(\cdot) = 0$ be a hyperplane parallel to the margin boundaries and containing $a_t'$. Then the set of EMV samples for which $p'(\cdot) \le 0$ has at least $n - r + 1$ elements. So $|EMV| \ge n + 1$, which contradicts the assumption.

For the nonlinear case the same theorem applies in the induced kernel feature space. It can be proved that the set of support vectors is a subset of EMV. Therefore the same theorem applies to the set of support vectors. Experiments show that it is rare for the set of support vectors to have more than n elements for every value of $\varphi$ checked by SVCR. In such a situation the best solution among those violating the constraint is chosen.

Here we consider how changes in the value of $\varphi$ influence the number of support vectors. First, we can see that for $\varphi = 0$, $n \le |EMV| \le 2n$. When both classes are separable for a particular value of $\varphi$, then $0 \le |EMV| \le 2n$. By a configuration of essential margin vectors we mean a list of essential margin vectors, each with its distance to one of the margin boundaries.

Theorem 2. For two values of $\varphi$, $\varphi_1 > 0$ and $\varphi_2 > 0$, where $\varphi_1 > \varphi_2$, for every margin boundaries for $\varphi_2$ there exist margin boundaries for $\varphi_1$ with the same configuration of essential margin vectors.

Proof (Proof sketch). Let us consider the EMV for $\varphi_2$ with particular margin boundaries. When increasing the value of $\varphi$ by $\varphi_1 - \varphi_2$, in order to preserve the same configuration of essential margin vectors we extend the margin bounded region by $\varphi_1 - \varphi_2$ on both sides.

When the value of $\varphi$ is increased, new sets of essential margin vectors arise, and all sets present for the lower values of $\varphi$ remain.
When both classes become separable by a hyperplane, further increasing the value of $\varphi$ does not change the collection of sets of essential margin vectors. The above suggests that increasing the value of $\varphi$ leads to solutions with a smaller number of support vectors.

2.2 Comparison with ε-SVR
Both methods have the same number of free parameters. For ε-SVR: C, the kernel parameters, and ε. For SVCR: C, the kernel parameters and $\varphi$. When using a particular kernel function for ε-SVR and a related kernel function for SVCR, both methods have the same hypothesis space. Both parameters ε and $\varphi$ control the number of support vectors. There is a slight difference between these two methods when we compare configurations of essential margin vectors. For the case of ε-SVR, we define the margin boundaries as the lower and upper tube boundaries. Among the various values of ε, every configuration of essential margin vectors is unique. In SVCR, based on Thm. 2, configurations of essential margin vectors are repeated as the value of $\varphi$ increases. This suggests that for particular values of $\varphi$ and ε the set of configurations of essential margin vectors is richer for SVCR than for ε-SVR.
3 Experiments
First, we compare the performance of SVCR and ε-SVR on synthetic data and on publicly available regression data sets. Second, we show that by using SVCR, a priori knowledge in the form of detractors, introduced in [5] for classification problems, can be applied to regression problems. For the first part, we use the LibSVM [1] implementation of ε-SVR and we use LibSVM for solving the SVC problems in SVCR. We use a version ported to Java. For the second part, we use the author's implementation of SVC with detractors. For all data sets, every feature is scaled linearly to [0, 1], including the output. For variable parameters such as C, σ for the RBF kernel, $\varphi$ for SVCR, and ε for ε-SVR, we use a grid search method for finding the best values. The number of values searched by the grid method is a trade-off between the accuracy and the speed of simulations. Note that for particular data sets, it is possible to use more accurate grid searches than for massive tests with multiple simulations. The preliminary tests confirm that as $\varphi$ is increased, the number of support vectors decreases.

3.1 Synthetic Data Tests
We compare the SVCR and ε-SVR methods on data generated from particular functions with Gaussian noise added to the output values. We perform tests with a linear kernel on linear functions, with a polynomial kernel on the polynomial function, and with the RBF kernel on the sine function. The tests and their results are presented in Tab. 1. Training performance is generally slightly worse for SVCR. The reason is that ε-SVR directly minimizes the MSE. The generalization performance of SVCR is fairly good and slightly better than that of ε-SVR. For linear kernels, SVCR uses fewer support vectors. For the RBF kernel, SVCR is slightly worse.
3.2 Real World Data Sets
The real world data sets were taken from the LibSVM site [1] [4], except for the stock price data. The stock price data consist of monthly prices of the DJIA index from 1898 up to 2010. We generated the sample data set as follows: for every month, the output value is the growth/fall relative to the next month. Every feature i is the percentage price change between the month and the i-th previous month. In every simulation, training data are randomly chosen and the remaining samples become test data. The tests and their results are presented in Tab. 2. For linear kernels, the tests show better generalization performance of the SVCR method. The performance gain on testing data ranges from 0 to 2%. For the polynomial kernel, we can notice better generalization performance of SVCR (a performance gain from 68 to 80%). The number of support vectors is comparable for both methods. For the RBF kernel, the results strongly depend on the data: for two test cases SVCR has better generalization performance (10%). Generally, the tests show that the new method SVCR has good generalization performance on the synthetic and real world data sets used in the experiments, and it is often better than that of ε-SVR.

Table 1. Description of test cases with results for synthetic data. Column descriptions: function – a function used for generating data, $y_1 = \sum_{i=1}^{dim} x_i$, $y_4 = \left(\sum_{i=1}^{dim} x_i\right)^{kerP}$, $y_5 = 0.5 \sum_{i=1}^{dim} \sin 10x_i + 0.5$; simC – a number of simulations, results are averaged; σ – a standard deviation used for generating noise in the output; ker – a kernel (pol – a polynomial kernel); kerP – a kernel parameter (for a polynomial kernel it is a dimension, for the RBF kernel it is σ); trs – a training set size; tes – a testing set size; dim – a dimension of the problem; tr12M – a percentage average difference in MSE for training data, if greater than 0 then SVCR is better; te12M – the same as tr12M, but for testing data; tr12MC – a percentage average difference in the number of tests for training data in which SVCR is better (SVCR is better when a value is greater than 50%); te12MC – the same as tr12MC, but for testing data; s1 – an average number of support vectors for ε-SVR; s2 – an average number of support vectors for SVCR. The value 'var' means that we search for the best value comparing the training data MSE.

function     simC  σ     ker  kerP  trs  tes  dim  tr12M   te12M   tr12MC  te12MC  s1  s2
y1           100   0.04  lin  –     90   300  4    0%      0.5%    20%     58%     50  46
y2 = 3y1     100   0.04  lin  –     90   300  4    −0.4%   −0.4%   10%     40%     74  49
y3 = 1/3y1   100   0.04  lin  –     90   300  4    0%      1%      50%     80%     50  40
y4           100   0.04  pol  3     90   300  4    −2%     10%     2%      80%     61  61
y5           20    0.04  rbf  var   90   300  4    −500%   −10%    30%     20%     90  90

3.3 Incorporating a Priori Knowledge in the Form of Detractors to SVCR
In [5], the concept of detractors was proposed for the classification case. Detractors were used for incorporating a priori knowledge in the form of a lower bound (a detractor parameter b) on the distance from a particular point (called a detractor point) to the decision surface. We show that the concept of detractors can be used directly in the regression case through the SVCR method. We define a detractor for the SVCR method as a point with a parameter d and a side (1 or −1). We modify the SVCR method in the following way: the detractor is added to the training data set and transformed to the classification data setting so that when the side is 1: d = b + ϕ, and for the duplicate d = 0; when the side is −1: d = 0, and for the duplicate d = b − ϕ. The primary application of detractors was to model a decision function (i.e. to move it far away from a detractor). A synthetic test shows that detractors can indeed be used for modeling a regression function. In Fig. 2, we can see that adding a detractor moves the regression function away from the detractor.
Table 2. Description of test cases with results for real world data. Column descriptions: name – a name of the test; simT – a number of random simulations, where training data are randomly selected, results are averaged; ker – a kernel (pol – a polynomial kernel); kerP – a kernel parameter (for a polynomial kernel it is a dimension, for the RBF kernel it is σ); trs – a training set size; all – a number of all data, it is a sum of training and testing data; dim – a dimension of the problem; tr12M – a percentage average difference in MSE for training data, if greater than 0 then SVCR is better; te12M – the same as tr12M, but for testing data; tr12MC – a percentage average difference in the number of tests for training data in which SVCR is better (SVCR is better when a value is greater than 50%); te12MC – the same as tr12MC, but for testing data; s1 – an average number of support vectors for ε-SVR; s2 – an average number of support vectors for SVCR. The value 'var' means that we search for the best value comparing the training data MSE.

name      simT  ker  kerP  trs  all   dim  tr12M    te12M  tr12MC  te12MC  s1  s2
abalone1  100   lin  –     90   4177  8    −0.2%    2%     20%     70%     35  38
abalone2  100   pol  5     90   4177  8    −90%     80%    0%      100%    78  73
abalone3  20    rbf  var   90   4177  8    70%      10%    90%     65%     90  90
caData1   100   lin  –     90   4424  4    −1.5%    2%     1%      55%     41  44
caData2   100   pol  5     90   4424  2    −105%    68%    0%      100%    79  75
caData3   20    rbf  var   90   4424  2    −25%     10%    50%     50%     90  90
stock1    100   lin  –     90   1351  4    0%       0%     40%     55%     35  32
stock2    100   pol  5     90   1351  2    −4500%   78%    0%      100%    90  87
stock3    20    rbf  var   90   1351  2    76%      −6%    100%    25%     90  90
Fig. 2. Left: the best SVCR translation for particular regression data. Right: the best SVCR translation for the same data, but with a detractor at the point (0.2, 0.1) and d = 10.0. We can see that the detractor moves the regression function far away from it. Note that the best translation parameter ϕ differs between the two cases.
4 Conclusions
The SVCR method is an alternative to ε-SVR. We focus on two advantages of the new method. First, based on the conducted experiments, the generalization performance of SVCR is comparable to or better than that of ε-SVR. Second, we show, on the example of a priori knowledge in the form of detractors, that a priori knowledge already incorporated into SVC can be used for a regression problem solved by SVCR. In such a case, we do not have to analyze and implement the incorporation of a priori knowledge into other regression methods (e.g. into ε-SVR). Further analysis of SVCR will concentrate on analysing and comparing the generalization performance of the proposed method in the framework of statistical learning theory. Just before submitting this paper, we found a very similar idea in [2]. However, the authors solve an additional optimization problem in the testing phase to find a root of a nonlinear equation. Therefore two problems arise: multiple solutions and lack of a solution. Instead, we propose a special type of kernels (3)(4)(5), which overcome these difficulties. In [2], the authors claim that by modifying the ϕ parameter for every sample so that the samples with lower and upper values of yi have smaller values of ϕ than the middle ones, a solution with fewer support vectors can be obtained. However, this modification makes it necessary to tune the value of an additional parameter during the training phase.

Acknowledgments. The research is financed by the Polish Ministry of Science and Higher Education project No NN519579338. I would like to express my sincere gratitude to Professor Witold Dzwinel (AGH University of Science and Technology, Department of Computer Science) for contributing ideas, discussion and useful suggestions.
References
1. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001), software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
2. Fuming Lin, J.G.: A novel support vector machine algorithm for solving nonlinear regression problems based on symmetrical points. In: Proceedings of the 2010 2nd International Conference on Computer Engineering and Technology, ICCET (2010)
3. Lauer, F., Bloch, G.: Incorporating prior knowledge in support vector machines for classification: A review. Neurocomput. 71(7-9), 1578–1594 (2008)
4. Libsvm data sets, http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
5. Orchel, M.: Incorporating detractors into svm classification. In: Kacprzyk, P.J. (ed.) Man-Machine Interactions; Advances in Intelligent and Soft Computing, pp. 361–369. Springer, Heidelberg (2009)
6. Vapnik, V.N.: Statistical Learning Theory. Wiley Interscience, Hoboken (1998)
7. Wu, C.A., Liu, H.B.: An improved support vector regression based on classification. In: Proceedings of the 2007 International Conference on Multimedia and Ubiquitous Engineering, MUE 2007, pp. 999–1003. IEEE Computer Society, Washington, DC (2007)
Two One-Pass Algorithms for Data Stream Classification Using Approximate MEBs
Ricardo Ñanculef (1), Héctor Allende (1), Stefano Lodi (2), and Claudio Sartori (2)
(1) Dept. of Informatics, Federico Santa María University, Chile {hallende,jnancu}@inf.utfsm.cl
(2) Dept. of Electronics, Computer Science and Systems, University of Bologna, Italy {claudio.sartori,stefano.lodi}@unibo.it
Abstract. It has been recently shown that the quadratic programming formulation underlying a number of kernel methods can be treated as a minimal enclosing ball (MEB) problem in a feature space where data has been previously embedded. Core Vector Machines (CVMs) in particular, make use of this equivalence in order to compute Support Vector Machines (SVMs) from very large datasets in the batch scenario. In this paper we study two algorithms for online classification which extend this family of algorithms to deal with large data streams. Both algorithms use analytical rules to adjust the model extracted from the stream instead of recomputing the entire solution on the augmented dataset. We show that these algorithms are more accurate than the current extension of CVMs to handle data streams using an analytical rule instead of solving large quadratic programs. Experiments also show that the online approaches are considerably more efficient than periodic computation of CVMs even though warm start is being used.

Keywords: Data stream mining, Online learning, Kernel methods, Minimal enclosing balls.
1 Introduction
Datasets which continuously grow over time are referred to as data streams. Data mining operations such as classification, clustering and frequent pattern mining are considerably more challenging in data streams applications because frequently the volume of data is too large to be stored on disk or to be analyzed using multiple scans [1]. Approximate solutions to standard data mining problems can thus be reasonable alternatives if they provide a near-optimal answer in a timely and computationally efficient manner. In this paper we focus on the problem of online approximation of SVM classifiers from data streams using a single pass over the data. In contrast to batch algorithms where data is supposed to be available all in advance and allowed to be used as many times as desired along the model computation process, online learning takes place in a sequence of consecutive rounds
This work was supported by Research Grants 1110854 Fondecyt and Basal FB0821, “Centro Científico-Tecnológico de Valparaíso”, UTFSM.
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 363–372, 2011. © Springer-Verlag Berlin Heidelberg 2011
in which the learner observes a new example, provides a prediction, receives feedback about the correct outcome and finally has the chance to update its prediction mechanism in order to make better predictions on subsequent rounds [14]. One-pass methods, in addition, process new data items at most once [1]. Online learners avoiding multiple thorough passes are highly desired in real-time applications in which the model extracted from data needs to be frequently adjusted to achieve more accurate results. These algorithms are expected to exhibit restricted memory requirements as well as fast prediction and model-computation times, and thus can also be used to deal with very large datasets effectively. Our method is based on the equivalence between a class of SVM classifiers (L2-SVMs) and the problem of computing the minimal enclosing ball (MEB) of a set of points in a dot-product space. This equivalence, originally presented in [17] for the construction of the so-called Core Vector Machines (CVMs), has motivated several approaches to speed up kernel methods on large datasets [18] [15] [3] [10]. To our knowledge, however, only [12] has previously examined the use of this equivalence for the design of single-pass online classifiers. This method in turn is based on a method to estimate the MEB of a streaming sequence proposed in [21]. Although [19] has also recently addressed the computation of CVMs from data streams, that method is based on the periodic resolution of large quadratic programs, which mostly requires several passes through the dataset (see [11] for a survey on methods for SVM computation). Like the method presented in [12], our method keeps track of a ball which reasonably approximates the MEB corresponding to the sequence of examples observed up to a given round. We study two novel analytical rules to adjust such a ball from newly arriving observations.
Our simulations on two medium scale and three large scale classification datasets show that the obtained classifiers are more accurate than the ones proposed in [12] to handle data streams. Experiments also show that all the single-pass approaches studied in this paper are considerably more efficient than periodic computation of CVMs even though warm start is being used.
2 Pattern Classification Using MEBs
Given a set of items $S_x = \{x_k : k \in I := \{0, 1, \ldots, T\}\}$ associated with an outcome sequence $\{y_k : k \in I\}$, a typical machine learning task consists in designing a prediction mechanism $h(x)$, termed a hypothesis, capable of mapping an input to a given outcome. In pattern classification this outcome represents a category or class that needs to be associated with a given item. In binary classification $y_k \in \{+1, -1\}$, $x_k \in \mathcal{X} \subset \mathbb{R}^N$ and $h : \mathcal{X} \to \{+1, -1\}$.
2.1 Kernel Methods
Kernel methods model the prediction mechanism $h$ using functions from the space of linear classifiers, that is, the predictions are computed using only dot products in a feature space. Since in realistic problems the configuration of the data can be highly non-linear, kernel methods build the linear model not in the original space $\mathcal{X}$ of data but in a high-dimensional dot-product space $\mathcal{Z} = \mathrm{Lin}(\phi(\mathcal{X}))$ named the feature space, where the decision function can be linearly represented [8]. The feature space is related to the data space $\mathcal{X}$ by means of a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ called the kernel, which computes the dot products $z_i^T z_j$ in $\mathcal{Z}$ directly from the points in the input space as $k(x_i, x_j)$, avoiding the explicit computation of the mapping $\phi$ [13]. In this paper we will use $z$ to refer to a generic element of the feature space $\mathcal{Z}$ obtained as the image of an observation $x$ under the mapping $\phi$. In the feature space $\mathcal{Z}$, the classification hypothesis takes the form $h(z) = \mathrm{sgn}(f(z))$, where the discriminant function $f(z)$ is represented as a separating hyperplane $f(z) = w^T \phi(x) + b$ defined by means of a normal vector $w \in \mathcal{Z}$ and a position parameter $b \in \mathbb{R}$. The vector $w$ is by construction [13,8] equivalent to a superposition of featured items
$$w = \sum_i \lambda_i \phi(x_i), \tag{1}$$
such that the prediction mechanism can be implemented using only the kernel and the original data items:
$$h(x) = \mathrm{sgn}\left(w^T \phi(x) + b\right) = \mathrm{sgn}\Big(\sum_i \lambda_i k(x_i, x) + b\Big). \tag{2}$$
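As a reading aid, the kernel form of the predictor can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the Gaussian kernel width `sigma2`, the function names and the toy coefficients are all assumptions, and the coefficients $\lambda_i$ are taken to be those of expansion (1), so any label sign is assumed to be already absorbed into them.

```python
import math

def rbf_kernel(x1, x2, sigma2=1.0):
    """Gaussian kernel exp(-||x1 - x2||^2 / sigma2); sigma2 is an illustrative width."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-sq_dist / sigma2)

def discriminant(x, support, lambdas, b, kernel=rbf_kernel):
    """f(x) = sum_i lambda_i k(x_i, x) + b, i.e. w^T phi(x) + b with w as in (1)."""
    return sum(lam * kernel(xi, x) for lam, xi in zip(lambdas, support)) + b

def predict(x, support, lambdas, b, kernel=rbf_kernel):
    """h(x) = sgn(f(x)); a tie at f(x) = 0 is mapped to +1 by convention here."""
    return 1 if discriminant(x, support, lambdas, b, kernel) >= 0 else -1

# Toy model (hypothetical values): one negative and one positive expansion point.
support = [(0.0, 0.0), (1.0, 1.0)]
lambdas = [-1.0, 1.0]
b = 0.0
print(predict((0.9, 0.8), support, lambdas, b))
print(predict((0.1, 0.2), support, lambdas, b))
```

Only kernel values between the query and the stored items are needed, so the feature map $\phi$ never has to be evaluated explicitly.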
The weights $\lambda_i$ which determine the prediction mechanism (2) are defined as the solution to an optimization problem which incorporates in the objective a measure of error or loss $l(\hat{y}_k, y_k)$ on the dataset and other theoretically-founded criteria such as the sparseness of the solution, which determines the memory required to store the model. Note that for a hypothesis of the form of (2), $y f(x) > 0$ if and only if the decision of the classifier and the true outcome coincide. The margin $y f(x)$ thus provides a measure of confidence of the prediction. Kernel methods are usually built by using the soft-margin loss [9]:
$$l_\rho(h(x), y) = \max\left(0, \rho - y f(x)\right). \tag{3}$$
2.2 L2-SVM Classification
The model of classification we will consider in the rest of this paper is that of L2-SVM classification [17]. In L2-SVM classification, the optimal separating hyperplane $(w, b)$ for the dataset $S$ is obtained as the solution to the problem
$$\min_{w, b, \rho, \xi} : \tfrac{1}{2}\Big(\|w\|^2 + b^2 + C \cdot \sum_i \xi_i^2\Big) - \rho \tag{4}$$
$$\text{st: } y_i f(z_i) \ge \rho - \xi_i \quad \forall i \in I,$$
which aims to simultaneously maximize the largest margin $l_\rho(h(x), y)$ attained on the dataset and the margin parameter $\rho$ of the soft-margin loss function (3).
The parameter $C$ is a regularization parameter used by the model to handle noisy data [9]. It can be shown that the Lagrange dual of this problem is
$$\max_{\alpha} : -\sum_{i,j \in I} \alpha_i \alpha_j \left( y_i y_j (k(x_i, x_j) + 1) + \delta(i,j)/C \right) \tag{5}$$
$$\text{st: } \sum_{i \in I} \alpha_i = 1, \quad \alpha_i \ge 0 \;\; \forall i \in I.$$
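The step from the primal problem to its dual can be sketched as follows. This is a reading aid under stated assumptions rather than the authors' own derivation: it takes the primal objective to be $\tfrac{1}{2}(\|w\|^2 + b^2 + C\sum_i \xi_i^2) - \rho$, writes $f(z_i) = w^T z_i + b$ with $z_i = \phi(x_i)$, and attaches a multiplier $\alpha_i \ge 0$ to each margin constraint.

```latex
\begin{align*}
L &= \tfrac{1}{2}\Big(\|w\|^2 + b^2 + C\sum_i \xi_i^2\Big) - \rho
     - \sum_i \alpha_i \big( y_i (w^T z_i + b) - \rho + \xi_i \big), \\
\partial_w L = 0 &\;\Rightarrow\; w = \sum_i \alpha_i y_i z_i, \qquad
\partial_b L = 0 \;\Rightarrow\; b = \sum_i \alpha_i y_i, \\
\partial_{\xi_i} L = 0 &\;\Rightarrow\; \xi_i = \alpha_i / C, \qquad
\partial_\rho L = 0 \;\Rightarrow\; \sum_i \alpha_i = 1, \\
L\big|_{\text{stationary}} &= -\tfrac{1}{2}\Big(\|w\|^2 + b^2 + \tfrac{1}{C}\sum_i \alpha_i^2\Big)
  = -\tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j
    \big( y_i y_j (k(x_i, x_j) + 1) + \delta(i,j)/C \big).
\end{align*}
```

The positive factor $\tfrac{1}{2}$ does not change the maximizer, so maximizing this expression over the simplex $\sum_i \alpha_i = 1$, $\alpha_i \ge 0$ is the same problem as the dual stated above.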
From strong duality it can be shown that the parameters $(w, b)$ for the L2-SVM model are
$$w = \sum_{i \in I} y_i \alpha_i \phi(x_i), \qquad b = \sum_{i \in I} y_i \alpha_i, \tag{6}$$
and the parameters $\lambda_i$ of expansion (1) are given by $\lambda_i = y_i \alpha_i$. The margin parameter is additionally given by $\rho = \sum_{i,j \in I} \alpha_i \alpha_j \left( y_i y_j k(x_i, x_j) + y_i y_j + \delta(i,j)/C \right)$.
2.3 Minimal Enclosing Balls (MEBs)
As shown first by [17] and then generalized in [18], several kernel methods can be formulated as the problem of computing the minimal enclosing ball (MEB) of a set of feature points $D = \{z_i : i \in I\}$ in a dot-product space $\mathcal{Z}$. The MEB of $D$, denoted by $B_S(c^*, r^*)$, is defined as the smallest ball in $\mathcal{Z}$ containing $D$. As shown in [20], the Lagrange dual of the quadratic programming formulation of this problem is
$$\max_{\alpha} : \Phi(\alpha) := \sum_{i \in I} \alpha_i z_i^T z_i - \sum_{i,j \in I} \alpha_i \alpha_j z_i^T z_j \tag{7}$$
$$\text{st: } \sum_{i \in I} \alpha_i = 1, \quad \alpha_i \ge 0 \;\; \forall i \in I.$$
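Suppose now that the set $D$ corresponds to the image of a set $S = \{\tilde{x}_k = (x_k, y_k) : k \in I := \{1, 2, \ldots, T\}\}$ under a mapping $\varphi : \mathcal{X} \times \mathcal{Y} \to \mathcal{Z}$, such that $z_i^T z_j = \varphi(\tilde{x}_i)^T \varphi(\tilde{x}_j) = k_\varphi(\tilde{x}_i, \tilde{x}_j)$ $\forall i, j \in I$ for a given kernel function $k_\varphi$ defined on the pairs $\tilde{x} := (x, y)$. Problem (7) now reads
$$\max_{\alpha} : \Phi(\alpha) := \sum_{i \in I} \alpha_i k_\varphi(\tilde{x}_i, \tilde{x}_i) - \sum_{i,j \in I} \alpha_i \alpha_j k_\varphi(\tilde{x}_i, \tilde{x}_j) \tag{8}$$
$$\text{st: } \sum_{i \in I} \alpha_i = 1, \quad \alpha_i \ge 0 \;\; \forall i \in I,$$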
which only differs from the L2-SVM problem by the linear term $\sum_{i \in I} \alpha_i k_\varphi(\tilde{x}_i, \tilde{x}_i)$. If the kernel $k$ used in L2-SVM classification satisfies the normalization condition (see footnote 1) $k(x_i, x_i) = \Delta^2 = \text{constant}$, it follows that $k_\varphi(\tilde{x}_i, \tilde{x}_i) = \Delta^2 + 1 + \frac{1}{C} := \Delta_\varphi^2$ is also constant. Since $\sum_{i \in I} \alpha_i = 1$, the first term of the objective function (7) becomes a constant and the problem of finding the optimal classifier becomes equivalent to the problem of computing a MEB by setting the kernel $k_\varphi$ in (7) to
$$k_\varphi(\tilde{x}_i, \tilde{x}_j) = y_i y_j \left(k(x_i, x_j) + 1\right) + \delta(i,j)/C. \tag{9}$$
Footnote 1: Note that this condition is straightforward for kernels of the form $k(x_i, x_j) = g(x_i - x_j)$, such as the RBF kernel which is commonly used in practice. See [18] for constructions which do not require the normalization condition.
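The labelled kernel (9) and the normalization condition can be checked numerically on a toy example. The following is a minimal sketch, assuming a Gaussian base kernel (for which $k(x, x) = 1$, so $\Delta^2 = 1$); the data, the value of `C` and the function names are illustrative only.

```python
import math

def rbf_kernel(x1, x2, sigma2=1.0):
    """Gaussian base kernel; k(x, x) = 1 for every x, so Delta^2 = 1."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-sq_dist / sigma2)

def l2svm_kernel(xi, yi, xj, yj, same_index, C, kernel=rbf_kernel):
    """k_phi of eq. (9): y_i y_j (k(x_i, x_j) + 1) + delta(i, j) / C.

    same_index encodes the Kronecker delta on the item indices (not on the
    values), so two distinct items with identical inputs get no 1/C term.
    """
    return yi * yj * (kernel(xi, xj) + 1.0) + (1.0 / C if same_index else 0.0)

# Toy labelled sample (hypothetical values).
X = [(0.0, 0.0), (1.0, 0.5), (0.2, 0.9)]
y = [+1, -1, +1]
C = 10.0
n = len(X)
K = [[l2svm_kernel(X[i], y[i], X[j], y[j], i == j, C) for j in range(n)]
     for i in range(n)]

# Every diagonal entry equals Delta^2 + 1 + 1/C.
print([K[i][i] for i in range(n)])
```

With a constant diagonal, every transformed item lies on a sphere of squared radius $\Delta_\varphi^2$ around the origin of the feature space, which is what makes the linear term of the MEB objective constant.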
3 Classification of Data Streams Using MEBs
Online learners are mechanisms designed to learn continuously from a stream of data which can neither be predicted in advance nor completely stored before the learning process starts [4,5,9]. This data stream can be modeled as a sequence of input observations $\{x_k : k \in I\}$ indexed on $I = \{0, 1, \ldots, T\}$ and associated with an outcome sequence $\{y_k : k \in I\}$ which the learner aims to predict. In contrast to the batch model presented previously, online learning takes place in a sequence of rounds. On each round, the learner observes an example $x_k$ and makes a prediction $\hat{y}_k = h_{k-1}(x_k)$ using the current hypothesis $h_{k-1}$. The learner then has the chance of updating the current hypothesis by using information about the correct outcome $y_k$, usually presented in the form of a loss $l(\hat{y}_k, y_k)$. An online kernel classifier hence generates a sequence of decision functions $\{h_k\}$ of parameters $\{w_k, b_k\}$ which are updated according to the loss $l_k$ suffered by the algorithm at each round. Since the goal of an online learner is to make accurate predictions of the newly arriving inputs, online learners are typically designed to minimize the cumulated hinge-loss $L_c(\{h_k\}, S) = \sum_k l_{\rho_{k-1}}(h_{k-1}(x_k), y_k)$ [9,4] along the sequence of observations, where $\{h_k\}$ denotes the sequence of hypotheses generated by the algorithm, $\rho_{k-1}$ is the margin parameter used by the algorithm before observing $x_k$, and $l_\rho$ is defined in equation (3). Note that in this framework the loss of the algorithm is computed before the information about the correct outcome is revealed to the learner.
3.1 General Structure of the Method
Let $I_k = \{0, 1, \ldots, k\}$ and $S_k = \{\tilde{x}_i = (x_i, y_i) : i \in I_k\}$ be the subset of items revealed to the learner up to a round $k$ and $\varphi(S_k) = \{z_i : i \in I_k\}$ the corresponding image of $S_k$ under the mapping induced by the L2-SVM kernel defined in equation (9). A naive approach to maintaining a classifier from the data stream would be to periodically compute the L2-SVM on $S_k$ or (equivalently) the MEB of $\varphi(S_k)$ by solving
$$\max_{\alpha} : \Phi(\alpha) := \sum_{i \in I_k} \alpha_i k_\varphi(\tilde{x}_i, \tilde{x}_i) - \sum_{i,j \in I_k} \alpha_i \alpha_j k_\varphi(\tilde{x}_i, \tilde{x}_j) \tag{10}$$
$$\text{st: } \sum_{i \in I_k} \alpha_i = 1, \quad \alpha_i \ge 0 \;\; \forall i \in I_k.$$
This approach, however, requires the full storage of the stream and several passes through the data stream on the augmented dataset when new observations become available. The basic idea is hence to provide an efficient mechanism to approximate the MEB $\{B(c_k^*, r_k^*)\}$ of $\varphi(S_k)$ and recover a classifier from the sequence of approximating balls $\{B_k\}$. Denote by $\alpha_k^* \in \mathbb{R}^{k+1}$ the solution of (10) and by $\alpha_{k,i}^*$ one of its coordinates. As shown in [20], the primal variables $c_k^*$ and $r_k^{*2}$ are hence given by $c_k^* = \sum_{i \in I_k} \alpha_{k,i}^* z_i$ and $r_k^{*2} = \Phi(\alpha_k^*)$. Given an approximation $\alpha_k$ to $\alpha_k^* \in \mathbb{R}^{k+1}$ we can thus define the approximating ball $B_k = B(c_k, r_k)$ at round $k$ by setting
$$c_k = \sum_{i \in I_k} \alpha_{k,i} z_i = \sum_{i \in I_k} \alpha_{k,i} \varphi(\tilde{x}_i), \qquad r_k^2 = \Phi(\alpha_k), \tag{11}$$
and the corresponding SVM classifier using equation (6).

It should be noted that if at a given round $k$, $z_k$ is already contained in the MEB of $\varphi(S_{k-1})$, the current MEB is optimal, that is $c_k^* = c_{k-1}^*$ and $r_k^* = r_{k-1}^*$. We could hence implement the following test in order to decide when the current approximation $B_{k-1}$ needs to be updated: if $z_k \in B_{k-1}$ we set $B_k = B_{k-1}$, otherwise we take a step to improve the current approximation. However, following recent advances in algorithms to compute MEBs, we build our approximations under the concept of $(1 + \epsilon)$-MEB [20,3] and initiate an update if and only if $z_k \notin B(c_{k-1}, (1 + \epsilon) r_{k-1})$ for some predefined $\epsilon > 0$. Algorithm (1) summarizes the procedure. Note that the approximating ball is initialized as the true MEB of a small subset of $s + 1$ observations. If $s = 1$, this MEB can be easily computed by setting $c_1 = \frac{1}{2} z_0 + \frac{1}{2} z_1$ and $r_1^2 = \frac{1}{4} \|z_0 - z_1\|^2$.
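Because the centre in (11) is a weighted sum of feature vectors, the ball can be handled through kernel values alone. The sketch below illustrates this under stated assumptions: the coefficient vector `alpha` and the stored `items` are kept explicitly, and a plain dot product on explicit vectors stands in for the kernel so that the numbers can be checked by hand. All function names are illustrative.

```python
def dot(u, v):
    """Plain dot product, standing in for any kernel on explicit vectors."""
    return sum(a * b for a, b in zip(u, v))

def center_sq_norm(alpha, items, kernel):
    """||c||^2 = sum_{i,j} alpha_i alpha_j k(z_i, z_j) for c = sum_i alpha_i z_i."""
    return sum(ai * aj * kernel(zi, zj)
               for ai, zi in zip(alpha, items)
               for aj, zj in zip(alpha, items))

def radius_sq(alpha, items, kernel):
    """r^2 = Phi(alpha) = sum_i alpha_i k(z_i, z_i) - ||c||^2."""
    linear = sum(a * kernel(z, z) for a, z in zip(alpha, items))
    return linear - center_sq_norm(alpha, items, kernel)

def dist_sq_to_center(z, alpha, items, kernel):
    """||z - c||^2 = k(z, z) - 2 sum_i alpha_i k(z_i, z) + ||c||^2."""
    cross = sum(a * kernel(zi, z) for a, zi in zip(alpha, items))
    return kernel(z, z) - 2.0 * cross + center_sq_norm(alpha, items, kernel)

def needs_update(z, alpha, items, kernel, eps):
    """True when ||z - c||^2 >= (1 + eps)^2 r^2, i.e. z falls outside the enlarged ball."""
    return (dist_sq_to_center(z, alpha, items, kernel)
            >= (1.0 + eps) ** 2 * radius_sq(alpha, items, kernel))

# s = 1 initialisation: alpha = (1/2, 1/2) gives the midpoint ball of z_0 and z_1.
items = [(0.0, 0.0), (2.0, 0.0)]
alpha = [0.5, 0.5]
print(radius_sq(alpha, items, dot))                    # equals ||z_0 - z_1||^2 / 4
print(needs_update((1.0, 3.0), alpha, items, dot, 0.1))
print(needs_update((1.0, 0.5), alpha, items, dot, 0.1))
```

Recomputing the double sum for $\|c\|^2$ on every call is wasteful; a practical implementation would presumably cache it and refresh it only when the coefficients change.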
Data: A stream $\{z_0, z_1, \ldots\}$ of featured observations $z_k = \varphi(\tilde{x}_k) = \varphi(x_k, y_k)$; an approximation tolerance $\epsilon > 0$.
Result: A sequence of approximating balls $B_1, B_2, \ldots$.
$\Delta_\varphi^2 \leftarrow \|z_0\|^2$;
Set $B_s = B(c_s, r_s)$ to the MEB of the first $s + 1$ observations;
for $k = s + 1, s + 2, \ldots$ do
    if $\|z_k - c_{k-1}\|^2 \ge (1 + \epsilon)^2 r_{k-1}^2$ then
        Call an updating rule to compute $c_k$ and $r_k$;
    else
        Set $c_k = c_{k-1}$ and $r_k = r_{k-1}$.
    end
end
Algorithm 1. Online Approximating Balls
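The control flow of Algorithm 1 can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the authors' code: feature vectors are explicit tuples, $s = 1$, and the updating rule is passed in as a callable with the hypothetical signature `(c_prev, r_prev, z) -> (c_new, r_new)`. The rule `grow_radius` used in the demo is a placeholder that is not part of the paper; it only serves to exercise the loop.

```python
import math

def online_approximating_balls(stream, eps, update_rule):
    """Sketch of Algorithm 1 for explicit feature vectors and s = 1.

    stream      : iterable of feature vectors z_0, z_1, ... (tuples of floats)
    eps         : approximation tolerance (> 0)
    update_rule : callable (c_prev, r_prev, z) -> (c_new, r_new)
    Yields the ball (c_k, r_k) after each round k >= 1.
    """
    it = iter(stream)
    z0, z1 = next(it), next(it)
    # MEB of the first two points: the midpoint and half their distance.
    c = tuple(0.5 * (a + b) for a, b in zip(z0, z1))
    r = 0.5 * math.dist(z0, z1)
    yield c, r
    for z in it:
        # Same test as ||z_k - c_{k-1}||^2 >= (1 + eps)^2 r_{k-1}^2.
        if math.dist(z, c) >= (1.0 + eps) * r:
            c, r = update_rule(c, r, z)
        yield c, r

def grow_radius(c, r, z):
    """Placeholder rule (not from the paper): keep the centre, reach out to z."""
    return c, math.dist(z, c)

stream = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.5), (1.0, 3.0), (1.0, -2.0)]
for center, radius in online_approximating_balls(stream, 0.1, grow_radius):
    print(center, radius)
```

Each item is inspected exactly once and only the current centre and radius are kept between rounds, which is the single-pass property the method relies on.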
3.2 Derivation of the First Rule (OFW)
Our first approximating rule is an online adaptation of the Frank-Wolfe optimization method, a very general procedure to find the optimum of a concave function by using a constrained form of gradient ascent. This method has been studied in [20] and [6] for the fast computation of MEBs and SVMs respectively. At the beginning we initialize $\alpha_k$ to $\alpha_k^0 = (\alpha_{k-1}^T, 0)^T$, which is equivalent to preserving the current approximating ball $B_{k-1}$. Then the rule looks for the best improvement of the quadratic objective function $\Phi(\alpha_k^0)$ in the new direction $k + 1$ given by the last featured observation, that is
$$\eta_k = \arg\max_{\eta \in [0,1]} \Phi\left((1 - \eta)\alpha_k^0 + \eta e^{k+1}\right), \tag{12}$$
where $e^j$ denotes the $j$-th unit vector, that is, the vector with all the components equal to zero except the $j$-th component. The vector $\alpha_k$ is then updated as $\alpha_k = (1 - \eta)\alpha_k^0 + \eta e^{k+1}$. Note that $\alpha_k$ is always in the feasible space of (10), that is
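$\sum_{i \in I_k} \alpha_{k,i} = 1$ for any $k$. On the other hand $k_\varphi(\tilde{x}_i, \tilde{x}_i) = \Delta_\varphi^2$ for any $i$. Thus, the objective function $\Phi(\alpha_{k-1})$ can be written as
$$\Phi(\alpha_{k-1}) = \Delta_\varphi^2 - \sum_{i,j \in I_{k-1}} \alpha_{k-1,i} \alpha_{k-1,j} k_\varphi(\tilde{x}_i, \tilde{x}_j) = \Delta_\varphi^2 - \|c_{k-1}\|^2, \tag{13}$$
and similarly, $\Phi\left((1 - \eta)\alpha_k^0 + \eta e^{k+1}\right) = \Delta_\varphi^2 - \|c_k\|^2$, by setting
$$c_k = (1 - \eta_k) c_{k-1} + \eta_k z_k, \tag{14}$$
$$\alpha_k = (1 - \eta)\alpha_k^0 + \eta e^{k+1}. \tag{15}$$
Note that equation (14) gives an explicit rule to update the center of the current ball. On the other hand, we have by construction $r_{k-1}^2 = \Phi(\alpha_{k-1})$ and thus the first derivative of $\Phi\left((1 - \eta)\alpha_k^0 + \eta e^{k+1}\right)$ equals zero by setting $\eta$ to
$$\eta_k^{ofw} := \frac{\|c_{k-1} - z_k\|^2 - r_{k-1}^2}{2\|c_{k-1} - z_k\|^2} = \frac{\|c_{k-1}\|^2 - z_k^T c_{k-1}}{\|c_{k-1} - z_k\|^2}. \tag{16}$$
Since the second derivative of $\|c_k\|^2$ with respect to $\eta$ equals $2\|c_{k-1} - z_k\|^2 > 0$ and $\eta_k^{ofw} \in [0, 1]$, we have that the value of $\eta_k$ given above is the solution of (12). Plugging this value of $\eta_k$ into equation (14) hence defines our first method to update the current approximating ball:
$$c_k = (1 - \eta_k^{ofw}) c_{k-1} + \eta_k^{ofw} z_k, \qquad r_k^2 = \Phi(\alpha_k) = \Delta_\varphi^2 - \|c_k\|^2. \tag{17}$$
3.3 Derivation of the Second Rule (CNP)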
Our second rule corresponds to a relaxation of the quadratic program which represents the optimal classifier at round $k$. We aim to determine $B_k = B(c_k, r_k)$ by first computing the minimal change in position of $B_{k-1} = B(c_{k-1}, r_{k-1})$ that puts the incoming observation $z_k$ inside $B(c_k, r_{k-1})$ and then updating the radius to keep the primal-dual equation $r_k^2 = \Phi(\alpha_k) = \Delta_\varphi^2 - \|c_k\|^2$. This formulation thus replaces the quadratic program (10) by the simpler problem
$$\min_{c_k} : \|c_k - c_{k-1}\|^2 \quad \text{st: } \|z_k - c_k\|^2 \le r_{k-1}^2. \tag{18}$$
The Lagrangian of this problem is given by $L(c_k, \gamma_k) = \|c_k - c_{k-1}\|^2 + \gamma_k \left(\|z_k - c_k\|^2 - r_{k-1}^2\right)$ with multiplier $\gamma_k \ge 0$. From the Karush-Kuhn-Tucker conditions [13] for optimality (dual-feasibility $\partial L / \partial c_k = 0$) we have that
$$c_k = (1 - \eta_k) c_{k-1} + \eta_k z_k, \tag{19}$$
with $\eta_k = \gamma_k / (1 + \gamma_k)$, that is, the new rule has the same form as the rule previously introduced. Note now that $\gamma_k \ne 0$ because the point $z_k$ is not included in the current approximating ball. Thus the Karush-Kuhn-Tucker conditions (vanishing KKT-gap: $\gamma \cdot \partial L / \partial \gamma = 0$) now imply that the solution of (18) is obtained by setting $\eta_k$ to
$$\eta_k^{cnp} := \frac{\|c_{k-1} - z_k\| - r_{k-1}}{2\|c_{k-1} - z_k\|}. \tag{20}$$
Plugging this value of $\eta_k$ into equation (19) hence defines our second method to update the current approximating ball:
$$c_k = (1 - \eta_k^{cnp}) c_{k-1} + \eta_k^{cnp} z_k, \qquad r_k^2 = \Phi(\alpha_k) = \Delta_\varphi^2 - \|c_k\|^2. \tag{21}$$
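Both rules share the same centre update and differ only in the step size, which the following sketch makes explicit. It is an illustration under stated assumptions: feature vectors are explicit tuples lying on a sphere of squared radius `delta_sq` (mirroring the constant diagonal of the kernel), the step sizes are transcribed as printed in (16) and (20), and all function names are hypothetical.

```python
import math

def sq_norm(v):
    return sum(a * a for a in v)

def step_ofw(c_prev, r_prev, z):
    """Step size of eq. (16): (d^2 - r^2) / (2 d^2), with d = ||c_prev - z||."""
    d2 = sq_norm(tuple(a - b for a, b in zip(c_prev, z)))
    return (d2 - r_prev ** 2) / (2.0 * d2)

def step_cnp(c_prev, r_prev, z):
    """Step size as printed in eq. (20): (d - r) / (2 d)."""
    d = math.dist(c_prev, z)
    return (d - r_prev) / (2.0 * d)

def update_ball(c_prev, r_prev, z, delta_sq, step):
    """Shared update of eqs. (17) and (21): move the centre towards z, then
    reset the radius through r^2 = delta_sq - ||c||^2."""
    eta = step(c_prev, r_prev, z)
    c = tuple((1.0 - eta) * a + eta * b for a, b in zip(c_prev, z))
    r = math.sqrt(max(delta_sq - sq_norm(c), 0.0))
    return c, r

# Toy data on the unit circle (delta_sq = 1): the ball of z_0 = (1, 0) and
# z_1 = (0, 1) has centre (0.5, 0.5) and r^2 = 0.5; the new point is (-1, 0).
c0, r0, z = (0.5, 0.5), math.sqrt(0.5), (-1.0, 0.0)
print(step_ofw(c0, r0, z), step_cnp(c0, r0, z))
print(update_ball(c0, r0, z, 1.0, step_ofw))
print(update_ball(c0, r0, z, 1.0, step_cnp))
```

Since $(d^2 - r^2)/(2d^2) = \frac{d - r}{2d} \cdot \frac{d + r}{d}$, the step of (20) is always the smaller of the two whenever $d > r > 0$, so under these formulas the second rule moves the centre more conservatively than the first.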
4 Simulation Results
We simulate the task of data stream classification by sequentially presenting unseen data to the algorithm. This data was obtained from the following datasets: pendigits (7.4e+03 items, 10 classes), usps (7.2e+03 examples, 10 classes), Kdd-full (4.9e+06 items, 2 classes), Ijcnn (4.9e+04 items, 2 classes) and extended Usps (2.6e+05 items, 10 classes). Datasets Kdd-full, Ijcnn and extended Usps (abbreviated as Usps-ext) were used as in previous research to test the large-scale capabilities of CVMs [17] and are available at [16]. The other problems are available at [7] or [2]. SVMs were trained using a Gaussian kernel $k(x_1, x_2) = \exp(-\|x_1 - x_2\|^2 / \sigma^2)$. Multicategory problems are addressed using an OVO scheme [13]. For datasets Kdd-full, Usps-ext and Ijcnn we used the hyper-parameter values reported in [17,15]. For the smaller datasets ($\le 10^4$ examples) hyper-parameters were set according to the values reported in [6]. Algorithm (1) was initialized by randomly extracting a subset of $s$ items corresponding to 1 percent of the stream. The same criterion was used to simulate alternative algorithms. The method proposed in [12] was implemented and is abbreviated here as CPB. The method based on the periodic computation of a new L2-SVM from the union of old and incoming observations is denoted as PB. Since this approach needs to solve large quadratic programs on the augmented datasets, we only include this algorithm in the results for medium-scale problems. Following this approach, the model is computed again after $s$ new items have arrived at the system, corresponding to 1 percent of the stream size. Note that a finer period would considerably increase time complexity, since each time the model needs to be recomputed considering the complete sequence of previous observations. Additionally, we allow it to use warm-start: each time that the model needs to be recomputed, the starting approximating ball is set to the previous approximating ball available in the system. Naturally, this should improve time complexity.
Tables (1) and (2) show the results obtained with the different problems and algorithms. The third column corresponds to the number of classification mistakes accumulated by the algorithms along the sequence of prediction/adjustment rounds. In order to assess computational complexity we report the total number of kernel evaluations carried out by the algorithm, that is, the number of times that the kernel function $k_\varphi$ is evaluated on a pair of examples in order to make predictions and compute adjustments. Since this variable is platform independent, it is frequently employed to assess the algorithmic complexity of kernel methods. Finally, the last column shows the total running times obtained on a 2.40GHz Intel Core 2 Duo with 2GB RAM running openSUSE 11.1.

Table 1. Results obtained on medium-scale datasets

Dataset    Rule  Cumulated Errors  Stream size (T)  Kernel Evals  Time (secs)
pendigits  OFW   242               7415             1.56e+07      2.57
pendigits  CNP   599               7415             1.83e+07      3
pendigits  CPB   754               7415             1.72e+06      0.24
pendigits  PB    59                7415             9.59e+09      2541.56
usps       OFW   437               7219             1.55e+07      15.96
usps       CNP   790               7219             2.04e+07      22.65
usps       CPB   917               7219             1.72e+06      1.22
usps       PB    185               7219             2.23e+10      22120.3
Table 2. Results obtained on large-scale datasets

Dataset   Rule  Cumulated Errors  Stream size (T)  Kernel Evals  Time (secs)
kdd-full  OFW   2                 4.89e+06         1.49e+08      28.64
kdd-full  CNP   2                 4.89e+06         1.49e+08      27.50
kdd-full  CPB   7037              4.89e+06         1.30e+08      24.94
usps-ext  OFW   1                 264748           7.82e+07      61.79
usps-ext  CNP   1                 264748           8.04e+07      65.29
usps-ext  CPB   340               264748           7.26e+07      59.41
ijcnn     OFW   2133              49490            3.83e+08      69.21
ijcnn     CNP   2399              49490            3.55e+08      63.04
ijcnn     CPB   3808              49490            1.98e+07      2.28

5 Conclusions
We have introduced two algorithms based on minimal enclosing balls to approximate SVM classifiers from streaming data using a single pass over each incoming item. According to the results of Tables (1) and (2) the proposed methods are considerably more accurate than the single-pass method presented in [12] in all cases, at the price of a slightly greater computational complexity. Table (1) shows that the accuracy of the first method proposed in this paper (OFW) comes particularly close to the accuracy obtained from the periodic recomputation of the model. The latter method is however based on multiple passes through the data items and its computational complexity is, as expected, several orders of magnitude worse than the complexity of single-pass methods.
References
1. Aggarwal, C. (ed.): Data Streams, Models and Algorithms. Springer, Heidelberg (2007)
2. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines (2010)
3. Clarkson, K.: Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. In: Proceedings of SODA 2008, pp. 922–931. SIAM, Philadelphia (2008)
4. Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., Singer, Y.: Online passive-aggressive algorithms. J. of Machine Learning Research 7, 551–585 (2006)
5. Dekel, O., Shalev-Shwartz, S., Singer, Y.: The forgetron: A kernel-based perceptron on a budget. SIAM Journal of Computing 37(5), 1342–1372 (2008)
6. Frandi, E., Gasparo, M.-G., Lodi, S., Ñanculef, R., Sartori, C.: A new algorithm for training SVMs using approximate minimal enclosing balls. In: Bloch, I., Cesar Jr., R.M. (eds.) CIARP 2010. LNCS, vol. 6419, pp. 87–95. Springer, Heidelberg (2010)
7. Hettich, S., Bay, S.: The UCI KDD Archive (2010), http://kdd.ics.uci.edu
8. Kivinen, J.: Online learning of linear classifiers, pp. 235–257 (2003)
9. Kivinen, J., Smola, A.J., Williamson, R.C.: Online learning with kernels. IEEE Transactions on Signal Processing 52(8), 2165–2176 (2004)
10. Lodi, S., Ñanculef, R., Sartori, C.: Single-pass distributed learning of multi-class SVMs using core-sets. In: Proceedings of the SDM 2010, pp. 257–268. SIAM, Philadelphia (2010)
11. Bottou, L., Chapelle, O., DeCoste, D., Weston, J. (eds.): Large Scale Kernel Machines. MIT Press, Cambridge (2007)
12. Rai, P., Daumé, H., Venkatasubramanian, S.: Streamed learning: one-pass SVMs. In: IJCAI 2009: Proceedings of the 21st International Joint Conference on Artificial Intelligence, pp. 1211–1216. Morgan Kaufmann Publishers, San Francisco (2009)
13. Schölkopf, B., Smola, A.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge (2001)
14. Shalev-Shwartz, S., Singer, Y.: A primal-dual perspective of online learning algorithms. Machine Learning 69(2-3), 115–142 (2007)
15. Tsang, I., Kocsor, A., Kwok, J.: Simpler core vector machines with enclosing balls. In: ICML 2007, pp. 911–918. ACM, New York (2007)
16. Tsang, I., Kocsor, A., Kwok, J.: LibCVM Toolkit (2009)
17. Tsang, I., Kwok, J., Cheung, P.-M.: Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research 6, 363–392 (2005)
18. Tsang, I., Kwok, J., Zurada, J.: Generalized core vector machines. IEEE Transactions on Neural Networks 17(5), 1126–1140 (2006)
19. Wang, D., Zhang, B., Zhang, P., Qiao, H.: An online core vector machine with adaptive MEB adjustment. Pattern Recognition 43(10), 3468–3482 (2010)
20. Yildirim, E.A.: Two algorithms for the minimum enclosing ball problem. SIAM Journal on Optimization 19(3), 1368–1391 (2008)
21. Zarrabi-Zadeh, H., Chan, T.M.: A simple streaming algorithm for minimum enclosing balls. In: Proceedings of the CCCG 2006 (2006)
X-ORCA - A Biologically Inspired Low-Cost Localization System
Enrico Heinrich, Marian Lüder, Ralf Joost, and Ralf Salomon
University of Rostock, 18051 Rostock, Germany
{enrico.heinrich,marian.lueder,ralf.joost,ralf.salomon}@uni-rostock.de
Abstract. In nature, localization is a very fundamental task for which natural evolution has come up with many powerful solutions. In technical applications, however, localization is still quite a challenge, since most ready-to-use systems are not satisfactory in terms of costs, resolution, and effective range. This paper proposes a new localization system that is largely inspired by the auditory system of the barn owl. A first prototype has been implemented on a low-cost field-programmable gate array and is able to determine the time difference of two 300 MHz signals with a resolution of about 0.02 ns, even though the device is clocked at only 85 MHz. X-ORCA is able to achieve this performance by adopting some of the core properties of the biological role model.
Keywords: hardware implementation, robotics, architecture.
1
Introduction
Localization is a process in which some reference points, angles, and distances are used in order to determine the coordinates of new, so-far unknown points. For this task, nature provides several quite powerful solutions. One particularly interesting solution is provided by the auditory system of the barn owl [7]. This solution propagates the sensory information along some neural pathways across the owl’s brain. Since the two “wires” are anti-parallel, the attached phase (or correlation) detectors all observe different time delays between the two acoustic signals that originate from the owl’s ears.
Section 2 proposes a technical model, called X-ORCA, that adopts some of the main properties of the biological role model. Conceptually, the correlation neurons are modeled by phase detectors. Each phase detector consists of a simple XOR gate and a counter. The counter value represents the average firing rate of the modeled neuron, and is displayed as a simple number. Internally, these phase detectors are placed along two anti-parallel “delay wires”. Since these wires run in opposite directions, all the phase detectors observe different signal phases, just as in the barn owl’s auditory system.
In the domain of electrical engineering, electromagnetic signals are often preferred over acoustic ones, since they travel very large distances with high reliability and low energy consumption. However, electromagnetic signals travel with the speed of light c ≈ 3·10^8 m/s, which makes them quite challenging for every digital system when it comes to high resolutions: a difference in length of Δx = 1 cm, for example, corresponds to a time difference of Δt ≈ 33 ps. Because the X-ORCA system is intended to detect signal delays in the range of a few picoseconds, the aforementioned delay wires are made of regular passive wires, as can be found inside any digital circuit.
A first prototype has been implemented on an Altera Cyclone II field programmable gate array (FPGA) [2]. Such an FPGA is a digital device, which consists of a very large number of simple logical gates. These gates can be properly interconnected by using a hardware description language. Because of this hardware-oriented realization approach, such a system can be operated in situ. Section 3 provides all the technical implementation details as well as the experimental setup. The practical experiments are summarized in Section 4, and show that even this first X-ORCA prototype yields a resolution of about 0.02 ns. Finally, Section 5 concludes this paper with a brief discussion.
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 373–382, 2011. © Springer-Verlag Berlin Heidelberg 2011
2
The X-ORCA Localization System
This section presents the X-ORCA architecture in three parts. The first part starts off by clarifying the physical setup and all the assumptions made in this paper. Then, the second part explains X-ORCA’s core principles. In so doing, it makes a few assumptions that might seem practically implausible to some readers. However, the third part elaborates on how the X-ORCA architecture and the assumptions made in the second part can be fully realized on standard circuits.
2.1
Physical Setup and Preliminaries
Since the aim of a single X-ORCA instance is to determine the phase shift Δϕ between two incoming signals, it can be used as the core of a one-dimensional localization system. It thus adopts a standard setup (see also Fig. 1) in which a transmitter T emits a signal s(t) = A sin(2πf(t − t0)) with frequency f, amplitude A, and time offset t0. Since this signal travels with the speed of light c ≈ 3 · 10^8 m/s, it arrives at the receivers R1 and R2 after the delays Δt1 = (L + Δx)/c and Δt2 = (L − Δx)/c, respectively. Both receivers employ an amplifier and a Schmitt trigger, and thus feed the X-ORCA system with the two rectangular signals r1(t − t0) and r2(t − t0), which both have frequency f. By estimating the phase shift Δϕ between these two signals, X-ORCA then determines the time difference Δt = Δt1 − Δt2 = Δϕ/(2πf), in order to arrive at the transmitter’s off-center position Δx = Δt · c/2. It might be, though, that both the physical setup and the X-ORCA system have further internal delays, caused, for example, by switches, cables of different lengths, repeaters, and further logical gates. However, these internal delays are all omitted here, since they can be easily eliminated in a proper calibration process. Furthermore, for a real-world three-dimensional scenario, the X-ORCA system simply has to be duplicated twice.
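The relations of this subsection can be checked numerically. The following sketch simply evaluates Δt1, Δt2, Δϕ, and the recovered Δx; the function names and the example values (L = 10 m, Δx = 5 cm, f = 300 MHz) are assumptions chosen for illustration, not parameters of the prototype.

```python
import math

C = 3.0e8  # speed of light in m/s, rounded as in the text


def arrival_delays(L, dx):
    """Delays of the signal at receivers R1 and R2 for a transmitter at off-center position dx."""
    return (L + dx) / C, (L - dx) / C


def phase_shift(dt, f):
    """Phase shift (radians) caused by a time difference dt at signal frequency f."""
    return 2.0 * math.pi * f * dt


def offset_from_phase(dphi, f):
    """Invert the chain: phase shift -> time difference -> off-center position."""
    dt = dphi / (2.0 * math.pi * f)
    return dt * C / 2.0


L, dx, f = 10.0, 0.05, 300e6          # illustrative values only
dt1, dt2 = arrival_delays(L, dx)
dt = dt1 - dt2                         # equals 2*dx/c
dphi = phase_shift(dt, f)
print(dt, math.degrees(dphi), offset_from_phase(dphi, f))
```

Note that the inversion is only unique as long as |Δt| stays below half a signal period; larger offsets wrap around in phase.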
[Figure: transmitter T at off-center position Δx on an axis from −L to L; receivers R1 and R2 feed the signals r1(t) and r2(t) into X-ORCA, which outputs Q(t)]
Fig. 1. X-ORCA assumes a standard, one-dimensional setup in which the time difference Δt = Δt1 − Δt2 = 2Δx/c is a result of the transmitter’s off-center position Δx. It indirectly determines Δt = Δϕ/(2πf) by estimating the phase shift Δϕ between the two incoming signals r1(t) and r2(t).
2.2
The System Core
Essentially, the X-ORCA core consists of a large number of independently operating phase detectors. One of these phase detectors is illustrated in Fig. 2. It consists of a logical XOR and a counter. The XOR “mixes” the two input signals s1 and s2, and yields a logical 1 or a logical 0 depending on whether the two signals differ or not. In other words, the degree to which both signals differ from each other corresponds to the phase shift Δϕ, and is represented as the proportion of logical 1’s per time unit. This proportion is evaluated by the counter that is attached to the XOR gate. For example, let us assume an input signal with a frequency of f = 100 MHz and a phase shift of Δϕ = π/4 = 45°. Then, if the counter is clocked at a rate of 10 GHz over a signal’s period T = 1/(100 MHz) = 10 ns, it takes 100 samples, of which a fraction Δϕ/π = 1/4 sees differing inputs, so the counter will assume a value of v = 25.
[Figure: block diagram in which the signals from R1 and R2 enter an XOR gate whose output drives a counter with reset and system clock inputs; the counter delivers the phase/correlation value]
Fig. 2. An X-ORCA phase detector consists of a logical XOR (or any other suitable binary logic function), which “mixes” the two input signals s1 and s2, and an additional counter to actually determine the phase shift Δϕ
At this point, three practical remarks should be made: (1) The XOR gate has been chosen for purely educational purposes; any other suitable binary logic function, such as AND, NAND, OR, and NOR, could have been chosen as well. (2) A counter clock rate of 10 GHz is quite unrealistic for technical reasons, but Subsection 2.3 shows how such clock rates can be virtually achieved. (3) A result of a phase shift Δϕ = π/4 = 45°, for example, is intrinsically ambiguous, since the system cannot distinguish between Δϕ = π/4 = 45° and Δϕ = −π/4 = −45°.
In order to solve the ambiguity of a single phase detector, X-ORCA simply employs more than just one. Figure 3 shows that X-ORCA places all phase detectors along two reciprocal (anti-parallel) “delay” wires w1 and w2 on which the two signals r1(t) and r2(t) travel with approximately two thirds of the speed of light, cw ≈ (2/3)c. Because the two wires w1 and w2 are reciprocal, all phase detectors have different internal delays τi, which always add to the external delay Δt = 2Δx/c that is due to the transmitter’s off-center position Δx. As a consequence, each phase detector i observes an effective time delay Δt + τi and thus a phase shift Δϕi = 2πf(Δt + τi). Further post-processing stages become particularly easy if the internal delays span an entire period T of the localization signal s(t), i.e., τmax − τmin = T = 1/f.
[Figure: the signals r1(t) and r2(t) run in opposite directions past phase detectors i, j, and k; below, the phase indicator plotted over the counter index]
Fig. 3. X-ORCA places all phase detectors along two reciprocal (anti-parallel) “delay” wires w1 and w2 on which the two signals r1(t) and r2(t) travel with approximately two thirds of the speed of light, cw ≈ (2/3)c. Because the two wires w1 and w2 are reciprocal, all phase detectors have different internal delays τi.
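The counting example and the sign ambiguity mentioned in remark (3) can be reproduced with idealized square waves. The sketch below is an assumption-laden toy model (perfect 50% duty cycle, no rise or fall times, exact rational arithmetic); the function names are illustrative only.

```python
from fractions import Fraction


def square_wave(phase):
    """Ideal 50% duty-cycle signal; phase is measured in periods."""
    return 1 if phase % 1 < Fraction(1, 2) else 0


def xor_count(shift, samples_per_period):
    """Count the samples within one period at which the two signals differ.

    shift is the phase shift expressed as a fraction of a period (dphi / 2*pi).
    """
    count = 0
    for k in range(samples_per_period):
        phase = Fraction(k, samples_per_period)
        count += square_wave(phase) ^ square_wave(phase - shift)
    return count


shift = Fraction(1, 8)                 # dphi = pi/4, i.e. one eighth of a period
print(xor_count(shift, 100))           # within one count of the ideal value 25
print(xor_count(shift, 800), xor_count(-shift, 800))  # equal counts for +pi/4 and -pi/4
```

With 100 samples per period the shift corresponds to 12.5 samples, so the simulated count can deviate from the ideal value 25 by one count; on the finer grid the counts for +π/4 and −π/4 coincide, which is exactly the ambiguity a single detector cannot resolve.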
For a first estimate of the transmitter’s off-center position Δx, it would suffice to determine the phase detector that has the smallest counter value vmin = min{vi}; only those phase detectors i for which the condition τi ≈ −Δt holds have a counter value close to zero.
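The minimum search described above can be sketched as follows. The evenly spaced internal delays, the idealized V-shaped counter model, and all names are assumptions made for this illustration; the real prototype has irregular delays and noisy counters.

```python
def expected_count(delay, period, n_samples):
    """Idealized XOR counter value for an effective delay (V-shaped in the delay)."""
    frac = (delay / period) % 1.0        # wrap the delay into one signal period
    dist = min(frac, 1.0 - frac)         # distance to the nearest in-phase point
    return round(2.0 * dist * n_samples)


def estimate_delay(taus, counts):
    """Return (-tau_i, i) for the detector i with the smallest counter value."""
    i_min = min(range(len(counts)), key=counts.__getitem__)
    return -taus[i_min], i_min


period = 1 / 300e6                                   # 300 MHz localization signal
n = 140                                              # number of phase detectors
taus = [(i / n - 0.5) * period for i in range(n)]    # hypothetical, evenly spaced delays
true_dt = 0.4e-9                                     # external delay to be recovered
counts = [expected_count(true_dt + tau, period, 1_000_000) for tau in taus]
est, idx = estimate_delay(taus, counts)
print(idx, est)
```

In this toy setting the estimate is off by at most half the spacing between neighboring internal delays, which is why a finer delay grid directly improves the first estimate.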
[Figure: three timing diagrams showing the gate inputs R1 and R2 and the gate output Q]
Fig. 4. Due to the inherent rise and fall times, a change in a gate’s output requires some time. Therefore, if the input frequency increases too much or if the input edges come too close together, the gate cannot properly change its output (right-hand side).
Furthermore, in case all phase detectors are sorted in ascending order, i.e., τi ≤ τi+1, the counter values vi assume a V-shaped curve. Thus, X-ORCA might also utilize all phase detectors for reconstructing Δx by, for example, calculating the best-fitting curve.
2.3
Real-World Implementation Details
The description presented in Subsection 2.2 has made a few practically unrealistic assumptions, which mostly concern the maximal frequency f that can be processed by the phase detectors. First of all, the X-ORCA concept has assumed that the clock frequency clk ≥ 100 × f is at least 100 times higher than the frequency of the localization signal s(t) in order to achieve a practically relevant resolution. A signal frequency of f = 100 MHz, for example, would require a clock frequency of at least clk = 10 GHz. Such a clock frequency, however, is unrealistic for low-cost devices, such as FPGAs. In the case of periodic localization signals, however, a virtually very high sampling frequency can be achieved by a technique known as unfolding-in-time [6]. Let us assume, for example, a signal with frequency f and thus a period of T = 1/f. Ideally, the samples would be taken at 0, t, 2t, ..., (n − 1)t, with t = T/n denoting the interval between two consecutive samples, and n denoting the number of samples per signal period T. Unfolding-in-time means that the samples are instead taken at 0, (t + T), 2(t + T), ..., (n − 1)(t + T). That is, the sampling process is expanded over an extended interval with a duration of about nT; because the signal is periodic, each sample still observes the same phase as its ideal counterpart. Moreover, unfolding-in-time does not necessarily stick to an increment of “t + T”. For example, the samples can also be taken at 0, (kt + T), 2(kt + T), ..., (n − 1)(kt + T), with k denoting a constant that is coprime to n.
The second assumption concerns the electrical transition behavior of the XOR gates as well as the counters. The conceptual description of Subsection 2.2 implicitly assumes that gates and counters are fast enough to properly process signals that travel along the internal wires with about two thirds of the speed of light. The technical suitability of this approach might be surprising to some readers but has already been shown by previous research [8].
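The role of the constant k in unfolding-in-time can be illustrated with integer arithmetic: a sample taken at time j(kt + T) falls, modulo the period T, into phase slot (j·k) mod n. The helper names below are hypothetical, and the sketch ignores jitter and clock drift.

```python
from math import gcd


def unfolded_phases(n, k):
    """Phase slots (in units of t = T/n) hit by samples taken at j*(k*t + T), j = 0..n-1."""
    # j*(k*t + T) modulo the period T equals ((j*k) mod n) * t.
    return [(j * k) % n for j in range(n)]


def covers_all_phases(n, k):
    """True if the n samples visit every one of the n phase slots exactly once."""
    return sorted(unfolded_phases(n, k)) == list(range(n))


print(unfolded_phases(10, 3))                                # [0, 3, 6, 9, 2, 5, 8, 1, 4, 7]
print(covers_all_phases(10, 3), covers_all_phases(10, 4))   # True False
```

The second call fails because 4 and 10 share the factor 2, so only every other phase slot is visited; this is the reason k must be coprime to n.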
That research has also shown that, due to technical reasons such as thermal noise, the logic gates do not yield exact results but exhibit a rather random behavior if, for example, setup and hold time requirements are not met. This random effect can be statistically compensated, for example, by a large number of processing elements, which is another reason for employing a large number of phase detectors in the X-ORCA architecture.
The third implementation remark concerns the processing speed of the gates and the input parts of the counters. Figure 4 shows that if the phase shift gets too small (or too close to 180°), the rise and fall times prevent the gate from properly switching its output state. These effects lead to small errors in the counter values if the phase shift Δϕ is close to zero or 180°; as a result, the expected V-shaped curve of the counter values (Subsection 2.2) might change to a U-shape.
3
Methods
The first X-ORCA prototype was implemented on an Altera Cyclone II FPGA [2]. This device offers 33,216 logic elements and can only be clocked at about 85 MHz. The chosen FPGA development board is a low-cost device that costs about 500 USD. On the top-level view, the X-ORCA prototype consists of 140 phase detectors, a common data bus, a Nios II soft core processor [3], and a system PLL that runs at 85 MHz. The Nios II processor manages all the counters of the phase detectors, and reports the results via an interface to a PC. Due to the limited laboratory equipment, the transmitter, its localization signal s(t), the two receivers R1 and R2, and their distances to the transmitter are all emulated on the very same development board. The transmitter and its localization signal s(t) are realized by means of a second PLL, which runs at 300 MHz, whereas the receivers and physical distances are realized by means of some active delay lines. It should be noted, though, that X-ORCA’s internal “delay wires” w1 and w2 are realized as purely passive internal wires connecting the device’s logic elements, as previously announced in Subsection 2.2. In a second experiment, the prototype utilized an external 19 MHz signal and emulated the transmitter-to-receiver distances by external line stretchers [1].
4
Results
Figures 5-8 summarize the experimental results that the first X-ORCA prototype has achieved under different configurations. Unless otherwise stated, the figures present the counter values vi of n = 140 different phase detectors, which were clocked at a rate of 85 MHz. In Fig. 5, the prototype was exposed to two 300 MHz (localization) signals that have a zero phase shift Δϕ = 0. The input signals were sampled 1,000,000 times, which corresponds to an averaging over 196 periods, with virtually 5100 samples per period of the localization signal (see also the discussion presented in Subsection 2.3). It can be clearly seen that the minimum is at counter #31 and that the counters to the left and right have larger values, as can be expected from X-ORCA’s internal architecture. In addition, Fig. 5 reveals some technological FPGA internals that might already be known to expert readers: neighboring logic elements do not necessarily have equivalent technical characteristics and are not interconnected by a regular wire grid. As a consequence, the counter values vi and vi+1 of two neighboring phase detectors do not steadily increase or decrease, which makes the curve look a bit rough.
[Figure: counter value (0 to 800000) plotted over the counter index]
Fig. 5. The figure shows the counter values vi of n = 140 phase detectors when fed with two 300 MHz signals with zero phase shift Δϕ = 0
Figure 6 shows the results of the prototype when the two input signals have one of the following three time delays Δt = t1 − t2 ∈ {−0.4 ns, 0 ns, +0.4 ns}. It can be clearly seen that a time delay of 0.4 ns shifts the “counter curve” by about 20 counters. Since 0.4 ns / 20 = 0.02 ns per counter, this observation suggests that the prototype would be able to detect a time delay as small as Δt = 0.02 ns.
[Figure: three curves of the counter value (0 to 800000) plotted over the counter index, one per phase shift]
Fig. 6. The figure shows the counter values vi of n = 140 phase detectors when fed with two 300 MHz signals with zero phase shift Δϕ = 0 (solid line), with −43° phase shift Δϕ = −43° (dotted line), and with +43° phase shift Δϕ = +43° (dashed line)
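The numbers quoted for the 300 MHz experiment can be cross-checked with a few lines of arithmetic. This is only a sanity check of figures stated in the text (43°, 0.4 ns, about 20 counters, 1,000,000 samples, 5100 virtual samples per period); the variable and function names are made up for the illustration.

```python
def phase_deg_to_delay(phase_deg, f):
    """Time delay (seconds) that corresponds to a phase shift in degrees at frequency f."""
    return phase_deg / 360.0 / f


f_signal = 300e6                                   # localization signal frequency
delay = phase_deg_to_delay(43.0, f_signal)         # about 0.4 ns, as stated for Fig. 6
shift_in_counters = 20                             # "about 20 counters", read off Fig. 6
per_counter = 0.4e-9 / shift_in_counters           # delay resolved by one counter position
periods_averaged = 1_000_000 / 5100                # samples divided by samples per period

print(delay, per_counter, periods_averaged)
```

The values come out at roughly 0.398 ns, 0.02 ns, and 196 periods, which agrees with the figures given in the text.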
[Figure: three curves of the counter value (2300000 to 3500000) plotted over the counter index, one per phase shift]
Fig. 7. The figure shows the counter values vi of n = 140 phase detectors when fed with two 19 MHz signals with zero phase shift Δϕ = 0 (dashed line), with about −0.3° phase shift Δϕ ≈ −0.3° (solid line), and with about +0.3° phase shift Δϕ ≈ +0.3° (dashed line)
[Figure: delay indicator value (390 to 480) plotted over the adjustable delay in cm (0 to 25)]
Fig. 8. The figure shows the delay value indicator resulting from adjustable delay line lengths when fed with two 19 MHz signals
A closer look at Figs. 5 and 6 reveals that the graphs are not exactly V-shaped but rather U-shaped at the very bottom. This is because the effects illustrated in Fig. 4 come into play.
Figure 7 shows the behavior of the X-ORCA architecture when using the external 19 MHz localization signal. In this experiment, one of the connections from the function generator to the input pad of the development board was established by a line stretcher [1], whereas the other one was made of a regular copper wire. Figure 7 shows the values vi of the n = 140 counters, which were still clocked at 85 MHz over a measurement period of 10,000,000 ticks. The three graphs refer to a phase shift of Δϕ ∈ {−0.3°, 0°, +0.3°}, which corresponds to time delays Δt ∈ {−0.15 ns, 0 ns, +0.15 ns}. It should be noted that each graph of this figure appears as a straight line, since the internal time delays τi span much less than an entire period of the 19 MHz signal, whose frequency is significantly lower than that of the previously used 300 MHz signal (both experiments have used exactly the same X-ORCA system).
Figure 8 presents a different view of Fig. 7: in the graph, every dot represents the sum vtot = Σi vi of all n = 140 counter values vi; that is, an entire graph of Fig. 7 is collapsed into one single dot. The graph shows 29 measurements in which the line stretcher was extended by 1 cm step by step. It can be seen that a length difference of Δx = 1 cm decreases vtot by about 20. This result suggests that with a localization signal of 19 MHz, X-ORCA is able to detect a length difference of about Δx = 1 mm, which equals a time resolution of about 0.015 ns.
5
Discussion
This paper has presented a new localization architecture, called X-ORCA. Its main purpose is the localization of transmitters, such as WLAN network cards or Bluetooth dongles, that emit electromagnetic signals. In its core, X-ORCA consists of a large number of very simple phase detectors, which are mounted along two passive wires with very small but finite internal time delays. This large number of rather unreliable phase detectors allows X-ORCA to perform a rather reliable statistical evaluation.
The X-ORCA architecture has been heavily inspired by the biological role model, i.e., the auditory system of the barn owl. In this adaptation process, X-ORCA relies on a large number of simple phase detectors, each of which yields rather unreliable results. However, by averaging over a large number of entities, as the role model suggests, X-ORCA arrives at a quite reliable and accurate result. Since the role model’s neurons were emulated in re-configurable, physical hardware, the system is able to process electromagnetic signals, rather than acoustic signals. The switch in the utilized media is of practical importance for many real-world applications, such as the localization of persons and/or objects in laboratory environments.
Unfortunately, the available laboratory equipment did not allow testing the true limits of the first prototype. This particularly applies to the maximal frequency f of the localization signal and to the achievable resolution with respect to Δx. These tests will certainly be the subject of future research. Future research will also be devoted to the integration of wireless communication modules. The best option seems to be the utilization of a software-defined radio module, such as the Universal Software Radio Peripheral 2 (USRP2) [5]. Finally, future research will port the first prototype onto more state-of-the-art development boards, such as an Altera Stratix V FPGA [4].
Acknowledgements
The authors gratefully thank Volker Kühn and Sebastian Vorköper for their helpful discussions. This work was supported in part by the DFG graduate school 1424. Special thanks are due to Matthias Hinkfoth for valuable comments on draft versions of the paper.
References
1. Microlab: Line Stretchers, SR series. Datasheet, Microlab Company (2008)
2. Altera Corp., San Jose, CA. Nios Development Board Cyclone II Edition Reference Manual. Altera Document MNLN051805-1.3 (2007)
3. Altera Corp., San Jose, CA. Nios II Processor Reference Handbook. Altera Document NII5V1-7.2 (2007)
4. Altera Corp., San Jose, CA. Stratix V Device Handbook. Altera Document SV5V11.0 (2010)
5. Ettus Research LLC, http://www.ettus.com
6. Hertz, J., Krogh, A., Palmer, R.G.: Introduction to the Theory of Neural Computation. Addison-Wesley Pub. Co., Redwood City (1991)
7. Kempter, R., Gerstner, W., van Hemmen, J.L.: Temporal coding in the submillisecond range: Model of barn owl auditory pathway. Advances in Neural Information Processing Systems 8, 124–130 (1996)
8. Salomon, R., Joost, R.: Bounce: A new high-resolution time-interval measurement architecture. IEEE Embedded Systems Letters (ESL) 1(2), 56–59 (2009)
On the Origin and Features of an Evolved Boolean Model for Subcellular Signal Transduction Systems
Branko Šter1, Monika Avbelj2, Roman Jerala2, and Andrej Dobnikar1
1 Faculty of Computer and Information Science, University of Ljubljana, Tržaška 25, 1000 Ljubljana, Slovenia
[email protected]
2 National Institute of Chemistry, Hajdrihova 19, 1001 Ljubljana, Slovenia
Abstract. In this paper we deal with the evolved Boolean model of the subcellular network for a hypothetical subcellular task that performs some of the basic cellular functions. The Boolean network is trained with a genetic algorithm and the obtained results are analyzed. We show that the size of the evolved Boolean network relates strongly to the task, that the number of output combinations is decreased, which is in concordance with the biological (measured) networks, and that the number of non-canalyzing inputs is increased, which indicates its specialization to the task. We conclude that the structure of the evolved network is biologically relevant, since it incorporates properties of evolved biological systems.
Keywords: Subcellular networks, Simulation, Genetic algorithms, Regression.
1
Introduction
Recent studies in biochemistry, molecular biology and information processing networks have opened an important area of research: analysis and modeling of intracellular signal-transduction networks [1,2,3,4]. The main goal is to understand the origin, the features and the information processing of subcellular networks of genes or proteins. It has already been shown with the help of simulations that a discrete Boolean model of the signal-transduction network is able to simulate intracellular mappings from surface receptors to an output set of genes or proteins [5]. In that case, however, the logic tables of the nodes within the Boolean model and the interactions between the nodes and/or input receptors were taken from an extensive experimental work and a huge set of network elements performing a simple classification task [6].
In this paper we show some preliminary results of the evolved Boolean model of the subcellular network for a hypothetical subcellular task. We found that a) the number of nodes of the evolved Boolean network and its number of inputs per node k are considerably related to the size of the task, b) the number of output attractors decreases significantly with the evolution of the model, c) the number of non-canalyzing combinations is greatly increased in the evolved model, d) the structure of the evolved model is biologically reasonable. Our results are in concordance with those experimentally obtained [5]. The origin of the Boolean network for subcellular tasks via evolution and its information-processing function of nontrivial clustering are the main contributions of the paper. The resultant structure for the pre-defined task also gives some insight into the natural features of the network.
The paper is organized as follows. In Chapter 2 we give some background of the Boolean model of the subcellular biological signal-transduction network and describe an example subcellular task. Chapter 3 details the evolution of the model and gives the main results of the procedure related to the case-study task, together with the features of the evolved structure. In the conclusion, we comment on the results and open some new ideas for future work.
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 383–392, 2011. © Springer-Verlag Berlin Heidelberg 2011
2
Boolean Model of Subcellular Signal-Transduction Network
The main objective of the Boolean network modeling is to study generic coarse-grained properties of large subcellular signal-transduction networks. In particular, the logical functions of nodes (genes or proteins) and their interactions are investigated via ’goal-oriented evolution’, where the ’goal’ is a subcellular task to be performed by the network and the evolution describes some natural search procedure for the proper structure of the model. The functions of the nodes and their connections are unknown (random) at the beginning of the evolution. The result of the procedure gives the proper functions of the nodes and their interconnections such that the task is performed properly.
Searching for the right structure of the Boolean model is a huge combinatorial problem, even for a rather small task or a correspondingly small network. Considering that only input receptors and output nodes are known for some realistic subcellular task, the obvious unknowns are: number of hidden nodes, set of possible node functions, number of inputs to the nodes, topology of the network or connectivity plan, etc. Fortunately, some simplifications that do not significantly change the nature of the problem are possible. Instead of the complete set of possible functions of the nodes, only the set of biologically relevant [7] canalyzing functions is considered, which results in a substantial reduction of the set, from 2^(2^n) to 2 · 2^n functions (Table 1a), where n is the number of inputs to the nodes. It is well known that NAND and NOR are universal logical functions; they are also canalyzing. A function is canalyzing if, in all but one input combination, a single input variable determines the output value. Table 1b illustrates all possible canalyzing functions with two input variables, n = 2, where active inputs and outputs take all possible combinations. Obviously c1 = NOR and c8 = NAND.
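The count 2 · 2^n can be checked by brute-force enumeration for n ≥ 2. The sketch below uses the characterization adopted in this paper (a truth table that deviates from a constant function in exactly one input combination); the function name and the row ordering (00, 01, 10, 11 for n = 2) are assumptions made for the illustration.

```python
from itertools import product


def near_constant_functions(n):
    """Truth tables over all 2**n input rows that differ from a constant in exactly one row."""
    rows = 2 ** n
    tables = []
    for table in product((0, 1), repeat=rows):
        ones = sum(table)
        if ones == 1 or ones == rows - 1:   # a single 1 among 0s, or a single 0 among 1s
            tables.append(table)
    return tables


tables2 = near_constant_functions(2)
NOR = (1, 0, 0, 0)      # outputs for the inputs 00, 01, 10, 11
NAND = (1, 1, 1, 0)
print(len(tables2), NOR in tables2, NAND in tables2)                              # 8 True True
print(len(near_constant_functions(3)), len(near_constant_functions(4)))          # 16 32
```

For n = 2 the eight resulting truth tables are the columns c1 to c8 of Table 1b, and the counts for n = 3 and n = 4 match the right-hand column of Table 1a.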
For example, in c1 = i1 ↓ i2 = ¬(i1 ∨ i2) = ¬i1 · ¬i2, the active value for any input is 1, which activates the majority output value 0; in c8 = ¬(i1 · i2) = ¬i1 ∨ ¬i2, the active value for both inputs is 0, but the activated output is 1. In c3 = i1 · ¬i2, active i1 is 0 and active i2 is 1, and the active output is 0. Canalyzing functions can also be described as the functions that are closest to the constant functions, as only a single combination of all input variables (also called the non-canalyzing input combination) leads to the other (non-active) function value.
Table 1. Number of possible canalyzing functions (a) and the set for n = 2 (b)
a.
n  2^(2^n)  2 · 2^n
1  4        4
2  16       8
3  256      16
4  64k      32
b.
i1 i2  c1 c2 c3 c4 c5 c6 c7 c8
0  0   1  0  0  0  0  1  1  1
0  1   0  1  0  0  1  0  1  1
1  0   0  0  1  0  1  1  0  1
1  1   0  0  0  1  1  1  1  0
Another simplification of the large combinatorial problem follows from the reduced set of possible functions of the nodes in the Boolean network. As only one input variable to the node (with a canalyzing function) defines the output value in most cases, we can limit our search to the networks with a constant number of input variables for all nodes, denoted by k, which is clearly an important parameter of the evolving procedure.
A Boolean network can be described as a directed graph G = (V, E), where V is a set of vertices or nodes and E a set of directed edges, each edge being an ordered pair of nodes. It is convenient to label the nodes with integers, V = {1, 2, ..., v} for a graph of v nodes; (j, i) then denotes a directed link from node j to node i. A graph with v nodes is completely specified by a v × v matrix C = (cij), which is called the adjacency matrix of the graph. The element cij in the i-th row and j-th column of C is equal to unity if E contains a directed link (j, i), and zero otherwise. The adjacency matrix C is non-negative because it has no negative entries, which implies the existence of a real eigenvalue λ1 that is largest in magnitude (the eigenvalues λ are the roots of the characteristic equation |C − λI| = 0, and an eigenvector x = (x1, ..., xv) of C satisfies Cx = λx). From this largest real eigenvalue λ1 it is possible to study the presence or absence of closed paths in a graph (Perron-Frobenius theorem) [8] in the following way:
1. no closed path if λ1(C) = 0,
2. closed path if λ1(C) ≥ 1.
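The closed-path criterion can also be tested without computing eigenvalues: for a 0/1 adjacency matrix, λ1(C) = 0 holds exactly when no power C, C^2, ..., C^v has a nonzero diagonal entry, since entry (i, i) of C^m counts the closed walks of length m through node i. The following pure-Python sketch uses this equivalent test; the helper names and the two example graphs are hypothetical.

```python
def mat_mul(a, b):
    """Product of two square integer matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def has_closed_path(c):
    """True if the directed graph with adjacency matrix c contains a closed path."""
    v = len(c)
    power = c
    for _ in range(v):                      # power runs through C^1, ..., C^v
        if any(power[i][i] for i in range(v)):
            return True                     # nonzero diagonal entry: a closed walk exists
        power = mat_mul(power, c)
    return False


# Convention of the text: c[i][j] = 1 if there is a link from node j to node i.
chain = [[0, 0, 0],
         [1, 0, 0],
         [0, 1, 0]]      # links 1 -> 2 -> 3, no closed path
ring = [[0, 0, 1],
        [1, 0, 0],
        [0, 1, 0]]       # additional link 3 -> 1 closes the loop
print(has_closed_path(chain), has_closed_path(ring))   # False True
```

Checking powers up to C^v suffices because a closed path in a graph with v nodes never needs more than v links.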
Node i in the graph performs a particular Boolean function fi from the list of all possible logical (canalyzing) functions with only two possible values (states), True (1) or False (0). The global state of the network at discrete time t is represented by the set of all current function values of the nodes, F(t) = (f1(t), ..., fv(t)). The dynamics of the network are given by the sequence F(t), F(t+1), F(t+2), ..., which is the consequence of the current state F(t) and the current value of the input (receptor) vector. Because of the general topology of the network, with possible closed loops (cycles) between nodes, the network may respond to different inputs with sequences of different lengths, where the length is the number of global states from the starting state to the attractor state or cycle. The attractor state is the global state that no longer
changes provided that the input does not change. The attractor cycle is a sequence of several global states that repeats periodically. By considering responses to different global states at different inputs, one can observe some interesting information processing (nontrivial clustering), which differs significantly between the starting (random) and the evolved Boolean network.

For the purpose of illustrating the subcellular modeling, we use a hypothetical system that mimics important properties of an organism and that should be executed by the unknown network. It can be described as follows. The network has nine receptor inputs. Five of them represent different danger signals: D1 and D2 - bacterial infection, D3 and D4 - viral infection, D5 - cellular injury. The other four represent different sources of energy (food): F1 - proteins, F2 - carbohydrates, F3 - lipids, F4 - sugar. In this way the system is equipped with the possibility of either increasing or decreasing the energy according to fitness. The system can respond to the input signals through seven outputs that activate different metabolic and defense genes, which help the organism to respond to the danger and utilize the available food sources: MG - general metabolic gene, PG - protein metabolism, CG - metabolism of carbohydrates, TG - lipid metabolism, DG - generalized defense against danger, BG - defense against bacteria, VG - defense against viruses.

The logical mapping of the task is shown in Table 2, together with the probabilities, which are based on the energy consumption/acquisition of the network. There are 2^9 = 512 possible inputs in the table. Only nine of them are basic (first group) and have biologically established outputs. For all other input combinations (second group), the reasonable outputs are obtained by superimposing the outputs of the basic entries, with an exception in the case of simultaneously active Fs and Ds. In that case, only the influence of the Ds is considered.
For example, if D1 and D3 are active (1), then DG, BG and VG are set, while if D1, D2 and F1 are active, then only DG and BG are taken into account, while MG and PG are ignored.

Table 2. Input-output table of the network for the task under discussion; p is the probability of the entry in the table, based on energy consumption/acquisition

D1 D2 D3 D4 D5 F1 F2 F3 F4 | DG BG VG MG PG CG TG | p
1  0  0  0  0  0  0  0  0  | 1  1  0  0  0  0  0  | 0.6
0  1  0  0  0  0  0  0  0  | 1  1  0  0  0  0  0  |
0  0  1  0  0  0  0  0  0  | 1  0  1  0  0  0  0  |
0  0  0  1  0  0  0  0  0  | 1  0  1  0  0  0  0  |
0  0  0  0  1  0  0  0  0  | 1  1  1  0  0  0  0  |
0  0  0  0  0  1  0  0  0  | 0  0  0  1  1  0  0  |
0  0  0  0  0  0  1  0  0  | 0  0  0  1  0  1  0  |
0  0  0  0  0  0  0  1  0  | 0  0  0  1  0  0  1  |
0  0  0  0  0  0  0  0  1  | 0  0  0  1  0  0  0  |
---------------------------------------------------------
more than 1 danger         | superposition of outputs | 0.4
more than 1 food           | superposition of outputs |
combinations of Ds & Fs    | Fs are ignored           |
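The mapping in Table 2 can be written down as a small Python sketch. The dictionary below copies the nine basic entries of the table; the function name target_outputs and the representation of signals as strings are assumptions made for illustration, not part of the original model description.

```python
# Basic entries of Table 2: active receptor -> set of activated genes.
SIGNAL_TO_GENES = {
    "D1": {"DG", "BG"},
    "D2": {"DG", "BG"},
    "D3": {"DG", "VG"},
    "D4": {"DG", "VG"},
    "D5": {"DG", "BG", "VG"},
    "F1": {"MG", "PG"},
    "F2": {"MG", "CG"},
    "F3": {"MG", "TG"},
    "F4": {"MG"},
}


def target_outputs(active):
    """Return the target gene set for a collection of active receptors.

    Outputs of the basic entries are superimposed (logical OR); if any
    danger signal is active, the food signals are ignored.
    """
    dangers = [s for s in active if s.startswith("D")]
    considered = dangers if dangers else list(active)
    genes = set()
    for signal in considered:
        genes |= SIGNAL_TO_GENES[signal]
    return genes


print(sorted(target_outputs({"D1", "D3"})))        # danger signals only
print(sorted(target_outputs({"D1", "D2", "F1"})))  # food is ignored
```

The two calls reproduce the two examples given in the text: D1 with D3 activates DG, BG and VG, and D1, D2 with F1 activates only DG and BG.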
For sustainable performance (in the sense of retained energy) of the network, the probabilities for the two groups of entries in the truth table are derived. They are used for proper selection of the entries from the table during the evolution and operation. The probabilities are related to the energy consumptions/acquisitions, which are based on biological reasoning and therefore have a direct influence on the resulting (evolved) networks. In the case of food (without danger), any active output from the inputs Fi, i = 1, ..., 4, increases the energy of the network by 10 if the appropriate gene to utilize this resource is activated, while any output activated due to some danger decreases the energy by 5 unless the appropriate danger response is activated, in which case the energy does not decrease.

Using the input-output table for the given task (Table 2), a training set was constructed as follows. With a probability of p1, only a single input was active at a time, and with a probability of 1 − p1 random inputs were activated, each appearing with a probability of 0.25. In the latter case, due to the binomial distribution, mostly two or three inputs were activated. When two or more inputs were activated at the same time, the target outputs were obtained by superposition of the individual outputs (OR function). In addition, when a danger was present, the food inputs were ignored (Table 2). We must find the value of p1. The conservation of energy can be written as

\[ P_{fo}\,\Delta E_{fo} + P_d\,\Delta E_d = 0, \tag{1} \]

where P_fo is the probability of food only (without danger), and P_d = 1 − P_fo is the probability of danger (food may also be present, but is ignored). ΔE_fo = 10 and ΔE_d = −5. P_fo may be written as

\[ P_{fo} = \frac{4}{9}\,p_1 + (1 - p_1) \sum_{k=1}^{4} P_9(k) \left(\frac{4}{9}\right)^{k}, \tag{2} \]

where p1 is the unknown probability of any single input and 1 − p1 is the probability of more inputs, each with a probability of 0.25. Since the probability that exactly k out of n inputs are active is given by the binomial distribution (with q = 1 − p),

\[ P_n(k) = \binom{n}{k}\, p^{k} q^{n-k}, \tag{3} \]

we have

\[ P_9(k) = \binom{9}{k}\, 0.25^{k}\, 0.75^{9-k}. \tag{4} \]

From Eq. 2 we find p1 = 0.6 and 1 − p1 = 0.4 for the two groups in Table 2, respectively. For different energy values these probabilities would be different.
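As a numerical check of the derivation, the following Python sketch solves Eqs. 1 to 4 for p1. It assumes that Eq. 2 has the form P_fo = (4/9) p1 + (1 − p1) Σ_{k=1..4} P9(k) (4/9)^k, with the binomial P9(k) of Eq. 4; the function and parameter names are hypothetical. Under this assumption the solution is about 0.57, which is consistent with the rounded value of 0.6 quoted in the text.

```python
from math import comb


def binomial(n, k, p):
    """P_n(k): probability of exactly k successes in n trials (Eq. 3)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def food_only_given_random(n_inputs=9, n_food=4, p=0.25):
    """Sum term of Eq. 2 (assumed form): sum_k P_n(k) * (n_food/n_inputs)^k."""
    return sum(binomial(n_inputs, k, p) * (n_food / n_inputs)**k
               for k in range(1, n_food + 1))


def solve_p1(gain=10.0, loss=-5.0, n_inputs=9, n_food=4, p=0.25):
    """Solve the energy balance (Eq. 1) for p1 using Eq. 2."""
    # Eq. 1 with P_d = 1 - P_fo: P_fo * gain + (1 - P_fo) * loss = 0.
    p_fo = -loss / (gain - loss)
    single = n_food / n_inputs  # food-only probability for a single input
    multi = food_only_given_random(n_inputs, n_food, p)
    # Eq. 2 is linear in p1: p_fo = single * p1 + multi * (1 - p1).
    return (p_fo - multi) / (single - multi)


p1 = solve_p1()
print(round(p1, 2), round(p1, 1))
```

Changing gain or loss shows how the two group probabilities shift with the energy values, as noted at the end of the section.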
3 Evolution and Experimental Results
A genetic algorithm was applied to search for a Boolean network that responds with correct outputs, given the inputs from the training set. It was assumed that each processing element (node) has a unit delay. Due to possible delays in
the network, it is normal for the output to stabilize after some time. For this reason, the output was considered after a delay corresponding to the number of nodes on the path from input to output (five in the case of the network with N = 15, six if N = 20 or N = 25; see Figures 2, 3 and 4, respectively). In addition, due to possible attractor loops of length greater than one, the output is checked as many times as there are global states in the maximal attractor cycle. The evaluation function of the genetic algorithm was simply the number of errors at the (binary) outputs, that is, the number of incorrect classifications.

Each node had k inputs, taken from global inputs or other nodes, and the logic value of a canalyzing function. If k = 3, the part of the genotype that relates to the node contains the 3 input values of the non-canalyzing combination and the corresponding output function value. For example, when k = 3, combination 0001 means that only for inputs 000 is the output 1 (for every other combination the output is 0). An individual chromosome consisted of this information for all the nodes.

In the genetic algorithm we applied roulette wheel parent selection. The crossover type was uniform with a probability of 0.2, while the mutation inverted individual bits with a probability of 0.01. Two parameters of the network were varied: the number of nodes (N) and the number of inputs to a node (k). For each combination of selected N and k, 10 repetitions gave 10 Boolean networks, each evolved for 20,000 generations of the genetic algorithm. Table 3 shows the output errors.

Table 3. Average absolute error (standard deviation) over 10 separately evolved Boolean networks after 20,000 generations of the genetic algorithm. For N = 15 the output is considered after a delay of 5. N is the number of nodes and k is the number of inputs to a node.

N     k = 2         k = 3         k = 4         k = 5
10    226 (111)     96.0 (42.3)   89.0 (45.0)   109 (49.0)
15    15.0 (0.0)    13.5 (4.7)    13.5 (4.7)    32.5 (14.0)
20    15.0 (0.0)    7.2 (7.6)     9.0 (7.7)     25.0 (7.1)
25    12.0 (6.3)    10.5 (7.2)    12.0 (6.3)    30.0 (0.0)
30    13.1 (4.8)    7.5 (7.9)     16.5 (4.7)    34.5 (12.6)
35    13.5 (4.7)    13.5 (4.7)    28.5 (4.7)    48.0 (11.1)

It is clear from Table 3 that the lowest error was obtained with the combination of k = 3 and N = 20. However, since we are interested in solving the task completely, it is interesting to know how many of these networks have zero error (Table 4). This can also be represented graphically (Fig. 1). The more successful networks have a greater probability of being ’implemented’ in the cells than others. The organisms with the evolved feature will be more frequent and will therefore ’survive’. Biologically, networks with a higher number of nodes (proteins or genes) and high interconnectivity (a) use a lot of time to solve a simple task and (b) involve the synthesis of unnecessary inner nodes (proteins), which represents an unnecessary energetic burden to the cell. Therefore, networks with high N and k are eliminated. On the other hand, networks with a smaller number of nodes are unable to fulfill the task at all and are also eliminated. For an organism to evolve, energy conservation and survival are important, yet it must still retain the ability to
Table 4. Number of fully successful Boolean networks (i.e. with zero error) out of 10

N     k = 2   k = 3   k = 4   k = 5
10    0       0       0       0
15    0       1       1       0
20    0       5       4       0
25    2       3       2       0
30    1       5       0       0
35    1       1       0       0

Fig. 1. Number of fully successful Boolean networks (spline interpolation). Axes: k, N, and the number of successful networks.
adapt to environmental changes. Therefore, the best networks solve the task with a minimum number of nodes, which are redundant so as to retain the ability to overcome errors. The smallest Boolean network in our simulations had N = 15 nodes: 7 outputs and 8 internal units (= 15 − 7), from C0 to C7 (Fig. 2). The logical equations that show the non-canalyzing input combinations of all 15 nodes are given in Table 5.

Table 5. Logical equations showing non-canalyzing input combinations of all 15 nodes

DG = C4 C1 C6      C0 = F3 F4 C7
BG = C4 C2 C2      C1 = D3 D4 D5
VG = C6 C1 C6      C2 = F3 C1 D1
MG = C4 C0 C1      C3 = C6 C6 F2
PG = C4 C0 C5      C4 = D1 D2 D5
CG = C0 C4 C3      C5 = C1 F3 F1
TG = C1 C2 C0      C6 = F2 F3 F2
                   C7 = F2 C5 C4
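The node encoding and the unit-delay dynamics described in this section can be sketched in Python as follows. The sketch assumes that a node genotype is a bit string holding the k bits of the non-canalyzing combination followed by one output bit, as in the example 0001 for k = 3. The function names and the two-node toy network are hypothetical; the toy network is not the evolved network of Table 5.

```python
def canalyzing_output(inputs, genotype):
    """Output of a canalyzing node.

    genotype: k bits of the non-canalyzing combination + 1 output bit,
    e.g. "0001" means output 1 only for inputs 000, otherwise 0.
    """
    combo, out = genotype[:-1], int(genotype[-1])
    observed = "".join(str(bit) for bit in inputs)
    return out if observed == combo else 1 - out


def step(state, receptors, nodes):
    """One synchronous update: every node has a unit delay."""
    signals = {**receptors, **state}
    return {name: canalyzing_output([signals[s] for s in sources], genotype)
            for name, (sources, genotype) in nodes.items()}


def run_to_attractor(state, receptors, nodes, max_steps=1000):
    """Return (transient length, attractor cycle length) for fixed receptors."""
    seen = {}
    for t in range(max_steps):
        key = tuple(sorted(state.items()))
        if key in seen:
            return seen[key], t - seen[key]
        seen[key] = t
        state = step(state, receptors, nodes)
    raise RuntimeError("no attractor found within max_steps")


# Hypothetical toy network: X is 0 only for inputs 000, Y is 0 only if X is 1.
toy = {"X": (("R1", "R2", "R1"), "0000"), "Y": (("X", "X", "X"), "1110")}
print(canalyzing_output((0, 0, 0), "0001"), canalyzing_output((0, 1, 0), "0001"))
print(run_to_attractor({"X": 0, "Y": 0}, {"R1": 1, "R2": 0}, toy))
```

A cycle length of 1 corresponds to an attractor state, and a larger value to an attractor cycle; the transient length plays the role of the delay after which the outputs are read.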
The Boolean network with N = 15, k = 3 is shown in Fig. 2. Internal nodes are structured into layers, in accordance with the cumulative delay from the input nodes. λ1 of the adjacency matrix C for the network in Fig. 2 is 0, which
Fig. 2. Boolean network with N = 15 nodes (8 internal nodes)
Fig. 3. Boolean network with N = 20 nodes (13 internal nodes)
Fig. 4. Boolean network with N = 25 nodes (18 internal nodes). Nodes C0 and C17 have no outputs.
means according to the Perron-Frobenius theorem that there are no closed paths in the network. We were also interested in the proportion of non-canalyzing input combinations during the processing of the network. The greater the proportion, the more restricted or specialized is the network. For this network, it was found to be 0.228 (standard deviation 0.033), i.e. on average 22.8% of all the combinations in the network were non-canalyzing. For comparison, initial random networks had 0.111 (standard deviation 0.054), i.e. 11.1%, non-canalyzing combinations. Successfully trained networks thus have roughly twice the proportion of non-canalyzing combinations found in networks with randomly connected canalyzing-function nodes.

We also compared the ratio between the number of different inputs and the number of different outputs (regression ratio) for trained and for random networks. This ratio was 8.1 for the trained networks and 0.93 for random networks, and hence the evolution of our networks increased regression. Higher regression means that the network was performing the classification task by mapping different input patterns into the same output pattern (label). This feature was comparable to the measurements of real subcellular structures [5], which means that it is biologically relevant. In summary, evolved networks show specialization and the ability to filter a larger number of stimuli (inputs) into one response (output), characteristics significant for biological systems.

Fig. 3 shows a larger network with N = 20, k = 3 and 13 internal units, from C0 to C12. It still contains no loops (λ1 = 0). The regression ratio is the same as before. Fig. 4 shows a network with N = 25 nodes, k = 3. This network contains many loops and λ1 = 1.84. The regression ratio is again the same.
4 Conclusion
In this paper, a Boolean model of a subcellular signal-transduction system has been presented. The network was evolved using a genetic algorithm. The example task was a hypothetical subcellular task involving response to food and danger, with energy-increasing and energy-decreasing inputs, respectively. We have shown that the number of non-canalyzing combinations in the evolved models is greatly increased, which indicates their specialization, and that the structures exhibit the classification feature typical of real subcellular networks. The evolved models therefore have a biological grounding. In future work, we would like to investigate the structures of the evolved models and compare them with experimentally determined subcellular networks, which, however, often cannot be completely isolated from the rest of the system. Our goal is to investigate the evolution of biological system networks and to find conditions within the evolving procedure and an evaluation (fitting) function that would assure a one-to-one mapping between the two structures.
References
1. Kauffman, S.A.: The Origins of Order. Oxford Univ. Press, Oxford (1993)
2. Aldana, M., Cluzel, P.: A natural class of robust networks. Proceedings of the National Academy of Sciences USA 100(15), 8710–8714 (2003)
3. Shmulevich, I., Dougherty, E.R., Zhang, W.: From Boolean to Probabilistic Boolean Networks as Models of Genetic Regulatory Networks. Proc. of IEEE 90(11), 1778–1792 (2002)
4. Shmulevich, I., Dougherty, E.R., Kim, S., Zhang, W.: Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics 18(2), 261–274 (2002)
5. Helikar, T., Konvalina, J., Heidel, J., Rogers, J.A.: Emergent decision-making in biological signal transduction networks. Proceedings of the National Academy of Sciences USA 105(6), 1913–1918 (2008)
6. SI.txt: http://mathbio.unomaha.edu/Database
7. Kauffman, S., Petersen, C., Samuelsson, B., Troein, C.: Random Boolean network models and the yeast transcriptional network. Proceedings of the National Academy of Sciences USA 100(14), 14796–14799 (2003)
8. Jain, S., Krishna, S.: Graph theory and the evolution of autocatalytic networks. In: Bornholdt, S., Schuster, H.G. (eds.) Handbook of Graphs and Networks. Wiley, Chichester (2002)
Similarity of Transcription Profiles for Genes in Gene Sets
Marko Toplak1, Tomaž Curk1, and Blaž Zupan1,2
1 Faculty of Computer and Information Sciences, University of Ljubljana, Slovenia
2 Dept. of Human and Mol. Genetics, Baylor College of Medicine, Houston, USA
Abstract. In gene set focused knowledge-based analysis we assume that genes from the same functional gene set have similar transcription profiles. We compared the distributions of similarity scores of gene transcription profiles between genes from the same gene sets and genes chosen at random. In line with previous research, our results show that transcription profiles of genes from the same gene sets are on average indeed more similar than random transcription profiles, although the differences are slight. We performed the experiments on 35 human cancer data sets, with KEGG pathways and BioGRID interactions as gene set sources. Pearson correlation coefficient and interaction gain were used as association measures.

Keywords: gene transcription profile, association, interaction gain, gene sets, KEGG, BioGRID.
1 Introduction
Much of the current data analysis in bioinformatics relies on existing knowledge on groupings of objects of interest. For instance, Gene Ontology [2] annotates genes with terms from the ontology and a group of interest may simply be a set of genes tagged with the same term. Among others, Kyoto Encyclopedia of Genes and Genomes (KEGG) [11] lists metabolic pathways and identifies genes that belong to the same pathway. BioGRID [17], on the other hand, provides information on protein-protein and genetic interactions. Genes encoding the proteins may be grouped together if their proteins interact. Such groups of objects, which are most commonly genes, proteins, chemicals, and metabolic products, enable various knowledge-based data analysis techniques [4]. Typical analyses of this kind are gene set enrichment [15] and classification based on gene set signatures [12,14]. Both are useful for gene transcription profile analysis, where the task is either to find whether a chosen gene group has a specific transcription response, or to predict responses for uncharacterized samples after transforming the data set to gene set space. The backing for such knowledge-based data analysis approaches is an assumption that genes belonging to the same group have similar transcription profiles. Genes encoding interacting proteins are more similar than random genes if the Pearson correlation coefficient is used to measure association [5,7,9]. It was shown
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part II, LNCS 6594, pp. 393–399, 2011. © Springer-Verlag Berlin Heidelberg 2011
on data of baker’s yeast that transcription profiles of genes encoding interacting proteins behave similarly and that genes encoding proteins for permanent complexes, such as the ribosome or proteasome, have particularly similar transcription profiles [9]. Other studies on a small number of data sets confirmed these findings while focusing on the coevolution of gene expression [7] or on a comparison between multiple species [5]. Another study reports no difference in similarity between genes in KEGG pathways and random genes [10]. A study on 60 data sets looked at patterns of correlating genes across data sets and compared the aggregated results with background knowledge from Gene Ontology, but did not evaluate individual data sets [13].

In this paper we present a computational analysis of the association between gene transcription profiles for genes in gene sets on a wide array of data sets. To measure gene profile association, we used the Pearson correlation coefficient and interaction gain [8], which is an information theory based supervised measure of association. Compared to related work, we performed the same test over a wide array of data sets and, additionally, used interaction gain to measure association.
2 Data
Gene expression data. Gene expression microarray data consists of mRNA levels for thousands of genes for each biological sample. We used 35 human cancer gene expression data sets from the Gene Expression Omnibus (GEO) [3] and the Broad Institute. All data sets have two diagnostic classes and include at least 20 instances, where each class was represented by at least 8 data instances. On average, the data sets include 44 instances (s.d. = 29.6). GDS data sets with the following ID numbers were used: 806, 971, 1059, 1062, 1209, 1210, 1220, 1221, 1282, 1329, 1375, 1390, 1562, 1618, 1650, 1667, 1714, 1887, 2113, 2201, 2250, 232, 2415, 2489, 2520, 2609, 2735, 2771, 2785 and 2842. The Broad Institute data sets are described on the supplemental page of our previous paper (http://www.ailab.si/supp/bi-cancer/projections/index.htm); we used leukemia, DLBCL, prostate, GSE412, and GSE3726 data sets. Where the array contained multiple probes for the same gene, they were averaged.

Gene sets. BioGRID [17] version 2.0.51 was used as a source of gene sets for protein-protein interactions. Pathways from KEGG [11] were obtained on 16 August 2010.
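The probe averaging step mentioned for the expression data can be sketched in a few lines of Python. The input format (pairs of a gene name and a list of per-sample expression values) and the gene names in the example are assumptions made for illustration; the paper does not describe the data structures that were actually used.

```python
from collections import defaultdict


def average_probes(probe_rows):
    """Average all probes that map to the same gene, sample by sample.

    probe_rows: iterable of (gene, [expression value per sample]) pairs.
    """
    by_gene = defaultdict(list)
    for gene, profile in probe_rows:
        by_gene[gene].append(profile)
    return {gene: [sum(column) / len(column) for column in zip(*profiles)]
            for gene, profiles in by_gene.items()}


rows = [("GENE_A", [1.0, 3.0]), ("GENE_A", [3.0, 5.0]), ("GENE_B", [2.0, 2.0])]
print(average_probes(rows))
```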
3 Methods
In this section we describe the measures used to evaluate transcription profile associations and the experimental methodology.

3.1 Transcription Profile Association Measures
Pearson correlation. The Pearson product-moment correlation coefficient [16] was used as a gene transcription profile association measure in many related
studies [5,7,9,13]. It determines the degree of linear relationship between two transcription profiles.

Interaction gain. The interaction gain, also known as bivariate synergy, estimates the information about the class that is gained by considering two transcription profiles together as compared to when they are considered separately [1,8]. Two similar gene transcription profiles will have a negative interaction gain as both carry approximately the same class information. The interaction gain of two transcription profiles X and Y with respect to class C is defined as

\[ \mathrm{IntGain}_C(X, Y) = \mathrm{Gain}_C(X \times Y) - \mathrm{Gain}_C(X) - \mathrm{Gain}_C(Y), \]

where Gain_C(X) denotes the information gain of profile X with respect to class C and X × Y is the Cartesian product of the transcription profiles. Information gain is defined as

\[ \mathrm{Gain}_C(X) = -\sum_{c \in D_C} p(c) \log_2 p(c) + \sum_{v \in D_X} p(v) \sum_{c \in D_C} p(c \mid v) \log_2 p(c \mid v), \]

where D_C and D_X denote the sets of class and attribute values. Gene expressions were discretized into three intervals with equal frequencies prior to the computation of interaction gain.

3.2 Experimental Methodology
For each data set we measured the degree of association between pairs of gene transcription profiles, where both genes were in the same gene set - either in protein-protein interaction (BioGRID) or a biological pathway (KEGG). The scores obtained were compared to scores between random gene pairs (in the same data set) with a two-sample Kolmogorov-Smirnov test as in [5,7]. A two-sample Kolmogorov-Smirnov test is a nonparametric test, which quantifies whether two samples of values come from the same underlying distribution. It measures the maximum distance between cumulative distributions of the samples’ values and takes sample sizes into account for p-value computation [16]. The Orange data mining environment [6] was used to perform the analysis.
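The interaction gain of Section 3.1 and the distance underlying the two-sample Kolmogorov-Smirnov test can be sketched with the Python standard library as follows. This is an illustrative sketch, not the Orange implementation used for the analysis: the function names are hypothetical, the equal-frequency discretization breaks ties by position, and only the test statistic (the maximum distance between the empirical cumulative distributions) is computed, without a p-value.

```python
from bisect import bisect_right
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(labels)
    return -sum((m / n) * log2(m / n) for m in Counter(labels).values())


def info_gain(x, c):
    """Gain_C(X) = H(C) - H(C | X) for discrete sequences x and c."""
    groups = {}
    for value, label in zip(x, c):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / len(c) * entropy(g) for g in groups.values())
    return entropy(c) - conditional


def interaction_gain(x, y, c):
    """IntGain_C(X, Y) = Gain_C(X x Y) - Gain_C(X) - Gain_C(Y)."""
    joint = list(zip(x, y))  # Cartesian product attribute
    return info_gain(joint, c) - info_gain(x, c) - info_gain(y, c)


def discretize(values, bins=3):
    """Equal-frequency discretization by rank (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * bins // len(values)
    return labels


def ks_statistic(a, b):
    """Maximum distance between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    return max(abs(bisect_right(a, t) / len(a) - bisect_right(b, t) / len(b))
               for t in a + b)


# Two identical profiles are redundant: the interaction gain is negative.
print(interaction_gain([0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1]))
# An XOR-like class is only visible from both profiles together: positive.
print(interaction_gain([0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]))
```

In an analysis like the one described here, ks_statistic would be applied to the association scores of gene pairs from the same gene set and to the scores of random gene pairs from the same data set.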
4 Results and Discussion
Table 1 presents two-sample Kolmogorov-Smirnov p-values for all data sets. For Pearson correlation, 32 data sets have p-values lower than 0.001 for KEGG pathways and 31 for BioGRID interactions, while for interaction gain the numbers are 20 and 14, respectively. Association score distributions for three data sets are shown in Figure 1. Gene transcription profiles of genes in gene sets are more correlated than random genes, which augments previous protein-protein interaction focused studies [5,7,9]. The differences in distributions of correlation coefficients are slight,
Fig. 1. Histograms showing degree of association between genes in KEGG pathways (yellow) and random genes (blue). Pearson correlations are shown in left column while interaction gains are shown in the right column.
Table 1. Two-sample Kolmogorov-Smirnov p-values for all data sets

             Pearson correlation               Interaction gain
Data set     BioGRID        KEGG               BioGRID        KEGG
DLBCL        6.7·10^-87     3.5·10^-183        8.4·10^-5      2.4·10^-86
GDS1059      2.0·10^-5      1.1·10^-4          1.3·10^-1      2.4·10^-2
GDS1062      1.5·10^-14     6.2·10^-11         4.7·10^-1      1.1·10^-15
GDS1209      1.5·10^-66     4.6·10^-86         2.6·10^-15     2.0·10^-1
GDS1210      7.4·10^-4      7.8·10^-32         4.5·10^-1      4.5·10^-2
GDS1220      2.7·10^-1      2.2·10^-2          1.1·10^-5      3.9·10^-16
GDS1221      5.5·10^-16     8.0·10^-36         6.8·10^-1      2.3·10^-2
GDS1282      4.4·10^-28     5.3·10^-19         6.4·10^-57     1.6·10^-5
GDS1329      3.7·10^-15     1.6·10^-7          9.2·10^-5      8.0·10^-27
GDS1375      2.9·10^-30     1.0·10^-46         2.1·10^-39     6.8·10^-113
GDS1390      8.6·10^-3      2.3·10^-6          2.9·10^-1      1.2·10^-4
GDS1562      9.8·10^-2      4.1·10^-3          7.3·10^-1      1.6·10^-1
GDS1618      1.9·10^-111    1.2·10^-277        4.8·10^-114    < 1.0·10^-318
GDS1650      4.9·10^-37     3.3·10^-138        2.2·10^-18     3.2·10^-49
GDS1667      1.9·10^-37     1.8·10^-66         3.6·10^-21     5.9·10^-3
GDS1714      5.9·10^-47     1.9·10^-119        4.4·10^-1      4.7·10^-1
GDS1887      1.0·10^-3      3.6·10^-7          4.2·10^-1      2.6·10^-6
GDS2113      1.6·10^-4      1.1·10^-7          3.3·10^-1      3.1·10^-6
GDS2201      5.6·10^-6      3.6·10^-4          1.2·10^-2      8.3·10^-2
GDS2250      8.3·10^-64     5.6·10^-90         6.3·10^-2      6.9·10^-3
GDS232       6.0·10^-2      5.1·10^-5          8.2·10^-2      7.6·10^-1
GDS2415      6.0·10^-41     < 1.0·10^-318      6.6·10^-1      3.3·10^-2
GDS2489      1.2·10^-2      7.7·10^-3          1.4·10^-1      6.5·10^-3
GDS2520      7.2·10^-6      1.1·10^-37         4.7·10^-2      3.9·10^-5
GDS2609      2.1·10^-72     < 1.0·10^-318      3.9·10^-87     < 1.0·10^-318
GDS2735      7.9·10^-40     4.8·10^-96         1.0·10^0       6.1·10^-2
GDS2771      9.8·10^-9      1.7·10^-32         2.5·10^-1      2.7·10^-13
GDS2785      3.4·10^-233    1.5·10^-48         6.3·10^-7      1.4·10^-7
GDS2842      7.2·10^-6      1.1·10^-4          7.8·10^-2      1.5·10^-1
GDS806       7.8·10^-50     6.0·10^-299        9.2·10^-2      5.3·10^-1
GDS971       1.9·10^-6      2.4·10^-18         6.3·10^-10     2.6·10^-30
GSE3726      2.6·10^-238    1.2·10^-318        5.1·10^-8      1.3·10^-32
GSE412       4.9·10^-163    2.7·10^-28         3.5·10^-3      1.5·10^-11
leukemia     9.3·10^-57     6.0·10^-72         1.7·10^-2      5.8·10^-9
prostata     8.9·10^-5      5.5·10^-6          4.0·10^-4      8.1·10^-9
as written in [9], albeit the p-values are very small due to a large number of scores in distribution samples. The absolute values of pairwise correlations between genes from KEGG were slightly higher than those from BioGRID, which is in contrast with [10], who did not find genes from KEGG pathways noticeably more correlated than genes chosen at random. Positive correlation between genes from evaluated gene set sources is more common than negative, which could be due to biological reasons [13]. The distribution of interaction gain scores for gene pairs from evaluated gene sets was shifted slightly towards negative scores, which means that such pairs of gene transcription profiles provide overlapping information about the class. On average, the p-values were higher than with Pearson correlation. This might be due to the small number of biological samples in data sets, because we need more samples to measure interaction gain accurately. While negative Pearson correlation is more common in tested gene groups than between random gene pairs, positive interaction gain is not. This was expected, because if it was more common in general, this would imply that we need completely different knowledge-based analysis techniques. We hypothesize that positive interaction gain is more common between different gene groups.
5 Conclusion
Our analysis confirms that gene transcription profiles of genes from gene sets from KEGG or BioGRID are more related than those defined arbitrarily, which is in line with previous research [5,7,9]. Our contributions to the topic are the high number of data sets evaluated and the use of another association metric. While we were able to consistently detect the differences between distributions of association scores between genes from the same gene sets and genes chosen at random, the differences were only slight. This may be one of the reasons for relatively disappointing results of classification methods based on gene set signatures, where higher prediction accuracies were expected [14].
References
1. Anastassiou, D.: Computational analysis of the synergy among multiple interacting genes. Mol. Syst. Biol. 3(83) (February 2007)
2. Ashburner, M., Ball, C., Blake, J., Botstein, D., Butler, H., Cherry, J., Davis, A., Dolinski, K., Dwight, S., Eppig, J., et al.: Gene ontology: tool for the unification of biology. Nature genetics 25(1), 25–29 (2000)
3. Barrett, T., Troup, D.B., Wilhite, S.E., Ledoux, P., Rudnev, D., Evangelista, C., Kim, I.F., Soboleva, A., Tomashevsky, M., Edgar, R.: NCBI GEO: mining tens of millions of expression profiles–database and tools update. Nucl. Acids Res. 35, D760–765 (2007)
4. Bellazzi, R., Zupan, B.: Towards knowledge-based gene expression data mining. Journal of Biomedical Informatics 40(6), 787–802 (2007)
5. Bhardwaj, N., Lu, H.: Correlation between gene expression profiles and protein–protein interactions within and across genomes. Bioinformatics 21(11), 2730 (2005)
6. Demšar, J., Zupan, B., Leban, G.: Orange: From experimental machine learning to interactive data mining, white paper (2004)
7. Fraser, H., Hirsh, A., Wall, D., Eisen, M.: Coevolution of gene expression among interacting proteins. Proceedings of the National Academy of Sciences of the United States of America 101(24), 9033 (2004)
8. Jakulin, A., Bratko, I.: Analyzing attribute dependencies. In: Lavrač, N., Gamberger, D., Todorovski, L., Blockeel, H. (eds.) PKDD 2003. LNCS (LNAI), vol. 2838, pp. 229–240. Springer, Heidelberg (2003)
9. Jansen, R., Greenbaum, D., Gerstein, M.: Relating whole-genome expression data with protein-protein interactions. Genome Research 12(1), 37 (2002)
10. Jelizarow, M., Guillemot, V., Tenenhaus, A., Strimmer, K., Boulesteix, A.: Overoptimism in bioinformatics: an illustration. Bioinformatics 26(16), 1990 (2010)
11. Kanehisa, M., Goto, S., Furumichi, M., Tanabe, M., Hirakawa, M.: KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Research 38(Database issue), D355 (2010)
12. Lee, E., Chuang, H.Y., Kim, J.W., et al.: Inferring pathway activity toward precise disease classification. PLoS Comput. Biol. 4(11), e1000217 (2008)
13. Lee, H., Hsu, A., Sajdak, J., Qin, J., Pavlidis, P.: Coexpression analysis of human genes across many microarray data sets. Genome Research 14(6), 1085 (2004)
14. Mramor, M., Toplak, M., Leban, G., Curk, T., Demšar, J., Zupan, B.: On utility of gene set signatures in gene expression-based cancer class prediction. In: Machine Learning in Systems Biology, p. 65 (2009)
15. Nam, D., Kim, S.Y.: Gene-set approach for expression pattern analysis. Brief Bioinform 9(3), 189–197 (2008)
16. Sheskin, D.: Handbook of parametric and nonparametric statistical procedures. CRC Pr I Llc (2004)
17. Stark, C., Breitkreutz, B., Reguly, T., Boucher, L., Breitkreutz, A., Tyers, M.: BioGRID: a general repository for interaction datasets. Nucleic Acids Research 34(suppl. 1), 535 (2006)
Author Index
´ Abrah´ am, Erika I-190 Abundez B., Itzel M. I-51 Alfaro, Rodrigo II-61 Allende, H´ector II-61, II-363 Antunes, M´ ario II-342 Avbelj, Monika II-383 ¨ am¨ Ayr¨ o, Sami I-361 Babi´c, Zdenka II-51 Bakirov, Murat B. I-150 Barszcz, Tomasz II-225 Baumann, Martin R.K. I-140 Beigi, Akram I-391, II-98, II-245 Beliczynski, Bartlomiej I-130 Bielecka, Marzena II-147, II-225 Bielecki, Andrzej II-147, II-225 Bratko, Ivan I-1 Buesser, Pierre II-167 Buli´c, Patricio I-158 Campos, Jo˜ ao I-300 C´ ardenas-Montes, Miguel I-310, I-371 Carvalho, Rui I-300 Constantinopoulos, Constantinos I-169 Correia, Manuel II-342 Costa, Ernesto I-300 Cristianini, Nello II-196, II-322 Cruz R., Rafael I-51 Curk, Tomaˇz II-393 Daolio, Fabio II-167 Daryabari, Mojtaba I-381 Datadien, Arvind I-90 de Almeida, Ana II-31, II-295 de Azevedo da Rocha, Ricardo Luis II-127, II-275 De Bie, Tijl II-196 Deng, Jianming I-320 Ding, Xiao-Feng II-118 Dobnikar, Andrej II-11, II-383 Dokur, Z¨ umray II-81 Donnarumma, Francesco I-250 Duch, Wlodzislaw II-89
Eiben, A.E. II-186 El-Dahb, Mona A. I-400 Ferariu, Lavinia I-290 Figueiredo, Marisa B. II-31 Filipiˇc, Bogdan I-420 Flaounas, Ilias II-322 Frolov, Alexander A. I-100 F´ uster-Sabater, Amparo II-285 Fyson, Nick II-196 Gasca A., Eduardo I-51 G´ ati, Krist´ of II-156 G´ omez-Iglesias, Antonio I-310, I-371 Gong, Fang II-118 Govekar, Edvard I-270 Grochowski, Marek II-89 Groˇselj, Ciril I-80 Haselager, Pim I-90 Hashemi, Ali B. I-340 Heinrich, Enrico II-373 Helmi, Hoda I-391 Hensinger, Elena II-322 Horv´ ath, G´ abor II-156 Husek, Dusan I-100 Ilc, Nejc II-11 ˙ scan, Zafer II-81 I¸ J¨ arvelin, Kalervo I-260 Jerala, Roman II-383 Joost, Ralf II-373 Juhola, Martti I-260 Kaczorek, Tadeusz II-305 Kainen, Paul C. I-12 K¨ arkk¨ ainen, Tommi I-240 Karshenas, Hossein II-98 Kester, Leon J.H.M. II-186 Kiselev, Mikhail I-120 Kocijan, Juˇs I-420, II-312 Kolodziej, Marcin I-280 Kononenko, Igor I-22, I-169, II-21 Korkosz, Mariusz II-147
Köster, Frank I-140
Kotulski, Leszek II-254
Kovordányi, Rita I-200
Kruglov, Igor A. I-150
Kukar, Matjaž I-80
Kůrková, Věra I-12
Laurikkala, Jorma I-260
Ławryńczuk, Maciej I-31, I-230
Lemmer, Karsten I-140
Leonardis, Aleš II-235
Lethaus, Firas I-140
Likas, Aristidis I-169
Lipiński, Piotr I-330
Lodi, Stefano II-363
Lopes, Noel II-41, II-108
Lotrič, Uroš I-158
Loyola, Diego I-70
Lüder, Marian II-373
Luostarinen, Kari I-240
Majkowski, Andrzej I-280
Marusak, Piotr M. II-177, II-215
Matos, Luís I-410
Meybodi, Mohammad Reza I-340
Minaei, Behrouz I-381, I-391, II-98
Mishulina, Olga A. I-150
Momić, Snježana II-51
Montone, Guglielmo I-250
Morin, Gabriel I-190
Mozayani, Nasser II-245
Muhonen, Jukka I-240
Ñanculef, Ricardo II-363
Nechval, Konstantin II-136
Nechval, Nicholas II-136
Neme, Antonio I-210
Neruda, Roman I-180
Neto, João Pedro I-61
Neumann, Heiko I-110
Ni, Qingjian I-320
Nido, Antonio I-210
Nieminen, Paavo I-240
Noroozi, Vahid I-340
Novo, Jorge I-350
Nunes, Jorge I-410
Olszewski, Dominik II-1, II-71
Orchel, Marcin II-332, II-353
Ortman, Robert L. I-220
Osowski, Stanislaw I-41
Özkaya, Özen II-81
Parsa, Saeed I-381
Parvin, Hamid I-381, I-391, II-98, II-245
Patelli, Alina I-290
Pazo-Robles, Maria Eugenia II-285
Penedo, Manuel G. I-350
Petelin, Dejan I-420, II-312
Pevec, Darko I-22
Polyakov, Pavel Yu. I-100
Potočnik, Primož I-270
Potter, Steve M. I-220
Prevete, Roberto I-250
Purgailis, Maris II-136
Quintas, Ricardo II-41
Rak, Remigiusz J. I-280
Rendón L., Eréndira I-51
Ribeiro, Bernardete II-31, II-41, II-108, II-342
Richter, Pascal I-190
Ringbauer, Stefan I-110
Risojević, Vladimir II-51
Robnik-Šikonja, Marko I-169
Rodríguez-Vázquez, Juan José I-310
Rozevskis, Uldis II-136
Saarikoski, Jyri I-260
Saifullah, Mohammad I-200
Sait, Sadiq M. I-400
Saldaña T., Sergio I-51
Salomon, Ralf II-373
Sánchez G., José S. I-51
Santos, José I-350
Sartori, Claudio II-363
Schuessler, Olena I-70
Schut, Martijn C. II-186
Sędziwy, Adam II-254
Shi, Ai-Ye II-118
Shibata, Danilo Picagli II-127
Shiraishi, Yoichi I-400
Siddiqi, Umair F. I-400
Silva, Catarina II-342
Silva, Fernando I-61
Silva Filho, Reginaldo Inojosa II-275
Simões, Anabela I-300
Siwek, Krzysztof I-41
Skočaj, Danijel II-235
Skomorowski, Marek II-147
Sprinkhuizen-Kuyper, Ida I-90
Stolarek, Jan I-330
Szupiluk, Ryszard II-206
Šter, Branko II-383
Štrumbelj, Erik I-22, I-169, II-21
Tirronen, Ville I-361
Tomassini, Marco II-167
Toplak, Marko II-393
Trigo, António I-410
Tschechne, Stephan I-110
Unold, Olgierd II-265
Valdovinos R., Rosa M. I-51
van Willigen, Willem H. II-186
Vega-Rodríguez, Miguel A. I-310, I-371
Velásquez G., Valentín I-51
Venayagamoorthy, Kumar I-220
Vidnerová, Petra I-180
Vrečko, Alen II-235
Wang, Hui-Bin II-118
Weber, Matthieu I-361
Wojciechowski, Wadim II-147
Wójcik, Mateusz II-225
Wojewnik, Piotr II-206
Xu, Li-Zhong II-118
Zabkowski, Tomasz II-206
Zhang, Xue-Wu II-118
Zieliński, Bartosz II-147
Zupan, Blaž II-393